前言#

不开心，不想说话……让我们直接进入这段有趣的学习之旅吧，试试两个月，甚至更短的时间内解决掉这门又臭又长的课程。

Bits, Bytes, and Integers#

Representation information as bits#

Everything is bits#

Each bit is 0 or 1#

By encoding/interpreting sets of bits in various ways#

Computers determine what to do (instructions)
… and represent and manipulate numbers, sets, strings, etc…

Why bits ? Electronic Implementation#

Easy to store with bitsable elements
Reliably transmitted on noisy and inaccurate wires
Easy to indicate low level/high level

Bit-level manipulations#

Representing & Manipulating Sets#

Representing Sets#

We can use bits to determine whether an element belongs to a set. Say each bit have index, which corresponds to a number from zero. If a bit is set to 1, it means the corresponding index (i.e., the element) is in the set. Otherwise, its not.

In a more mathematical representation way:

Width $w$ bit vector represents subsets subsets of $\{0,\ ...,\ w-1\}$
$a_{j} =1$ if $j\in A$

Sets Operations#

& is related to Intersection
| is related to Union
~ is related to Complement
^ is related to Symmetric difference

Shift Operations#

Left Shift: `x << y`#

Shift bit-vector x left y positions
- Throw away extra bits on left
Fill with 0’s on right

Right Shift: `x >> y`#

Shift bit-vector x right y positions
- Throw away extra bits on right
Logical shift
- Fill with 0’s on left
Arithmetic shift
- Replicate most significant bit on left

Undefined Behaviour#

Shift amount $<$ 0
Shift amount $\geqslant$ word size
Left shift a signed value

Integers#

Representation: unsigned and signed#

Unsigned#

$\displaystyle B2U( X) =\sum _{i=0}^{w-1} x_{i} \cdot 2^{i}$

Two’s Complement#

$\displaystyle B2T( X) =-x_{w-1} \cdot 2^{w-1} +\sum _{i=0}^{w-2} x_{i} \cdot 2^{i}$

Invert mappings#

$U2B( x) =B2U^{-1}( x)$
$T2B( x) =B2T^{-1}( x)$

Numeric Ranges#

Unsigned Values
- $\displaystyle UMin\ =\ 0$
- $\displaystyle UMax\ =\ 2^{w} -1$
Two’s Complement Values
- $\displaystyle TMin = -2^{w-1}$
- $\displaystyle TMax\ =\ 2^{w-1} -1$

TIP
In C, these ranges are declared in limits.h. E.g., ULONG_MAX, LONG_MAX, LONG_MIN. Values are platform specific.

Observations#

$|TMin|=TMax+1$ $∣ TM in ∣ = TM a x + 1$
- Asymmetric range (Every positive value can be represented as a negative value, but $TMin$ cannot be represented as a positive value)
$UMax\ =\ 2\cdot TMax+1$

Difference between Unsigned & Signed Numeric Values#

The difference between Unsigned & Signed Numeric Values is $2^{w}$ .

For example, if you want convert a unsigned numeric value to its signed form, just use this value minus $2^{w}$ , or if you want convert a signed numeric value to its unsigned form, you plus $2^{w}$ .

Conversion, casting#

Mappings between unsigned and two’s complement numbers keep bit representations and reinterpret.

For example, casting a signed value to its unsigned form, the most significant bit from large negative weight becomes to large positive weight and vice versa.

Constants#

By default are considered to be signed integers
Unsigned if have “U” as suffix

Casting#

Explicit casting between signed & unsigned same as $U2T$ and $T2U$
Implicit casting also occurs via assignments and procedure call (assignments will casting to lhs’s type)

Expression Evaluation#

If there is a mix of unsigned and signed in single expression, signed values implicitly cast to unsigned
Including comparison operations <, >, ==, <=, >

Constants 1	Constants 2	Relation	Evaluation
`0`	`0U`	==	unsigned
`-1`	`0`	<	signed
`-1`	`0U`	>	unsigned
`2147483647`	`-2147483647-1`	>	signed
`2147483647U`	`-2147483647-1`	<	unsigned
`-1`	`-2`	>	signed
`(unsigned)-1`	`-2`	>	unsigned
`2147483647`	`2147483648U`	<	unsigned
`2147483647`	`(int)2147483648U`	>	signed

Expanding, truncating#

Sign Extension#

Make $k$ copies of sign bit
$\displaystyle X\prime =X_{w-1} ,...,X_{w-1} ,X_{w-1} ,X_{w-2} ,...,X_{0}$

WARNING
Converting from smaller to larger integer data type. C automatically performs sign extension.

Expanding (e.g., short int to int)#

Unsigned: zeros added
Signed: sign extension
Both yield expected result

Truncating (e.g., unsigned to unsigned short)#

Unsigned/signed: bits are truncated
Result reinterpreted
Unsigned: mod operation
Signed: similar to mod
For small numbers yields expected behavior

Addition, negation, multiplication, shifting#

Unsigned Addition#

Standard Addition Function
- Ignore carry output
Implements Modular Arithmetic
- $\displaystyle UAdd_{w}( u,v) =( u+v)\bmod 2^{w}$

Visualizing (Mathematical) Integer Addition#

Integer Addition
- 4-bit integers $u,v$
- Compute true sum $Add_{4}( u,v)$
- Values increase linearly with $u$ and $v$
- Forms planar surface

Visualizing Unsigned Addition#

Wraps Around
- If true sum $\geqslant 2^{w}$
- At most once

Two’s Complement Addition#

$TAdd$ and $UAdd$ have Identical Bit-level Behaviour

Signed vs. Unsigned Addition in C:

1
int s, t, u, v;
2

3
s = (int)((unsigned)u + (unsigned)v);
4
t = u + v;
5

6
// will give s == t

TAdd Overflow#

Functionality
- True sum requires $w+1$ bits
- Drop off MSB
- Treat remaining bits as two’s complement integer

Visualizing Two’s Complement Addition#

Values
- 4-bit two’s comp.
- Range from $-8$ to $+7$
Wraps Around
- If sum $\geqslant 2^{w-1}$ $⩾ 2^{w - 1}$
  - Becomes negative
  - At most once
- If sum $< -2^{w-1}$ $< - 2^{w - 1}$
  - Becomes positive
  - At most once

Multiplication#

Goal: Computing Product of $w$ $w$ -bit numbers $x,y$ $x, y$
- Either signed or unsigned
But, exact results can be bigger than $w$ $w$ bits
- Unsigned: up to $2w$ $2 w$ bits
  - Result range: $0\leqslant x\cdot y\leqslant \left( 2^{w} -1\right)^{2} =2^{2w} -2^{w+1} +1$
- Two’s complement min (negative): Up to $2w-1$ $2 w - 1$ bits
  - Result range: $x\cdot y\geqslant \left( -2^{w-1}\right) \cdot \left( 2^{w-1} -1\right) =-2^{2w-2} +2^{w-1}$
- Two’s complement max (positive): Up to $2w$ $2 w$ bits, but only for $( TMin_{w})^{2}$ $(TM i n_{w})^{2}$
  - Result range: $x\cdot y\leqslant \left( -2^{w-1}\right)^{2} =2^{2w-2}$
So, maintaining exact results…
- would need to keep expanding word size with each product computed (exhaust memory faster)
- is done in software, if needed
  - e.g., by “arbitrary precision” arithmetic packages

Unsigned Multiplication in C#

Standard Multiplication Function
- Ignore high order $w$ bits
Implements Modular Arithmetic
- $\displaystyle UMult_{w}( u,v) =( u\cdot v)\bmod 2^{w}$

Signed Multiplication in C#

Standard Multiplication Function
- Ignore high order $w$ bits
- Some of which are different for signed vs. unsigned multiplication
- Lower bits are the same

Power-of-2 Multiply with Shift#

Operation
- u << k gives $u\cdot 2^{k}$ (basically increases each bits weight)
- Both signed and unsigned

Unsigned Power-of-2 Divide with Shift#

Quotient of Unsigned by Power of 2
- u >> k gives $\left\lfloor u/2^{k}\right\rfloor$
- Uses logical shift

Signed Power-of-2 Divide with Shift#

Quotient of Signed by Power of 2
- Uses arithmetic shift
- Want $\lceil x/2^{k} \rceil$ (Round toward 0)
- Compute as $\left\lfloor \left( x+2^{k} -1\right) /2^{k}\right\rfloor$ $⌊ (x + 2^{k} - 1) / 2^{k} ⌋$
  - In C: (x + (1 << k) - 1) >> k
  - Biases dividend toward 0

Case 1: No rounding#

Case 2: Rounding#

Handy tricks#

If you want convert a value $u$ to its negative form by hand or in mind, either unsigned or signed.

Just do: $\sim u+1$

When Should I Use Unsigned ?#

CAUTION
Don’t use without understanding implications!

Do use when performing modular arithmetic
- Multiplication arithmetic
Do use when using bits to represent sets
- Logical right shift, no sign extension

Representation in memory, pointers, strings#

Byte-Oriented Memory Organization#

Programs refer to data by address
- Conceptually, envision it as a very large array of bytes
  - In reality, it’s not, but can think of it that way
- An address is like an index into that array
  - and, a pointer variable stores an address
System provides private address spaces to each “process”
- So, a program can clobber its own data, but not that of others

Machine Words#

Any given computer has a “Word Size”
- Nominal size of integer-valued data
  - and of addresses
- Until recently, most machines used 32 bits (4 bytes) as word size
  - Limits addresses to 4 GB ( $2^{32}$ bytes)
- Increasingly, machines have 64‐bit word size
  - Potentially, could have 18 EB (exabytes) of addressable memory
  - That’s $18.4\times 10^{18}$ bytes
- Machines still support multiple data formats
  - Fractions or multiples of word size
  - Always integral number of bytes

Word‐Oriented Memory Organization#

Addresses Specify Byte Locations
- Address of first byte in word
- Addresses of successive words differ by 4 (32‐bit) or 8 (64‐bit)

Byte Ordering#

Big Endian: Sun, PPC Mac, Internet
- Least significant byte has highest address
Little Endian: x86, ARM processors running Android, iOS, and Windows
- Least significant byte has lowest address

Representation Strings#

In C, either little endian or big endian machine, strings in memory represented in the same way, because a string is essentially an array of characters ends with \x00, each character is one byte encoded in ASCII format and single byte (character) do not obey the byte ordering rule.

Floating Point#

Background: Fractional binary numbers#

Representing#

Bits to right of “binary point” represent fractional powers of 2
Represents rational number: $\displaystyle \sum _{k=-j}^{i} b_{k} \cdot 2^{k}$

Value	Representation
$\displaystyle5\frac{3}{4}$	$101.11_{2}$
$\displaystyle2\frac{7}{8}$	$10.111_{2}$
$\displaystyle1\frac{7}{16}$	$1.0111_{2}$

Observations#

Divide by 2 by shifting right (unsigned)
Multiply by 2 by shifting left
Numbers of form $0.111111\dots_{2}$ $0.111111 \dots_{2}$ are just below $1.0$ $1.0$
- $\displaystyle \frac{1}{2} +\frac{1}{4} +\frac{1}{8} +\dots+\frac{1}{2^{i}} +\dots\rightarrow 1.0$
- Use notation $1.0−\epsilon$ ( $\epsilon$ depends on how many bits you have to the right of the binary point. If it gets smaller the more, the more of those bits you have there, and it gets closer to $1$ )

Limitation 1#

Can only exactly represent numbers of the form $\displaystyle\frac{x}{2^{k}}$ $\frac{x}{2 ^{k}}$
- Other rational numbers have repeating bit representations, but cause computer system can only hold a finite number of bits, so $0.1+0.2\neq 0.3$

Value	Representation
$\displaystyle\frac{1}{3}$	$0.0101010101[ 01] \dots_{2}$
$\displaystyle\frac{1}{5}$	$0.001100110011[ 0011] \dots_{2}$
$\displaystyle\frac{1}{10}$	$0.0001100110011[ 0011] \dots_{2}$

Limitation 2#

Just one setting of binary point within the $w$ $w$ bits
- Limited range of numbers (very small values ? very large ? we have to move the binary point to represent sort of wide as wide a range as possible with as much precision given the number of bits)

Definition: IEEE Floating Point Standard#

IEEE Standard 754#

Established in 1985 as uniform standard for floating point arithmetic. Before that, many idiosyncratic formats.

Although it provided nice standards for rounding, overflow, underflow… It is hard to make fast in hardware (Numerical analysts predominated over hardware designers in defining standard)

Floating Point Representation#

Numerical form: $( -1)^{s} \cdot M\cdot 2^{E}$

Sign bit $s$ determines whether number is negative or positive
Significand $M$ (Mantissa) normally a fractional value in range $[ 1.0,2.0)$
Exponent $E$ weights value by power of two

Encoding#

MSB s is sign bit $s$
exp field encodes $E$ (but is not equal to $E$ )
frac field encodes $M$ (but is not equal to $M$ )

Normalized Values#

Condition: $exp\neq 000\dotsc 0$ and $exp\neq 111\dotsc 1$
Exponent coded as a biased value: $E=exp-bias$ $E = e x p - bia s$
- $exp$ is unsigned value of exp field
- $bias=2^{k-1} -1$ $bia s = 2^{k - 1} - 1$ , where $k$ $k$ is number of exponent bits
  - Single precision: $127$ ( $exp\in [ 1,254]$ , $E\in [ -126,127]$ )
  - Double precision: $1023$ ( $exp\in [ 1,2046]$ , $E\in [ -1022,1023]$ )
Significand coded with implied leading $1$ $1$ : $M=1.xxx\dotsc x_{2}$ $M = 1. xxx \dots x_{2}$
- $xxx\dotsc x$ is bits of frac field
- Minimum when $frac=000\dotsc 0\ ( M=1.0)$
- Maximum when $frac=111\dotsc 1\ ( M=2.0-\epsilon )$
- Get extra leading bit for “free”

Normalized Encoding Example#

Value: float F = 15213.0;
- $15213_{10} =11101101101101_{2} =1.1101101101101\times 2^{13}$
Significand
- $M=( 1.) 1101101101101_{2}$
- $frac=11011011011010000000000_{2}$
Exponent
- $E=13$
- $bias=127$
- $exp=140=10001100_{2}$

So the result would be:

1
# manually calculate
2
((-1)**0)*(1+1/2+1/4+1/16+1/32+1/128+1/256+1/1024+1/2048+1/8192)*(2**13) == 15213.0
3

4
# by struct package
5
import struct
6

7
bits = 0b01000110011011011011010000000000
8
f = struct.unpack("f", struct.pack("I", bits))[0]
9

10
print(f)

Denormalized Values#

Condition: $exp=000\dotsc 0$
Exponent value: $E=1-bias\ ( instead\ of\ E=0-bias)$
Significand coded with implied leading $0$ $0$ : $M=0.xxx\dotsc x_{2}$ $M = 0. xxx \dots x_{2}$
- $xxx\dotsc x$ is bits of frac field
Cases
- $exp=000\dotsc 0,frac=000\dotsc 0$ $exp=000…0,frac=000…0$
  - Represents zero value
  - Note distinct values: $+0$ and $-0$
- $exp=000\dotsc 0,farc\neq 000\dotsc 0$ $e x p = 000 \dots 0, f a rc \neq = 000 \dots 0$
  - Numbers closet to $0.0$
  - Equispaced

Special Values#

Condition: $exp=111\dotsc 1$
Case: $exp=111\dotsc 1,frac=000\dotsc 0$
- Represents value $\infty$
- Operation that overflows
- Both positive and negative
- E.g., $1.0/0.0=-1.0/-0.0=+\infty ,1.0/-0.0=-\infty$
Case: $exp=111\dotsc 1,frac\neq 000\dotsc 0$
- Not-a-Number (NaN)
- Represents case when no numeric value can be determined
- E.g., $\sqrt{-1} ,\infty -\infty ,\infty \cdot 0$

Visualization: Floating Point Encodings#

Example and properties#

Tiny Floating Point Example#

Think about this 8-bit Floating Point Representation below, it obeying the same general form as IEEE Format:

Dynamic Range (Positive Only)#

Distribution of Values#

Still think about our tiny example, notice how the distribution gets denser toward zero.

Here is a scaled close-up view:

Special Properties of the IEEE Encoding#

Floating Point Zero Same as Integer Zero
- All bits = 0
Can (Almost) Use Unsigned Integer Comparison
- Must first compare sign bits
- Must consider $-0=0$
- NaNs problematic
  - What should comparison yield ?
- Otherwise OK
  - Denormalized vs. Normalized
  - Normalized vs. Infinity

Rounding, addition, multiplication#

Floating Point Operations: Basic Idea#

$x+_{f} y=round( x+y)$

$x\times _{f} y=round( x\times y)$

Basic idea#

First compute exact result
Make it fit into desired precision
- Possibly overflow if exponent too large
- Possibly round to fit into frac

Rounding#

Rounding Modes (illustrate with $ rounding)

	$1.40	$1.60	$1.50	$2.50	-$1.50
Towards Zero	$1	$1	$1	$2	-$1
Round Down ( $-\infty$ )	$1	$1	$1	$2	-$2
Round Up ( $+\infty$ )	$2	$2	$2	$3	-$1
Nearest Even (default)	$1	$2	$2	$2	-$2

Round to Even
It means, if you have a value that’s less than half way then you round down, if more than half way, round up. When you have something that’s exactly half way, then what you do is round towards the nearest even number.

Closer Look at Round-To-Even#

Default Rounding Mode#

Hard to get any other kind without dropping into assembly
All others are statistically biased
- Sum of set of positive numbers will consistently be over or underestimated

Applying to Other Decimal Places / Bit Positions#

When exactly half way between two possible values
- Round so that least significant digit is even

E.g., round to nearest hundredth:

Value	Rounded
7.8949999	7.89	Less than half way
7.8950001	7.90	Greater than half way
7.8950000	7.90	Half way (round up)
7.8850000	7.88	Half way (rond down)

Rounding Binary Numbers#

Binary Fractional Numbers

“Even” when least significant bit is 0
“Half way” when bits to right of rounding position is $100\dotsc _{2}$

E.g., round to nearest $\displaystyle \frac{1}{4}$ (2 bits right of binary point)

Value	Binary	Rounded	Action	Rounded Value
$\displaystyle 2\frac{3}{32}$	$10.00011_{2}$	$10.00_{2}$	Less than half way (round down)	$2$
$\displaystyle 2\frac{3}{16}$	$10.00110_{2}$	$10.01_{2}$	Greater than half way (round up)	$\displaystyle 2\frac{1}{4}$
$\displaystyle 2\frac{7}{8}$	$10.11100_{2}$	$11.00_{2}$	Half way (round up)	$3$
$\displaystyle 2\frac{5}{8}$	$10.10100_{2}$	$10.10_{2}$	Half way (round down)	$\displaystyle 2\frac{1}{2}$

Floating Point Multiplication#

$( -1)^{s_{1}} \cdot M_{1} \cdot 2^{E_{1}} \cdot ( -1)^{s_{2}} \cdot M_{2} \cdot 2^{E_{2}} =( -1)^{s} \cdot M \cdot 2^{E}$

Sign $s$ is $s_{1}$ ^ $s_{2}$
Significand $M$ is $M_{1} \cdot M_{2}$
Exponent $E$ is $E_{1}+E_{2}$

Fixing#

If $M\geqslant 2$ , shift $M$ right, increment $E$
If $E$ out of range, overflow
Round $M$ to fit frac precision

Biggest chore in implementation is multiplying significands

Floating Point Addition#

$( -1)^{s_{1}} \cdot M_{1} \cdot 2^{E_{1}} +( -1)^{s_{2}} \cdot M_{2} \cdot 2^{E_{2}} =( -1)^{s} \cdot M \cdot 2^{E}$ (Assume $E_{1}>E_{2}$ )

Sign $s$ $s$ , significand $M$ $M$
- Result of signed align & add
Exponent $E$ $E$
- $E_{1}$

Fixing#

If $M\geqslant 2$ , shift $M$ right, increment $E$
If $M<1$ , shift $M$ left $k$ positions, decrement $E$ by $k$
Overflow if $E$ out of range
Round $M$ to fit frac precision

Mathematical Properties of Floating Point Add#

Compare to those of Abelian Group
- Closed under addition ? (Yes)
  - But may generate infinity or NaN
- Commutative ? (Yes)
- Associative ? (No)
  - Overflow and inexactness of rounding
  - (3.14+1e10)-1e10 = 0, 3.14+(1e10-1e10) = 3.14
- $0$ is additive identity ? (Yes)
- Every element has additive inverse ? (Almost)
  - Except for infinities & NaNs
Monotonicity
- $a\geqslant b\Rightarrow a+c\geqslant b+c$ $a ⩾ b \Rightarrow a + c ⩾ b + c$ ? (Almost)
  - Except for infinities & NaNs

Mathematical Properties of Floating Point Mult#

Compare to Commutative Ring
- Closed under multiplication ? (Yes)
  - But may generate infinity or NaN
- Multiplication Commutative ? (Yes)
- Multiplication is Associative ? (No)
  - Possibility of overflow, inexactness of rounding
  - (1e20*1e20)*1e-20 = inf, 1e20*(1e20*1e-20) = 1e20
- $1$ is multiplicative identity ? (Yes)
- Multiplication distributes over addition ? (No)
  - Possibility of overflow, inexactness of rounding
  - 1e20*(1e20-1e20) = 0.0, 1e20*1e20-1e20*1e20 = NaN
Monotonicity
- $a\geqslant b\ \&\ c\geqslant 0\Rightarrow a\cdot c\geqslant b\cdot c$ $a ⩾ b & c ⩾ 0 \Rightarrow a \cdot c ⩾ b \cdot c$ ? (Almost)
  - Except for infinities & NaNs

Floating Point in C#

Conversions / Casting#

CAUTION
Casting between int, float, and double changes bit representation!

double/float to int
- Truncates fractional part
- Like rounding toward zero
- Not defined when out of range or $NaN$ : Generally sets to $TMin$
int to double
- Exact conversion, as long as int has $\leqslant 53$ bit word size
int to float
- Will round according to rounding mode

Machine-Level Programming#

History of Intel processors and architectures#

Nobody (at least me), can keep these long history in mind! So I’ll just skipping this part to save life~

C, Assembly, Machine Code#

Definitions#

Architecture: (also ISA: instruction set architecture) The parts of a processor design that one needs to understand or write assembly/machine code
- Examples: instruction set specification, registers
Microarchitecture: Implementation of the architecture
- Examples: cache sizes and core frequency

Assembly / Machine Code View#

Turning C into Object Code#

Code in files p1.c p2.c
Compile with command: gcc -Og p1.c p2.c -o p
- Use basic optimizations (-Og) [New to recent versions of GCC]
- Put resulting binary in file p

Assembly Characteristics: Data Types#

“Integer” data of 1, 2, 4 or 8 bytes
- Data values
- Address (untyped pointers)
Floating Point data of 4, 8 or 10 bytes
Code: Byte sequences encoding series of instructions
No aggregate types such as arrays or structures
- Just contiguously allocated bytes in memory

Assembly Characteristics: Operations#

Perform arithmetic function on register or memory data
Transfer data between memory and register
- Load data from memory into register
- Store register data into memory
Transfer control
- Conditional branches
- Unconditional jumps to/from procedures

Object Code#

Assembler
- Translates .s into .o
- Binary encoding of each instruction
- Nearly-complete image of executable code
- Missing linkages between code in different files
Linker
- Resolves references between files
- Combines with static run-time libraries
  - E.g., code for malloc, printf
- Some libraries are dynamically linked
  - Linking occurs when program begins execution

Assembly Basics: Registers, Operands, Move#

x86‐64 Integer Registers#

Moving Data#

movq src, dst
Operand Types
- Immediate: Constant integer data
  - E.g., $0x400, $-533
  - Like C constant, but prefixed with $
  - Encoded with 1, 2, or 4 bytes
- Register: One of 16 integer registers
  - E.g., %rax, %r13
  - But %rsp reserved for special use
  - Others have special uses for particular instructions
- Memory: 8 consecutive bytes of memory at address given by register
  - Simplest example: (%rax)
  - Various other “address modes”

Complete Memory Addressing Modes#

D(Rb, Ri, S), Mem[Reg[Rb] + S * Reg[Ri] + D]
- D: Constant “displacement” 1, 2, or 4 bytes
- Rb: Base register: Any of 16 integer registers
- Ri: Index register: Any, except for %rsp
- S: Scale (1, 2, 4, or 8)
Special Cases
- (Rb, Ri): Mem[Reg[Rb] + Reg[Ri]]
- D(Rb, Ri): Mem[Reg[Rb] + Reg[Ri] + D]
- (Rb, Ri, S): Mem[Reg[Rb] + S * Reg[Ri]]

Arithmetic & Logical Operations#

Address Computation Instruction#

leaq src, dst
- src is address mode expression
- Set dst to address denoted by expression
Uses
- Computing addresses without a memory reference
  - E.g., translation of p = &x[i];
- Computing arithmetic expressions of the form x + k * y
  - k equals to 1, 2, 4, or 8

Here is an example of computing arithmetic expression with leaq:

1
long m12(long x) {
2
  return x * 12;
3
}

Converted to ASM by compiler:

1
leaq (%rdi, %rdi, 2), %rax # t <- x + x * 2
2
salq $2, %rax              # return t << 2

A bit more complex one:

1
long arith(long x, long y, long z) {
2
  long t1 = x + y;
3
  long t2 = z + t1;
4
  long t3 = x + 4;
5
  long t4 = y * 48;
6
  long t5 = t3 + t4;
7
  long rval = t2 * t5;
8

9
  return rval;
10
}

1
arith:
2
  leaq (%rdi, %rsi), %rax     # t1
3
  addq %rdx, %rax             # t2
4
  leaq (%rsi, %rsi, 2), %rdx
5
  salq $4, %rdx               # t4
6
  leaq 4(%rdi, %rdx), %rcx    # t5
7
  imulq $rcx, %rax            # rval
8
  ret

Control: Condition Codes#

Condition Codes (Implicit Setting)#

Think of it as a side effect by arithmetic operations.

leaq won’t effect flags.

Condition Codes (Explicit Setting)#

Compare Instrucion
- cmpq src2, src1: computing src1 - src2 without setting destination
Test Instruction
- testq src2, src1: computing src1 & src2 without setting destination
- Useful to have one of the operands be a mask

Reading Condition Codes#

setX Instructions
- Set low‐order byte of destination (must one of addressable byte registers) to 0 or 1 based on combinations of condition codes
- Does not alter remaining 7 bytes
  - Typically use movzbl (Move with Zero-Extend from Byte to Long) to finish job
    - 32‐bit instructions also set upper 32 bits to 0

NOTE
Any computation where the result is a 32-bit result, it will zero out the higher 32-bits of the register. And its different for example the byte level operations only affect the bytes, the word operations only affect two bytes.

setX	Condition	Description
`sete`	`ZF`	Equal / Zero
`setne`	`~ZF`	Not Equal / Not Zero
`sets`	`SF`	Negative
`setns`	`~SF`	Nonnegative
`setg`	`~(SF ^ OF) & ~ZF`	Greater (Signed)
`setge`	`~(SF ^ OF)`	Greater or Equal (Signed)
`setl`	`(SF ^ OF)`	Less (Signed)
`setle`	`(SF ^ OF) \| ZF`	Less or Equal (Signed)
`seta`	`~CF & ~ZF`	Above (Unsigned)
`setb`	`CF`	Below (Unsigned)

NOTE
sil, dil, spl, bpl are all 1 byte registers.

1
int gt(long x, long y) {
2
  return x > y;
3
}

1
cmpq   %rsi, %rdi # Compare x:y
2
setg   %al        # Set %al to 1 when >
3
movzbl %al, %eax  # Zero rest of %rax
4
ret

Conditional Branches#

Jumping#

jX Instructions
- Jump to different part of code depending on condition codes

jX	Condition	Description
`jmp`	`1`	Unconditional
`je`	`ZF`	Equal / Zero
`jne`	`~ZF`	Not Equal / Not Zero
`js`	`SF`	Negative
`jns`	`~SF`	Nonnegative
`jg`	`~(SF ^ OF) & ~ZF`	Greater (Signed)
`jge`	`~(SF ^ OF)`	Greater or Equal (Signed)
`jl`	`SF ^ OF`	Less (Signed)
`jle`	`(SF ^ OF) \| ZF`	Less or Equal (Signed)
`ja`	`~CF & ~ZF`	Above (Unsigned)
`jb`	`CF`	Below (Unsigned)

Using Conditional Moves#

Conditional Move Instructions
- Instruction supports: if (Test) Dest <- Src
- Supported in post-1995 x86 processors
- GCC tries to use them (but, only when known to be safe)
  - Branches are very disruptive to instruction flow through pipelines
  - Conditional moves do not require control transfer

Here is a simple example of conditional move:

1
long absdiff(long x, long y) {
2
  long result;
3
  if (x > y)
4
    result = x - y;
5
  else
6
    result = y - x;
7
  return result;
8
}

1
absdiff:
2
  movq   %rdi, %rax # x
3
  subq   %rsi, %rax # result = x - y
4
  movq   %rsi, %rdx
5
  subq   %rdi, %rdx # eval = y - x
6
  cmpq   %rsi, %rdi # x:y
7
  cmovle %rdx, %rax # if <=, result = eval
8
  ret

Bad Cases for Conditional Move#

Expensive Computations: val = Test(x) ? Hard1(x) : Hard2(x);
- Both values get computed
- Only makes sense when computations are very simple
Risky Computations: val = p ? *p : 0;
- Both values get computed
- May have undesirable effects
Computations with side effects: val = x > 0 ? x *= 7 : x += 3;
- Both values get computed
- Must be side-effect free

Loops#

IMPORTANT
Each kind of loop will convert to their goto version and then assembly code.

”Do-While” Loop#

C Code:

1
long pcount_do(unsigned long x) {
2
  long result = 0;
3
  do {
4
    result += x & 0x1;
5
    x >>= 1;
6
  } while (x);
7
  return result;
8
}

Goto Version:

1
long pcount_goto(unsigned long x) {
2
  long result = 0;
3
loop:
4
  result += x & 0x1;
5
  x >>= 1;
6
  if (x) goto loop;
7
  return result;
8
}

Assembly Code:

1
  movl $0, %eax   # result = 0
2
.L2:              # loop:
3
  movq %rdi, %rdx
4
  andl $1, %edx   # t = x & 0x1
5
  addq %rdx, %rax # result += t
6
  shrq %rdi       # x >>= 1
7
  jne  .L2        # if (x) goto loop
8
  rep; ret

General “Do-While” Translation#

1
do {
2
  Body
3
} while (Test);

1
loop:
2
  Body
3
  if (Test)
4
    goto loop

”While” Loop#

General “While” Translation 1 (“Jump‐to-middle” translation, compiled with `-Og`)#

1
while (Test) {
2
  Body
3
}

1
  goto test
2
loop:
3
  Body
4
test:
5
  if (Test)
6
    goto loop
7
done:

General “While” Translation 2 (Compiled with `-O1`)#

1
while (Test) {
2
  Body
3
}

To Do-While Version:

1
  if (!Test)
2
    goto done;
3
  do {
4
    Body
5
  } while (Test);
6
done:

Goto Version:

1
  if (!Test)
2
    goto done;
3
loop:
4
  Body
5
  if (Test)
6
    goto loop;
7
done:

The initial conditional guards entrance to loop.

”For” Loop#

”For” Loop “While” Conversion#

For Version:

1
for (Init; Test; Update)
2
  Body

While Version:

1
Init;
2
while (Test) {
3
  Body
4
  Update;
5
}

“For” Loop “Do-While” Conversion#

Initial test can be optimized away.

Switch Statements#

Switch Statement Example#

1
long switch_eg(long x, long y, long z) {
2
  long w = 1;
3
  switch (x) {
4
    case 1:
5
      w = y * z;
6
      break;
7
    case 2:
8
      w = y / z;
9
    /* Fall Through */
10
    case 3:
11
      w += z;
12
      break;
13
    case 5:
14
    case 6:
15
      w -= z;
16
      break;
17
    default:
18
      w = 2;
19
    }
20
  return w;
21
}

Multiple case labels: 5 & 6
Fall through cases: 2
Missing cases: 4

Switch Statement Example in Assembly#

1
switch_eg:
2
  movq %rdx, %rcx
3
  cmpq $6, %rdi        # x:6
4
  ja   .L8             # use default
5
  jmp  *.L4(, %rdi, 8) # goto *JTab[x]

Note that w not initialized here.

TIP
ja .L8 considering the result is unsigned, so its a smart way to tackle negative number.

Jump Table Structure#

Jump Table#

1
.section .rodata
2
  .align 8
3
.L4:
4
  .quad .L8 # x = 0
5
  .quad .L3 # x = 1
6
  .quad .L5 # x = 2
7
  .quad .L9 # x = 3
8
  .quad .L8 # x = 4
9
  .quad .L7 # x = 5
10
  .quad .L7 # x = 6

Table Structure
- Each target requires 8 bytes
- Base address at .L4
Jumping
- Direct: jmp .L8
  - Jump target is denoted by label .L8
- Indirect: jmp *.L4(, %rdi, 8)
  - Start of jump table: .L4
  - Must scale by factor of 8 (addresses are 8 bytes)
  - Fetch target from effective address .L4 + %rdi * 8
    - Only for 0 <= x <= 6

Procedures#

Stack Structure#

x86-64 Stack#

Region of memory managed with stack discipline
Grows toward lower addresses
Register %rsp contains lowest stack address
- address of “top” element

Push#

pushq src
- Fetch operand at src
- Decrement %rsp by 8
- Write operand at address given by %rsp

Pop#

popq dest
- Read value at address given by %rsp
- Increment %rsp by 8
- Store value at dest (must be register)

Calling Conventions#

Procedure Control Flow#

Use stack to support procedure call and return
Procedure call: call label
- Push return address (address of the next instruction right after call) on stack
- Jump to label
Procedure return: ret
- Pop address from stack
- Jump to address

Managing local data#

First 6 arguments are passing by registers: %rdi, %rsi, %rdx, %rcx, %r8, %r9, other more arguments will store in stack
Only allocate stack space when needed
Return value store in %rax

Stack Frames#

Contents
- Return information
- Local storage (if needed)
- Temporary space (if needed)
Management
- Space allocated when enter procedure
  - “Set-up” code
  - Includes push by call instruction
- Deallocated when return
  - “Finish” code
  - Includes pop by ret instruction

x86-64 / Linux Stack Frame#

Current Stack Frame (“Top” to Bottom)
- Argument build: Parameters for function about to call
Local variables
- If can’t keep in registers
Saved register context
Old frame pointer (optional)
Caller Stack Frame
- Return address
- Arguments for this call

NOTE
As we often see the program often allocates more space on the stack than it really needs to, its because some conventions about trying to keep addresses on aligned.

Register Saving Conventions#

Caller Saved
- Caller saves temporary values in its frame before the call
Callee Saved
- Callee saves temporary values in its frame before using
- Callee restores them before returning to caller

x86-64 Linux Register Usage 1#

%rax
- Return value
- Also caller-saved
- Can be modified by procedure
%rdi, %rsi, %rdx, %rcx, %r8, %r9
- Arguments
- Also caller-saved
- Can be modified by procedure
%r10, %r11
- Caller-saved
- Can be modified by procedure

x86-64 Linux Register Usage 2#

%rbx, %r12, %r13, %r14
- Callee-saved
- Callee must save & restore
%rbp
- Callee-saved
- Callee must save & restore
- May be used as frame pointer
- Can mix & match
%rsp
- Special form of callee save
- Restored to original value upon exit from procedure

Recursion Example#

1
long pcount_r(unsigned long x) {
2
  if (x == 0)
3
    return 0;
4
  else:
5
    return (x & 1) + pcount_r(x >> 1);
6
}

1
pcount_r:
2
  movl  $0, %eax
3
  testq %rdi, %rdi
4
  je    .L6
5
  pushq %rbx
6
  movq  %rdi, %rbx
7
  andl  $1, %ebx
8
  shrq  %rdi
9
  call  pcount_r
10
  addq  %rbx, %rax
11
  popq  %rbx
12
.L6:
13
  rep; ret

Handled Without Special Consideration (just using normal calling conventions)
- Stack frames mean that each function call has private storage
  - Saved registers & local variables
  - Saved return pointer
- Register saving conventions prevent one function call from corrupting another’s data
  - Unless the C code explicitly does so
- Stack discipline follows call / return pattern
  - If P calls Q, then Q returns before P
  - Last-In, First-Out
Also works for mutual recursion
- P calls Q; Q calls P

Arrays#

One-dimensional Array#

Array Allocation#

T A[L];
- Array of data type T and length L
- Contiguously allocated region of L * sizeof(T) bytes in memory

Array Access#

T A[L];
- Array of data type T and length L
- Identifier A can be used as a pointer to array element 0: Type T *

Array Accessing Example#

1
#define ZLEN 5
2

3
typedef int zip_dig[ZLEN];
4

5
int get_digit(zip_dig z, int digit) {
6
  return z[digit];
7
}

1
movl (%rdi, %rsi, 4), %eax # z[digit]

Register %rdi contains starting address of array
Register %rsi contains array index
Desired digit at %rdi + 4 * %rsi

Array Loop Example#

1
void zincr(zip_dig z) {
2
  size_t i;
3
  for (i = 0; i < ZLEN; i++) {
4
    z[i]++;
5
  }
6
}

1
  movl $0, %eax            # i = 0
2
  jmp .L3                  # goto middle
3
.L4:                       # loop:
4
  addl $1, (%rdi, %rax, 4) # z[i]++
5
  addq $1, %rax            # i++
6
.L3:                       # middle
7
  cmpq $4, %rax            # i:4
8
  jbe .L4                  # if <=, goto loop
9
  rep; ret

Multidimensional (Nested) Arrays#

T A[R][C]
- 2D array of data type T
- R rows, C columns
- Type T element requires K bytes
Array Size
- R * C * K bytes
Arrangement
- Row-Major Ordering

Nested Array Example#

1
#define ZLEN 5
2
#define PCOUNT 4
3

4
typedef int zip_dig[ZLEN];
5

6
zip_dig pgh[PCOUNT] = {
7
  {1, 5, 2, 0, 6},
8
  {1, 5, 2, 1, 3},
9
  {1, 5, 2, 1, 7},
10
  {1, 5, 2, 2, 1}
11
}

zip_dig pgh[4] is equivalent to int pgh[4][5].

Nested Array Row Access#

Row Vectors
- A[i] is array of C elements
- Each element of type T requires K bytes
- Starting address A + i*(C*K)

Nested Array Row Access Example#

1
int *get_pgh_zip(int index) {
2
  return pgh[index];
3
}

1
leaq (%rdi, %rdi, 4), %rax # 5*index
2
leaq pgh(, %rax, 4), %rax  # pgh + (20*index)

Row Vector
- pgh[index] is array of 5 int’s
- Starting address pgh + 20*index
Machine Code
- Computes and returns address
- Compute as pgh + 4*(index + 4*index)

Nested Array Element Access Example#

Array Elements
- A[i][j] is element of type T, which requires K bytes
- Address A + i*(C*K) + j*K = A + (i*C + j)*K

Nested Array Element Access Example#

1
int get_pgh_digit(int index, int dig) {
2
  return pgh[index][dig];
3
}

1
leaq (%rdi, %rdi, 4), %rax # 5*index
2
addl %rax, %rsi            # 5*index + dig
3
movl pgh(, %rsi, 4), %eax  # M[pgh + 4*(5*index + dig)]

Array Elements
- pgh[index][dig] is int
- Address: Mem[pgh + 20*index + 4*dig] = Mem[pgh + 4*(5*index + dig)]

Multi-Level Array#

1
zip_dig cmu = {1, 5, 2, 1, 3};
2
zip_dig mit = {0, 2, 1, 3, 9};
3
zip_dig ucb = {9, 4, 7, 2, 0};
4

5
#define UCOUNT 3
6

7
int *univ[UCOUNT] = {mit, cmu, ucb};

Variable univ denotes array of 3 elements
Each element is a pointer (8 bytes)
Each pointer points to array of int’s

Element Access in Multi-Level Array#

1
int get_univ_digit(size_t index, size_t digit) {
2
  return univ[index][digit];
3
}

1
salq $2, %rsi              # 4*digit
2
addq univ(, %rdi, 8), %rsi # p = univ[index] + 4*digit
3
movl (%rsi), %eax          # return *p
4
ret

Element access Mem[Mem[univ + 8*index] + 4*digit]
Must do two memory reads
- First get pointer to row array
- Then access element within array

N x N Matrix#

Fixed dimensions
- Know value of N at compile time

1
#define N 16
2

3
typedef int fix_matrix[N][N];
4

5
/* Get element A[i][j] */
6
int fix_ele(fix_matrix A, size_t i, size_t j) {
7
  return A[i][j];
8
}

Variable dimensions, explicit indexing
- Traditional way to implement dynamic arrays

1
#define IDX(n, i, j) ((i)*(n)+(j))
2

3
/* Get element A[i][j] */
4
int vec_ele(size_t n, int *A, size_t i, size_t j) {
5
  return A[IDX(n,i,j)];
6
}

Variable dimensions, explicit indexing
- Now supported by gcc

1
/* Get element A[i][j] */
2
int var_ele(size_t n, int A[n][n], size_t i, size_t j) {
3
  return A[i][j];
4
}

N x N Matrix Access#

Array Elements
- size_t n;
- int A[n][n];
- Address A + i*(C*K) + j*K
- C = n, K = 4
- Must perform integer multiplication

1
/* Get element A[i][j] */
2
int var_ele(size_t n, int A[n][n], size_t i, size_t j) {
3
  return A[i][j];
4
}

1
imulq %rdx, %rdi           # n*i
2
leaq (%rsi, %rdi, 4), %rax # A + 4*n*i
3
movl (%rax, %rcx, 4), %eax # A + 4*n*i + 4*j
4
ret

Understanding Pointers & Array#

Cmp: Compiles (Y/N)
Bad: Possible bad pointer reference (Y/N)
Size: Value returned by sizeof

Structures#

Allocation#

Structure represented as block of memory
- Big enough to hold all of the fields
Fields ordered according to declaration
- Even if another ordering could yield a more compact representation
Compiler determines overall size + positions of fields
- Machine-level program has no understanding of the structures in the source code

Access#

Generating Pointer to Structure Member#

1
struct rec {
2
  int a[4];
3
  size_t i;
4
  struct rec *next;
5
};
6

7
int *get_ap(struct rec *r, size_t idx) {
8
  return &r->a[idx];
9
}

1
leaq (%rdi, %rsi, 4), %rax
2
ret

Offset of each structure member determined at compile time
Compute as r + 4*idx

Following Linked List#

1
struct rec {
2
  int a[4];
3
  int i;
4
  struct rec *next;
5
};
6

7
void set_val(struct rec *r, int val) {
8
  while (r) {
9
    int i = r->i;
10
    r->a[i] = val;
11
    r = r->next;
12
  }
13
}

1
.L11:                        # loop:
2
  movslq 16(%rdi), %rax      # i = M[r + 16]
3
  movl %esi, (%rdi, %rax, 4) # M[r + 4*i] = val
4
  movq 24(%rdi), %rdi        # r = M[r + 24]
5
  testq %rdi, %rdi           # Test r
6
  jne .L11                   # if != 0 goto loop

Alignment#

1
struct S1 {
2
  char c;
3
  int i[2];
4
  double v;
5
} *p;

Unaligned Data

Aligned Data
- Primitive data type requires K bytes
- Address must be multiple of K
- Required on some machines; advised on x86-64

Motivation for Aligning Data
- Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
  - Inefficient to load or store datum that spans quad word boundaries
  - Virtual memory trickier when datum spans 2 pages
Compiler
- Inserts gaps in structure to ensure correct alignment of fields

Specific Cases of Alignment (x86-64)#

1 byte: char, …
- no restrictions on address
2 bytes: short, …
- lowest 1 bit of address must be $0_{2}$
4 bytes: int, float, …
- lowest 2 bits of address must be $00_{2}$
8 bytes: double, long, char *, …
- lowest 3 bits of address must be $000_{2}$

IMPORTANT
If an address requires alignment to $2^{n}$ bytes, then its lowest $n$ bits must be $0$ .

Satisfying Alignment with Structures#

Within structure
- Must satisfy each element’s alignment requirement
Overall structure placement
- Each structure has alignment requirement K
  - K is largest alignment of any element
- Initial address & structure length must be multiples of K
Example

1
struct S2 {
2
  double v;
3
  int i[2];
4
  char c;
5
} *p;

Arrays of Structures#

Overall structure length multiple of K
Satisfy alignment requirement for every element

1
struct S2 {
2
  double v;
3
  int i[2];
4
  char c;
5
} a[10];

Accessing Array Elements#

Compute array offset 12*idx
- sizeof(S3), including alignment spacers
Element j is at offset 8 within structure
Assembler gives offset a+8
- Resolved during linking

1
struct S3 {
2
  short i;
3
  float v;
4
  short j;
5
} a[10];

1
short get_j(int idx) {
2
  return a[idx].j;
3
}

1
leaq   (%rdi, %rdi, 2), %rax # 3*idx
2
movzwl a+8(, %rax, 4), %eax

Saving Space#

Put large data types first

1
struct S4 {
2
  char c;
3
  int i;
4
  char d;
5
} *p;

1
struct S5 {
2
  int i;
3
  char c;
4
  char d;
5
} *p;

Floating Point#

x87 FP#

Legacy, very ugly

SSE3 FP#

XMM Registers#

Scalar & SIMD Operations#

SIMD (Single instruction, multiple data), there are a ton of usage combinations, e.g.:

addss (ADD Scalar Single-Precision Floating-Point Values)
addps (ADD Packed Single-Precision Floating-Point Values)

Basics#

Arguments passed in %xmm0, %xmm1, …
Result returned in %xmm0
All XMM registers caller-saved

1
float fadd(float x, float y) {
2
  return x + y;
3
}

1
addss %xmm1, %xmm0
2
ret

1
double dadd(double x, double y) {
2
  return x + y;
3
}

1
addsd %xmm1, %xmm0
2
ret

Memory Referencing#

Integer (and pointer) arguments passed in regular registers
Floating Point values passed in XMM registers
Different mov instructions to move between XMM registers, and between memory and XMM registers

1
double dincr(double *p, double v) {
2
  double x = *p;
3
  *p = x + v;
4
  return x;
5
}

1
movapd %xmm0, %xmm1 # Copy v
2
movsd (%rdi), %xmm0 # x = *p
3
addsd %xmm0, %xmm1  # t = x + v
4
movsd %xmm1, (%rdi) # *p = t
5
ret

movapd (Move Aligned Packed Double-Precision Floating-Point Values)

Other Aspects of Floating Point Instructions#

Lots of instructions
- Different operations, different formats, …
Floating-point comparisons
- Instructions ucomiss (Unordered Compare Scalar Single-Precision) and ucomisd (Unordered Compare Scalar Double-Precision)
- Set condition codes CF, ZF, and PF
Using constant values
- Set XMM0 register to 0 with instruction xorpd %xmm0, %xmm0
- Others loaded from memory

AVX FP#

Newest version
Similar to SSE

Floating Point assembly instruction sets are very nasty, though it’s principle thought is simple. So I am just skipping this chapter as TODO. Bro really don’t want learn this chapter…

Memory Layout#

x86-64 Linux Memory Layout#

Stack
- Runtime stack (8MB limit, check by limit command)
- E.g., local variables
Heap
- Dynamically allocated as needed
- When call malloc(), calloc(), new()
Data
- Statically allocated data
- E.g., global vars, static vars, string constants
Text / Shared Libraries
- Executable machine instructions
- Read-only

TIP
In common, the canonical address range in x86-64 is 47 bits address. That is why the maximum address show as 0x00007FFFFFFFFFFF.

Unions#

Allocation#

Allocate according to largest element
Can only use one field at a time

1
union U1 {
2
  char c;
3
  int i[2];
4
  double v;
5
} *up;

1
struct S1 {
2
  char c;
3
  int i[2];
4
  double v;
5
} *sp;

Program Optimization#

There’s more to performance than asymptotic complexity
Must optimize at multiple levels: algorithm, data representations, procedures, and loops

Optimizing Compilers#

Provide efficient mapping of program to machine
- register allocation
- code selection and ordering (scheduling)
- dead code elimination
- eliminating minor inefficiencies
Don’t (usually) improve asymptotic efficiency
- up to programmer to select best overall algorithm
- Big-O savings are (often) more important than constant factors
  - but constant factors also matter
Have difficulty overcoming “optimization blockers”
- potential memory aliasing
- potential procedure side-effects

Limitations of Optimizing Compilers#

Operate under fundamental constraint
- Must not cause any change in program behavior
  - Except, possibly when program making use of nonstandard language features
- Often prevents it from making optimizations that would only affect behavior under pathological conditions.
Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
- E.g., Data ranges may be more limited than variable types suggest
Most analysis is performed only within procedures
- Whole-program analysis is too expensive in most cases
- Newer versions of GCC do interprocedural analysis within individual files
  - But, not between code in different files
Most analysis is based only on static information
- Compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative

Generally Useful Optimizations#

Optimizations that you or the compiler should do regardless of processor / compiler

Code Motion
- Reduce frequency with which computation performed
  - If it will always produce same result
  - Especially moving code out of loop

1
void set_row(double *a, double *b, long i, long n) {
2
  long j;
3
  for (j = 0; j < n; j++)
4
    a[n*i+j] = b[j];
5
}

1
void set_row(double *a, double *b, long i, long n) {
2
  long j;
3
  int ni = n*i;
4
  for (j = 0; j < n; j++)
5
    a[ni+j] = b[j];
6
}

Compiler-Generated Code Motion (-O1)#

1
void set_row(double *a, double *b, long i, long n) {
2
  long j;
3
  for (j = 0; j < n; j++)
4
    a[n*i+j] = b[j];
5
}

1
set_row:
2
  testq %rcx, %rcx             # Test n
3
  jle .L1                      # If 0, goto done
4
  imulq %rcx, %rdx             # ni = n*i
5
  leaq (%rdi , %rdx, 8), %rdx  # rowp = A + ni*8
6
  movl $0, %eax                # j = 0
7
.L3:                           # loop:
8
  movsd (%rsi, %rax, 8), %xmm0 # t = b[j]
9
  movsd %xmm0, (%rdx, %rax, 8) # M[A + ni*8 + j*8] = t
10
  addq $1, %rax                # j++
11
  cmpq %rcx, %rax              # j:n
12
  jne .L3                      # if !=, goto loop
13
.L1:                           # done:
14
  rep ; ret

To the C code:

1
void set_row(double *a, double *b, long i, long n) {
2
  long j;
3
  long ni = n*i;
4
  double *rowp = a+ni;
5
  for (j = 0; j < n; j++)
6
    *rowp++ = b[j];
7
}

Reduction in Strength#

Replace costly operation with simpler one
Shift, add instead of multiply or divide
- Utility machine dependent
- Depends on cost of multiply or divide instruction
- Recognize sequence of products

1
for (i = 0; i < n; i++) {
2
  int ni = n*i;
3
  for (j = 0; j < n; j++)
4
    a[ni + j] = b[j];
5
}

1
int ni = 0;
2
for (i = 0; i < n; i++) {
3
  for (j = 0; j < n; j++)
4
    a[ni + j] = b[j];
5
  ni += n;
6
}

Reuse portions of expressions
GCC will do this with –O1

1
up = val[(i-1)*n + j ];
2
down = val[(i+1)*n + j ];
3
left = val[i*n + j-1];
4
right = val[i*n + j+1];
5
sum = up + down + left + right;

1
leaq 1(%rsi), %rax # i+1
2
leaq -1(%rsi), %r8 # i-1
3
imulq %rcx, %rsi   # i*n
4
imulq %rcx, %rax   # (i+1)*n
5
imulq %rcx, %r8    # (i-1)*n
6
addq %rdx, %rsi    # i*n+j
7
addq %rdx, %rax    # (i+1)*n+j
8
addq %rdx, %r8     # (i-1)*n+j

3 multiplications: i*n, (i–1)*n, (i+1)*n

1
long inj = i*n + j;
2
up = val[inj - n];
3
down = val[inj + n];
4
left = val[inj - 1];
5
right = val[inj + 1];
6
sum = up + down + left + right;

1
imulq %rcx, %rsi        # i*n
2
addq %rdx, %rsi         # i*n+j
3
movq %rsi, %rax         # i*n+j
4
subq %rcx, %rax         # i*n+j-n
5
leaq (%rsi, %rcx), %rcx # i*n+j+n

1 multiplication: i*n

Optimization Blockers#

NOTE
I’m just skipping this chapter, so the notes are seems disordered. Will retake this chapter later.

Optimization Blocker: Procedure Calls#

1
void lower(char *s) {
2
  size_t i;
3
  for (i = 0; i < strlen(s); i++)
4
    if (s[i] >= 'A' && s[i] <= 'Z')
5
      s[i] -= ('A' - 'a');
6
}

strlen executed every iteration
Time quadruples when double string length
Quadratic performance

Calling strlen#

1
/* My version of strlen */
2
size_t strlen(const char *s) {
3
  size_t length = 0;
4
  while (*s != '\0') {
5
    s++;
6
    length++;
7
  }
8
  return length;
9
}

strlen performance
- Only way to determine length of string is to scan its entire length, looking for null character
Overall performance, string of length N
- N calls to strlen
- Require times N, N-1, N‐2, …, 1
- Overall $O(N^{2})$ performance

Improving Performance#

1
void lower(char *s) {
2
  size_t i;
3
  size_t len = strlen(s);
4
  for (i = 0; i < len; i++)
5
    if (s[i] >= 'A' && s[i] <= 'Z')
6
      s[i] -= ('A' - 'a');
7
}

Move call to strlen outside of loop
Since result does not change from one iteration to another
Form of code motion

Improved Lower Case Conversion Performance#

Why couldn’t compiler move strlen out of inner loop ?#

Procedure may have side effects
- Alters global state each time called
Function may not return same value for given arguments
- Depends on other parts of global state
- Procedure lower could interact with strlen
Warning
- Compiler treats procedure call as a black box
- Weak optimizations near them
Remedies
- Use of inline functions
  - GCC does this with –O1 within single file
- Do your own code motion

Memory Matters#

1
/* Sum rows is of n X n matrix a
2
   and store in vector b */
3
void sum_rows1(double *a, double *b, long n) {
4
  long i, j;
5
  for (i = 0; i < n; i++) {
6
    b[i] = 0;
7
    for (j = 0; j < n; j++)
8
      b[i] += a[i*n + j];
9
  }
10
}

1
# sum_rows1 inner loop
2
.L4:
3
  movsd (%rsi, %rax, 8), %xmm0 # FP load
4
  addsd (%rdi), %xmm0          # FP add
5
  movsd %xmm0, (%rsi, %rax, 8) # FP store
6
  addq $8, %rdi
7
  cmpq %rcx, %rdi
8
  jne .L4

Memory Aliasing#

Code updates b[i] on every iteration
Must consider possibility that these updates will affect program behavior

Removing Aliasing#

1
/* Sum rows is of n X n matrix a
2
   and store in vector b */
3
void sum_rows2(double *a, double *b, long n) {
4
  long i, j;
5
  for (i = 0; i < n; i++) {
6
    double val = 0;
7
    for (j = 0; j < n; j++)
8
      val += a[i*n + j];
9
    b[i] = val;
10
  }
11
}

1
# sum_rows2 inner loop
2
.L10:
3
  addsd (%rdi), %xmm0 # FP load + add
4
  addq $8, %rdi
5
  cmpq %rax, %rdi
6
  jne .L10

No need to store intermediate results

Optimization Blocker: Memory Aliasing#

Aliasing
- Two different memory references specify single location
- Easy to have happen in C
  - Since allowed to do address arithmetic
  - Direct access to storage structures
- Get in habit of introducing local variables
  - Accumulating within loops
  - Your way of telling compiler not to check for aliasing

Exploiting Instruction‐Level Parallelism#

TODO

Dealing with Conditionals#

TODO

The Memory Hierarchy#

TODO

Cache Memories#

TODO

Linking#

Static Linking#

Programs are translated and linked using a compiler driver: gcc -Og -o prog main.c sum.c

Why Linkers ?#

Reason 1: Modularity
- Program can be written as a collection of smaller source files, rather than one monolithic mass
- Can build libraries of common functions
  - E.g., Math library, Standard C library
Reason 2: Efficiency
- Time: Separate compilation
  - Change one source file, compile, and then relink
  - No need to recompile other source files
- Space: Libraries
  - Common functions can be aggregated into a single file
  - Yet executable files and running memory images contain only code for the functions they actually use

What Do Linkers Do ?#

Step 1: Symbol resolution
- Programs define and reference symbols (global variables and functions)
  - void swap() { ... } /* define symbol swap */
  - swap(); /* reference symbol swap */
  - int *xp = &x; /* define symbol xp, reference x */
- Symbol definitions are stored in object file (by assembler) in symbol table
  - Symbol table is an array of structs
  - Each entry includes name, size, and location of symbol
- During symbol resolution step, the linker associates each symbol reference with exactly one symbol definition
Step 2: Relocation
- Merges separate code and data sections into single sections
- Relocates symbols from their relative locations in the .o files to their final absolute memory locations in the executable
- Updates all references to these symbols to reflect their new positions

Three Kinds of Object Files (Modules)#

Relocatable object file (.o file)
- Contains code and data in a form that can be combined with other relocatable object files to form executable object file
  - Each .o file is produced from exactly one source (.c) file
Executable object file (a.out file)
- Contains code and data in a form that can be copied directly into memory and then executed
Shared object file (.so file)
- Special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run-time
- Called Dynamic Link Libraries (DLLs) by Windows

Executable and Linkable Format (ELF)#

Standard binary format for object files
One unified format for
- Relocatable object files (.o)
- Executable object files (a.out)
- Shared object files (.so)
Generic name: ELF binaries

ELF Object File Format#

ELF header
- Word size, byte ordering, file type (.o, exec, .so), machine type, etc
Segment header table (required for executables)
- Page size, virtual addresses memory segments (sections), segment sizes
.text section
- Code
.rodata section
- Read only data: jump tables, …
.data section
- Initialized global variables
.bss section
- Global variables that are uninitialized or initialized to zero
- Block Started by Symbol
- “Better Save Space”
- Has section header but occupies no space
.symtab section
- Symbol table
- Procedure and static variable names
- Section names and locations
.rel.text section
- Relocation info for .text section
- Addresses of instructions that will need to be modified in the executable
- Instructions for modifying
.debug section
- Info for symbolic debugging (gcc -g)
Section header table
- Offsets and sizes of each section

Linker Symbols#

Global symbols
- Symbols defined by module m that can be referenced by other modules
- E.g.: non-static C functions and non-static global variables
External symbols
- Global symbols that are referenced by module m but defined by some other module
Local symbols
- Symbols that are defined and referenced exclusively by module m
- E.g.: C functions and global variables defined with the static attribute
- Local linker symbols are not local program variables

Local non-static C variables vs. local static C variables#

Local non-static C variables stored on the stack
Local static C variables stored in either .bss, or .data

1
int f() {
2
  static int x = 0; /* .bss */
3
  return x;
4
}
5

6
int g() {
7
  static int x = 1; /* .data */
8
  return x;
9
}

In the case above, compiler allocates space in .data for each definition of x and creates local symbols in the symbol table with unique names, e.g., x.1 and x.2.

IMPORTANT
Local static variables are only initialized at first time encountered, calls after won’t reinitialize.

How Linker Resolves Duplicate Symbol Definitions#

Program symbols are either strong or weak
- Strong: procedures and initialized globals
- Weak: uninitialized globals

Linker’s Symbol Rules#

Rule 1: Multiple strong symbols are not allowed
- Each item can be defined only once
- Otherwise: Linker error
Rule 2: Given a strong symbol and multiple weak symbols, choose the strong symbol
- References to the weak symbol resolve to the strong symbol
Rule 3: If there are multiple weak symbols, pick an arbitrary one
- Can override this with gcc –fno-common

Global Variables#

Avoid if you can
Otherwise
- Use static if you can
- Initialize if you define a global variable
- Use extern if you reference an external global variable

Relocation#

Relocation Entries#

1
int array[2] = {1, 2};
2

3
int main() {
4
  int val = sum(array, 2);
5
  return val;
6
}

1
0000000000000020 <main>:
2
  20: be 02 00 00 00        mov    $0x2,%esi
3
  25: 48 8d 3d 00 00 00 00  lea    0x0(%rip),%rdi        # 2c <main+0xc>
4
   28: R_X86_64_PC32 array-0x4
5
  2c: e8 00 00 00 00        call   31 <main+0x11>
6
   2d: R_X86_64_PLT32 sum-0x4
7
  31: c3                    ret

Relocated .text section#

1
0000000000001120 <sum>:
2
    1120: ba 00 00 00 00        mov    $0x0,%edx
3
    1125: b8 00 00 00 00        mov    $0x0,%eax
4
    112a: eb 0d                 jmp    1139 <sum+0x19>
5
    112c: 0f 1f 40 00           nopl   0x0(%rax)
6
    1130: 48 63 c8              movslq %eax,%rcx
7
    1133: 03 14 8f              add    (%rdi,%rcx,4),%edx
8
    1136: 83 c0 01              add    $0x1,%eax
9
    1139: 39 f0                 cmp    %esi,%eax
10
    113b: 7c f3                 jl     1130 <sum+0x10>
11
    113d: 89 d0                 mov    %edx,%eax
12
    113f: c3                    ret
13

14
0000000000001140 <main>:
15
    1140: be 02 00 00 00        mov    $0x2,%esi
16
    1145: 48 8d 3d c4 2e 00 00  lea    0x2ec4(%rip),%rdi        # 4010 <array>
17
    114c: e8 cf ff ff ff        call   1120 <sum>
18
    1151: c3                    ret

Using PC-relative addressing for sum: 0x1120 = 0x1151 + 0xffffffcf

Loading Executable Object Files#

Packaging Commonly Used Functions#

How to package functions commonly used by programmers ?
- Math, I/O, memory management, string manipulation, etc
Awkward, given the linker framework so far:
- Option 1: Put all functions into a single source file
  - Programmers link big object file into their programs
  - Space and time inefficient
- Option 2: Put each function in a separate source file
  - Programmers explicitly link appropriate binaries into their programs
  - More efficient, but burdensome on the programmer

Old-fashioned Solution: Static Libraries#

Static libraries (.a archive files)
- Concatenate related relocatable object files into a single file with an index (called an archive)
- Enhance linker so that it tries to resolve unresolved external references by looking for the symbols in one or more archives
- If an archive member file resolves reference, link it into the executable

Creating Static Libraries#

Archiver allows incremental updates
Recompile function that changes and replace .o file in archive

Linking with Static Libraries#

Using Static Libraries#

Linker’s algorithm for resolving external references
- Scan .o files and .a files in the command line order
- During the scan, keep a list of the current unresolved references
- As each new .o or .a file, obj, is encountered, try to resolve each unresolved reference in the list against the symbols defined in obj
- If any entries in the unresolved list at end of scan, then error
Problem
- Command line order matters!
- Moral: put libraries at the end of the command line

1
unix> gcc -L. libtest.o -lmine
2
unix> gcc -L. -lmine libtest.o
3
libtest.o: In function `main':
4
libtest.o(.text+0x4): undefined reference to `libfun'

Shared Libraries#

Static libraries have the following disadvantages
- Duplication in the stored executables (every function needs libc)
- Duplication in the running executables
- Minor bug fixes of system libraries require each application to explicitly relink
Modern solution: Shared Libraries
- Object files that contain code and data that are loaded and linked into an application dynamically, at either load-time or run-time
- Also called: dynamic link libraries, DLLs, .so files
Dynamic linking can occur when executable is first loaded and run (load-time linking)
- Common case for Linux, handled automatically by the dynamic linker (ld-linux.so) Standard C library (libc.so) usually dynamically linked
Dynamic linking can also occur after program has begun (run-time linking)
- In Linux, this is done by calls to the dlopen() interface
  - Distributing software
  - High-performance web servers
  - Runtime library interpositioning
Shared library routines can be shared by multiple processes

Dynamic Linking at Load-time#

Dynamic Linking at Run-time#

1
#include <stdio.h>
2
#include <stdlib.h>
3
#include <dlfcn.h>
4

5
int x[2] = {1, 2};
6
int y[2] = {3, 4};
7
int z[2];
8

9
int main() {
10
  void *handle;
11
  void (*addvec)(int *, int *, int *, int);
12
  char *error;
13

14
  /* Dynamically load the shared library that contains addvec() */
15
  handle = dlopen("./libvector.so", RTLD_LAZY);
16
  if (!handle) {
17
    fprintf(stderr, "%s\n", dlerror());
18
    exit(1);
19
  }
20

21
  /* Get a pointer to the addvec() function we just loaded */
22
  addvec = dlsym(handle, "addvec");
23
  if ((error = dlerror()) != NULL) {
24
    fprintf(stderr, "%s\n", error);
25
    exit(1);
26
  }
27

28
  /* Now we can call addvec() just like any other function */
29
  addvec(x, y, z, 2);
30
  printf("z = [%d %d]\n", z[0], z[1]);
31

32
  /* Unload the shared library */
33
  if (dlclose(handle) < 0) {
34
    fprintf(stderr, "%s\n", dlerror());
35
    exit(1);
36
  }
37
  return 0;
38
}

Library Interpositioning#

Its a powerful linking technique that allows programmers to intercept calls to arbitrary functions
Interpositioning can occur at:
- Compile time: When the source code is compiled
- Link time: When the relocatable object files are statically linked to form an executable object file
- Load/Run time: When an executable object file is loaded into memory, dynamically linked, and then executed

Example Program#

1
#include <stdio.h>
2
#include <malloc.h>
3

4
int main() {
5
  int *p = malloc(32);
6
  free(p);
7
  return(0);
8
}

Goal: trace the addresses and sizes of the allocated and freed blocks, without breaking the program, and without modifying the source code
Three solutions: interpose on the lib malloc and free functions at compile time, link time, and load/run time

Compile-time Interpositioning#

1
#ifdef COMPILETIME
2
#include <stdio.h>
3
#include <malloc.h>
4

5
/* malloc wrapper function */
6
void *mymalloc(size_t size) {
7
  void *ptr = malloc(size);
8
  printf("malloc(%d)=%p\n", (int)size, ptr);
9
  return ptr;
10
}
11

12
/* free wrapper function */
13
void myfree(void *ptr) {
14
  free(ptr);
15
  printf("free(%p)\n", ptr);
16
}
17
#endif

1
#define malloc(size) mymalloc(size)
2
#define free(ptr) myfree(ptr)
3

4
void *mymalloc(size_t size);
5
void myfree(void *ptr);

1
linux> make intc
2
gcc -Wall -DCOMPILETIME -c mymalloc.c
3
gcc -Wall -I. -o intc int.c mymalloc.o
4
linux> make runc
5
./intc
6
malloc(32)=0x1edc010
7
free(0x1edc010)
8
linux>

Link-time Interpositioning#

1
#ifdef LINKTIME
2
#include <stdio.h>
3

4
void *__real_malloc(size_t size);
5
void __real_free(void *ptr);
6

7
/* malloc wrapper function */
8
void *__wrap_malloc(size_t size) {
9
  void *ptr = __real_malloc(size); /* Call libc malloc */
10
  printf("malloc(%d) = %p\n", (int)size, ptr);
11
  return ptr;
12
}
13

14
/* free wrapper function */
15
void __wrap_free(void *ptr) {
16
  __real_free(ptr); /* Call libc free */
17
  printf("free(%p)\n", ptr);
18
}
19
#endif

1
linux> make intl
2
gcc -Wall -DLINKTIME -c mymalloc.c
3
gcc -Wall -c int.c
4
gcc -Wall -Wl,--wrap,malloc -Wl,--wrap,free -o intl int.o mymalloc.o
5
linux> make runl
6
./intl
7
malloc(32) = 0x1aa0010
8
free(0x1aa0010)
9
linux>

The -Wl flag passes argument to linker, replacing each comma with a space
The --wrap,malloc arg instructs linker to resolve references in a special way:
- Refs to malloc should be resolved as __wrap_malloc
- Refs to __real_malloc should be resolved as malloc

Load/Run time Interpositioning#

1
#ifdef RUNTIME
2
#define _GNU_SOURCE
3
#include <stdio.h>
4
#include <stdlib.h>
5
#include <dlfcn.h>
6

7
/* malloc wrapper function */
8
void *malloc(size_t size) {
9
  void *(*mallocp)(size_t size);
10
  char *error;
11

12
  mallocp = dlsym(RTLD_NEXT, "malloc"); /* Get addr of libc malloc */
13
  if ((error = dlerror()) != NULL) {
14
    fputs(error, stderr);
15
    exit(1);
16
  }
17
  char *ptr = mallocp(size); /* Call libc malloc */
18
  printf("malloc(%d) = %p\n", (int)size, ptr);
19
  return ptr;
20
}
21

22
/* free wrapper function */
23
void free(void *ptr) {
24
  void (*freep)(void *) = NULL;
25
  char *error;
26

27
  if (!ptr)
28
    return;
29

30
  freep = dlsym(RTLD_NEXT, "free"); /* Get address of libc free */
31
  if ((error = dlerror()) != NULL) {
32
    fputs(error, stderr);
33
    exit(1);
34
  }
35
  freep(ptr); /* Call libc free */
36
  printf("free(%p)\n", ptr);
37
}
38
#endif

1
linux> make intr
2
gcc -Wall -DRUNTIME -shared -fpic -o mymalloc.so mymalloc.c -ldl
3
gcc -Wall -o intr int.c
4
linux> make runr
5
(LD_PRELOAD="./mymalloc.so" ./intr)
6
malloc(32) = 0xe60010
7
free(0xe60010)
8
linux>

The LD_PRELOAD environment variable tells the dynamic linker to resolve unresolved refs (e.g., to malloc) by looking in mymalloc.so first

Exceptional Control Flow#

Control Flow#

Processors do only one thing:
- From startup to shutdown, a CPU simply reads and executes (interprets) a sequence of instructions, one at a time
- This sequence is the CPU’s control flow (or flow of control)

Up to now, we have learned two mechanisms for changing control flow:

Jumps and branches
Call and return

They react to changes in program state.

But its insufficient for a useful system: difficult to react to changes in system state:

Data arrives from a disk or a network adapter
Instruction divides by zero
User hits Ctrl-C at the keyboard
System timer expires

That’s why we need mechanisms for “exceptional control flow”.

Exceptional Control Flow#

Exists at all levels of a computer system
Low level mechanisms
- Exceptions
  - Change in control flow in response to a system event (i.e., change in system state)
  - Implemented using combination of hardware and OS software
Higher level mechanisms
- Process context switch
  - Implemented by OS software and hardware timer
- Signals
  - Implemented by OS software
- Nonlocal jumps: setjmp() and longjmp()
  - Implemented by C runtime library

Exceptions#

An exception is a transfer of control to the OS kernel in response to some event (i.e., change in processor state)
- Kernel is the memory-resident part of the OS
- Examples of events: Divide by 0, arithmetic overflow, page fault, I/O request completes, typing Ctrl‐C

Exception Tables#

Synchronous Exceptions#

Caused by events that occur as a result of executing an instruction
- Traps
  - Intentional (e.g., system calls, breakpoint traps, special instructions)
  - Returns control to “next” instruction
- Faults
  - Unintentional but possibly recoverable (e.g., page faults (recoverable), protection faults (unrecoverable), floating point exceptions)
  - Either re-executes faulting (“current”) instruction or aborts
- Aborts
  - Unintentional and unrecoverable (e.g., illegal instruction, parity error, machine check)
  - Aborts current program

Asynchronous Exceptions (Interrupts)#

Caused by events external to the processor
- Indicated by setting the processor’s interrupt pin
- Handler returns to “next” instruction
Examples:
- Timer interrupt
  - Every few ms, an external timer chip triggers an interrupt
  - Used by the kernel to take back control from user programs
- I/O interrupt from external device
  - Hitting Ctrl-C at the keyboard
  - Arrival of a packet from a network
  - Arrival of data from a disk

Processes#

Definition: A process is an instance of a running program
- One of the most profound ideas in computer science
- Not the same as “program” or “processor”
Process provides each program with two key abstractions:
- Logical control flow
  - Each program seems to have exclusive use of the CPU
  - Provided by kernel mechanism called context switching
- Private address space
  - Each program seems to have exclusive use of main memory
  - Provided by kernel mechanism called virtual memory

Multiprocessing: The Illusion#

Computer runs many processes simultaneously
- Applications for one or more users
  - Web browsers, email clients, editors, …
- Background tasks
  - Monitoring network & I/O devices

Multiprocessing: The (Traditional) Reality#

Single processor executes multiple processes concurrently
- Process executions interleaved (multitasking)
- Address spaces managed by virtual memory system
- Register values for non-executing processes saved in memory
Save current registers in memory

Schedule next process for execution
Load saved registers and switch address space (context switch)

Multiprocessing: The (Modern) Reality#

Multicore processors
- Multiple CPUs on single chip
- Share main memory (and some of the caches)
- Each can execute a separate process
  - Scheduling of processors onto cores done by kernel

Concurrent Processes#

Each process is a logical control flow
Two processes run concurrently (are concurrent) if their flows overlap in time
Otherwise, they are sequential
Examples (running on single core):
- Concurrent: A & B, A & C
- Sequential: B & C

User View of Concurrent Processes#

Control flows for concurrent processes are physically disjoint in time
However, we can think of concurrent processes as running in parallel with each other

Context Switching#

Processes are managed by a shared chunk of memory-resident OS code called the kernel
- Important: the kernel is not a separate process, but rather runs as part of some existing process
Control flow passes from one process to another via a context switch

Process Control#

System Call Error Handling#

On error, Linux system-level functions typically return ‐1 and set global variable errno to indicate cause
Hard and fast rule:
- You must check the return status of every system-level function
- Only exception is the handful of functions that return void

1
if ((pid = fork()) < 0) {
2
  fprintf(stderr, "fork error: %s\n", strerror(errno));
3
  exit(0);
4
}

Error-reporting functions#

Can simplify somewhat using an error-reporting function:

1
/* Unix-style error */
2
void unix_error(char *msg) {
3
  fprintf(stderr, "%s: %s\n", msg, strerror(errno));
4
  exit(0);
5
}
6

7
if ((pid = fork()) < 0)
8
  unix_error("fork error");

Error-handling Wrappers#

Simplify the present code even further by using Stevens-style error-handling wrappers:

1
pid_t Fork(void) {
2
  pid_t pid;
3

4
  if ((pid = fork()) < 0)
5
    unix_error("Fork error");
6
  return pid;
7
}
8

9
pid = Fork();

Processes States#

From a programmer’s perspective, we can think of a process as being in one of three states:

Running
- Process is either executing, or waiting to be executed and will eventually be scheduled (i.e., chosen to execute) by the kernel
Stopped
- Process execution is suspended and will not be scheduled until further notice
Terminated
- Process is stopped permanently

Creating Processes#

Parent process creates a new running child process by calling fork
int fork(void)
- Returns 0 to the child process, child’s PID to parent process
- Child is almost identical to parent:
  - Child get an identical (but separate) copy of the parent’s virtual address space
  - Child gets identical copies of the parent’s open file descriptors
  - Child has a different PID than the parent
fork is interesting (and often confusing) because it is called once but returns twice

fork Example#

1
int main() {
2
  pid_t pid;
3
  int x = 1;
4

5
  pid = Fork();
6
  if (pid == 0) { /* Child */
7
    printf("child: x=%d\n", ++x);
8
    exit(0);
9
  }
10

11
  /* Parent */
12
  printf("parent: x=%d\n", --x);
13
  exit(0);
14
}

1
linux> ./fork
2
parent: x=0
3
child: x=2

Call once, return twice
Concurrent execution
- Can’t predict execution order of parent and child
Duplicate but separate address space
- x has a value of 1 when fork returns in parent and child
- Subsequent changes to x are independent
Shared open files
- stdout is the same in both parent and child

Terminating Processes#

Process becomes terminated for one of three reasons:
- Receiving a signal whose default action is to terminate
- Returning from the main routine
- Calling the exit function
void exit(int status)
- Terminates with an exit status of status
- Convention: normal return status is 0, nonzero on error
- Another way to explicitly set the exit status is to return an integer value from the main routine
exit is called once but never returns

Modeling fork with Process Graphs#

A process graph is a useful tool for capturing the partial ordering of statements in a concurrent program:
- Each vertex is the execution of a statement
- a ‐> b means a happens before b
- Edges can be labeled with current value of variables
- printf vertices can be labeled with output
- Each graph begins with a vertex with no inedges
Any topological sort of the graph corresponds to a feasible total ordering
- Total ordering of vertices where all edges point from left to right

Process Graph Example#

1
int main() {
2
  pid_t pid;
3
  int x = 1;
4

5
  pid = Fork();
6
  if (pid == 0) { /* Child */
7
    printf("child: x=%d\n", ++x);
8
    exit(0);
9
  }
10

11
  /* Parent */
12
  printf("parent: x=%d\n", --x);
13
  exit(0);
14
}

Interpreting Process Graphs#

Original graph:

Relabled graph:

Feasible total ordering:

Infeasible total ordering:

fork Example: Two consecutive forks#

1
void fork2() {
2
  printf("L0\n");
3
  fork();
4
  printf("L1\n");
5
  fork();
6
  printf("Bye\n");
7
}

fork Example: Nested forks in parent#

1
void fork4() {
2
  printf("L0\n");
3
  if (fork() != 0) {
4
    printf("L1\n");
5
    if (fork() != 0) {
6
      printf("L2\n");
7
    }
8
  }
9
  printf("Bye\n");
10
}

fork Example: Nested forks in children#

1
void fork5() {
2
  printf("L0\n");
3
  if (fork() == 0) {
4
    printf("L1\n");
5
    if (fork() == 0) {
6
      printf("L2\n");
7
    }
8
  }
9
  printf("Bye\n");
10
}

Reaping Child Processes#

Idea
- When process terminates, it still consumes system resources (e.g., Exit status, various OS tables)
- Called a zombie (Living corpse, half alive and half dead)
Reaping
- Performed by parent on terminated child (using wait or waitpid)
- Parent is given exit status information
- Kernel then deletes zombie child process
What if parent doesn’t reap ?
- If any parent terminates without reaping a child, then the orphaned child will be reaped by init process (pid == 1)
- So, only need explicit reaping in long-running processes (e.g., shells and servers)

Zombie Example#

1
void fork7() {
2
  if (fork() == 0) {
3
    /* Child */
4
    printf("Terminating Child, PID = %d\n", getpid());
5
    exit(0);
6
  } else {
7
    printf("Running Parent, PID = %d\n", getpid());
8
    while (1)
9
      ; /* Infinite loop */
10
  }
11
}

1
linux> ./forks 7 &
2
[1] 6639
3
Running Parent, PID = 6639
4
Terminating Child, PID = 6640
5
linux> ps
6
PID TTY TIME CMD
7
6585 ttyp9 00:00:00 tcsh
8
6639 ttyp9 00:00:03 forks
9
6640 ttyp9 00:00:00 forks <defunct>
10
6641 ttyp9 00:00:00 ps
11
linux> kill 6639
12
[1] Terminated
13
linux> ps
14
PID TTY TIME CMD
15
6585 ttyp9 00:00:00 tcsh
16
6642 ttyp9 00:00:00 ps

ps shows child process as “defunct” (i.e., a zombie)
Killing parent allows child to be reaped by init

Non-terminating Child Example#

1
void fork8() {
2
  if (fork() == 0) {
3
    /* Child */
4
    printf("Running Child, PID = %d\n", getpid());
5
    while (1)
6
      ; /* Infinite loop */
7
  } else {
8
    printf("Terminating Parent, PID = %d\n", getpid());
9
    exit(0);
10
  }
11
}

1
linux> ./forks 8
2
Terminating Parent, PID = 6675
3
Running Child, PID = 6676
4
linux> ps
5
PID TTY TIME CMD
6
6585 ttyp9 00:00:00 tcsh
7
6676 ttyp9 00:00:06 forks
8
6677 ttyp9 00:00:00 ps
9
linux> kill 6676
10
linux> ps
11
PID TTY TIME CMD
12
6585 ttyp9 00:00:00 tcsh
13
6678 ttyp9 00:00:00 ps

Child process still active even though parent has terminated
Must kill child explicitly, or else will keep running indefinitely

wait: Synchronizing with Children#

Parent reaps a child by calling the wait function
int wait(int *child_status)
- Suspends current process until one of its children terminates
- Return value is the pid of the child process that terminated
- If child_status != NULL, then the integer it points to will be set to a value that indicates reason the child terminated and the exit status:
  - Checked using macros defined in wait.h
    - WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG, WIFSTOPPED, WSTOPSIG, WIFCONTINUED

wait Example#

1
void fork9() {
2
  int child_status;
3

4
  if (fork() == 0) {
5
    printf("HC: hello from child\n");
6
    exit(0);
7
  } else {
8
    printf("HP: hello from parent\n");
9
    wait(&child_status);
10
    printf("CT: child has terminated\n");
11
  }
12
  printf("Bye\n");
13
}

Another wait Example#

If multiple children completed, will take in arbitrary order
Can use macros WIFEXITED and WEXITSTATUS to get information about exit status

1
void fork10() {
2
  pid_t pid[N];
3
  int i, child_status;
4

5
  for (i = 0; i < N; i++)
6
    if ((pid[i] = fork()) == 0)
7
      exit(100+i); /* Child */
8
  for (i = 0; i < N; i++) { /* Parent */
9
    pid_t wpid = wait(&child_status);
10
  if (WIFEXITED(child_status))
11
    printf("Child %d terminated with exit status %d\n", wpid, WEXITSTATUS(child_status));
12
  else
13
    printf("Child %d terminate abnormally\n", wpid);
14
  }
15
}

waitpid: Waiting for a Specific Process#

pid_t waitpid(pid_t pid, int &status, int options)
- Suspends current process until specific process terminates

1
void fork11() {
2
  pid_t pid[N];
3
  int i;
4
  int child_status;
5

6
  for (i = 0; i < N; i++)
7
    if ((pid[i] = fork()) == 0)
8
      exit(100+i); /* Child */
9
  for (i = N-1; i >= 0; i--) {
10
    pid_t wpid = waitpid(pid[i], &child_status, 0);
11
    if (WIFEXITED(child_status))
12
      printf("Child %d terminated with exit status %d\n", wpid, WEXITSTATUS(child_status));
13
    else
14
      printf("Child %d terminate abnormally\n", wpid);
15
    }
16
}

execve: Loading and Running Programs#

int execve(char *filename, char *argv[], char *envp[])
Loads and runs in the current process:
- Executable file filename
  - Can be object file or script file beginning with #!interpreter
- …with argument list argv
  - By convention argv[0] == filename
- …and environment variable list envp
  - name=value strings (e.g., USER=root)
  - getenv, putenv, printenv
Overwrites code, data, and stack
- Retains PID, open files and signal context
Called once and never returns
- …except if there is an error

Structure of the stack when a new program starts#

Signals#

Linux Process Hierarchy#

TIP
You can view the hierarchy using the pstree command.

Shell Programs#

A shell is an application program that runs programs on behalf of the user
- sh: Original Unix shell (Stephen Bourne, AT&T Bell Labs, 1977)
- csh/tcsh: BSD Unix C shell
- bash: “Bourne-Again” Shell (default Linux shell)
Execution is a sequence of read/evaluate steps

1
int main() {
2
  char cmdline[MAXLINE]; /* command line */
3

4
  while (1) {
5
    /* read */
6
    printf("> ");
7
    fgets(cmdline, MAXLINE, stdin);
8

9
    if (feof(stdin))
10
      exit(0);
11

12
    /* evaluate */
13
    eval(cmdline);
14
  }
15
}

1
void eval(char *cmdline) {
2
  char *argv[MAXARGS]; /* Argument list execve() */
3
  char buf[MAXLINE];   /* Holds modified command line */
4
  int bg;              /* Should the job run in bg or fg? */
5
  pid_t pid;           /* Process id */
6

7
  strcpy(buf, cmdline);
8
  bg = parseline(buf, argv);
9

10
  if (argv[0] == NULL)
11
    return; /* Ignore empty lines */
12

13
  if (!builtin_command(argv)) {
14
    if ((pid = Fork()) == 0) { /* Child runs user job */
15
      if (execve(argv[0], argv, environ) < 0) {
16
        printf("%s: Command not found.\n", argv[0]);
17
        exit(0);
18
      }
19
    }
20

21
    /* Parent waits for foreground job to terminate */
22
    if (!bg) {
23
      int status;
24
      if (waitpid(pid, &status, 0) < 0)
25
        unix_error("waitfg: waitpid error");
26
    }
27
    else
28
      printf("%d %s", pid, cmdline);
29
  }
30
  return;
31
}

Problem with Simple Shell Example#

Our example shell correctly waits for and reaps foreground jobs
But what about background jobs ?
- Will become zombies when they terminate
- Will never be reaped because shell (typically) will not terminate
- Will create a memory leak that could run the kernel out of memory

Solution: Exceptional Control Flow#

The kernel will interrupt regular processing to alert us when a background process completes
In Unix, the alert mechanism is called a signal

Signals#

A signal is a small message that notifies a process that an event of some type has occurred in the system
- Akin to exceptions and interrupts
- Sent from the kernel (sometimes at the request of another process) to a process
- Signal type is identified by small integer ID’s (1-30)
- Only information in a signal is its ID and the fact that it arrived

ID	Name	Default Action	Corresponding Event
2	SIGINT	Terminate	User typed Ctrl-C
9	SIGKILL	Terminate	Kill program (cannot override or ignore)
11	SIGSEGV	Terminate & Dump	Segmentation violation
14	SIGALARM	Terminate	Timer signal
17	SIGCHLD	Ignore	Child stopped or terminated

Sending a Signal#

Kernel sends (delivers) a signal to a destination process by updating some state in the context of the destination process
Kernel sends a signal for one of the following reasons:
- Kernel has detected a system event such as divide-by-zero (SIGFPE) or the termination of a child process (SIGCHLD)
- Another process has invoked the kill system call to explicitly request the kernel to send a signal to the destination process

Receiving a Signal#

A destination process receives a signal when it is forced by the kernel to react in some way to the delivery of the signal
Some possible ways to react:
- Ignore the signal (do nothing)
- Terminate the process (with optional core dump)
- Catch the signal by executing a user-level function called signal handler
  - Akin to a hardware exception handler being called in response to an asynchronous interrupt:

Pending and Blocked Signals#

A signal is pending if sent but not yet received
- There can be at most one pending signal of any particular type
- Signals are not queued
  - For each signal type, one bit indicates whether or not signal is pending
  - Thus at most one pending signal of any particular type
  - Then subsequent signals of the same type that are sent to that process are discarded
A process can block the receipt of certain signals
- Blocked signals can be delivered, but will not be received until the signal is unblocked
A pending signal is received at most once

Pending/Blocked Bits#

Kernel maintains pending and blocked bit vectors in the context of each process
- pending: represents the set of pending signals
  - Kernel sets bit $k$ in pending when a signal of type $k$ is delivered
  - Kernel clears bit $k$ in pending when a signal of type $k$ is received
- blocked: represents the set of blocked signals
  - Can be set and cleared by using the sigprocmask function
  - Also referred to as the signal mask

Sending Signals#

Process Groups#

Every process belongs to exactly one process group

getpgrp() Return process group of current process
setpgid() Change process group of a process

/bin/kill Program#

/bin/kill program sends arbitrary signal to a process or process group
Examples
- /bin/kill –9 24818 Send SIGKILL to process 24818
- /bin/kill –9 –24817 Send SIGKILL to every process in process group 24817

1
linux> ./forks 16
2
Child1: pid=24818 pgrp=24817
3
Child2: pid=24819 pgrp=24817
4
linux> ps
5
PID TTY TIME CMD
6
24788 pts/2 00:00:00 tcsh
7
24818 pts/2 00:00:02 forks
8
24819 pts/2 00:00:02 forks
9
24820 pts/2 00:00:00 ps
10
linux> /bin/kill -9 -24817
11
linux> ps
12
PID TTY TIME CMD
13
24788 pts/2 00:00:00 tcsh
14
24823 pts/2 00:00:00 ps
15
linux>

From the Keyboard#

Typing Ctrl-C (Ctrl-Z) causes the kernel to send a SIGINT (SIGTSTP) to every job in the foreground process group
- SIGINT – default action is to terminate each process
- SIGTSTP – default action is to stop (suspend) each process

Example of Ctrl-C and Ctrl-Z#

STAT (process state) Legend:
- First letter:
  - S: Sleeping
  - T: Stopped
  - R: Running
- Second letter:
  - s: Session leader
  - +: Foreground proc group

1
bluefish> ./forks 17
2
Child: pid=28108 pgrp=28107
3
Parent: pid=28107 pgrp=28107
4
<types ctrl-z>
5
Suspended
6
bluefish> ps w
7
PID TTY STAT TIME COMMAND
8
27699 pts/8 Ss 0:00 -tcsh
9
28107 pts/8 T 0:01 ./forks 17
10
28108 pts/8 T 0:01 ./forks 17
11
28109 pts/8 R+ 0:00 ps w
12
bluefish> fg
13
./forks 17
14
<types ctrl-c>
15
bluefish> ps w
16
PID TTY STAT TIME COMMAND
17
27699 pts/8 Ss 0:00 -tcsh
18
28110 pts/8 R+ 0:00 ps w

kill Function#

1
void fork12() {
2
  pid_t pid[N];
3
  int i;
4
  int child_status;
5

6
  for (i = 0; i < N; i++)
7
    if ((pid[i] = fork()) == 0) {
8
      /* Child: Infinite Loop */
9
      while(1)
10
        ;
11
    }
12

13
  for (i = 0; i < N; i++) {
14
    printf("Killing process %d\n", pid[i]);
15
    kill(pid[i], SIGINT);
16
  }
17

18
  for (i = 0; i < N; i++) {
19
    pid_t wpid = wait(&child_status);
20
    if (WIFEXITED(child_status))
21
      printf("Child %d terminated with exit status %d\n", wpid, WEXITSTATUS(child_status));
22
    else
23
      printf("Child %d terminated abnormally\n", wpid);
24
  }
25
}

Receiving Signals#

Suppose kernel is returning from an exception handler and is ready to pass control to process $p$

IMPORTANT
All context switches are initiated by calling some exception handler.

Kernel computes pnb = pending & ~blocked
- The set of pending nonblocked signals for process $p$
If pnb == 0
- Pass control to next instruction in the logical flow for $p$
Else
- Choose least nonzero bit $k$ in pnb and force process $p$ to receive signal $k$
- The receipt of the signal triggers some action by $p$
- Repeat for all nonzero $k$ in pnb
- Pass control to next instruction in logical flow for $p$

Default Actions#

Each signal type has a predefined default action, which is one of:
- The process terminates
- The process terminates and dumps core
- The process stops until restarted by a SIGCONT signal
- The process ignores the signal

Installing Signal Handlers#

The signal function modifies the default action associated with the receipt of signal signum:
- handler_t *signal(int signum, handler_t *handler)
Different values for handler:
- SIG_IGN: Ignore signals of type signum
- SIG_DFL: Revert to the default action on receipt of signals of type signum
- Otherwise, handler is the address of a user-level signal handler
  - Called when process receives signal of type signum
  - Referred to as “installing” the handler
  - Executing handler is called “catching” or “handling” the signal
  - When the handler executes its return statement, control passes back to instruction in the control flow of the process that was interrupted by receipt of the signal

1
/* SIGINT handler */
2
void sigint_handler(int sig) {
3
  printf("So you think you can stop the bomb with ctrl-c, do you?\n");
4
  sleep(2);
5
  printf("Well...");
6
  fflush(stdout);
7
  sleep(1);
8
  printf("OK. :-)\n");
9
  exit(0);
10
}
11

12
int main() {
13
  /* Install the SIGINT handler */
14
  if (signal(SIGINT, sigint_handler) == SIG_ERR)
15
    unix_error("signal error");
16

17
  /* Wait for the receipt of a signal */
18
  pause();
19

20
  return 0;
21
}

Signals Handlers as Concurrent Flows#

A signal handler is a separate logical flow (not process) that runs concurrently with the main program

Nested Signal Handlers#

Handlers can be interrupted by other handlers

Blocking and Unblocking Signals#

Implicit blocking mechanism
- Kernel blocks any pending signals of type currently being handled
- E.g., A SIGINT handler can’t be interrupted by another SIGINT
Explicit blocking and unblocking mechanism
- sigprocmask function
Supporting functions
- sigemptyset – Create empty set
- sigfillset – Add every signal number to set
- sigaddset – Add signal number to set
- sigdelset – Delete signal number from set

Temporarily Blocking Signals#

1
sigset_t mask, prev_mask;
2

3
sigemptyset(&mask);
4
sigaddset(&mask, SIGINT);
5

6
/* Block SIGINT and save previous blocked set */
7
sigprocmask(SIG_BLOCK, &mask, &prev_mask);
8

9
/* Code region that will not be interrupted by SIGINT */
10

11
/* Restore previous blocked set, unblocking SIGINT */
12
sigprocmask(SIG_SETMASK, &prev_mask, NULL);

Safe Signal Handling Guidelines#

Handlers are tricky because they are concurrent with main program and share the same global data structures. Shared data structures can become corrupted.

G0: Keep your handlers as simple as possible
- E.g., Set a global flag and return
G1: Call only async-signal-safe functions in your handlers
- printf, sprintf, malloc, and exit are not safe !
G2: Save and restore errno on entry and exit
- So that other handlers don’t overwrite your value of errno
G3: Protect accesses to shared data structures by temporarily blocking all signals
- To prevent possible corruption
G4: Declare global variables as volatile
- To prevent compiler from storing them in a register
G5: Declare global flags as volatile sig_atomic_t
- flag: variable that is only read or written (e.g. flag = 1, not flag++)
- Flag declared this way does not need to be protected like other globals

Async-Signal-Safety#

Function is async-signal-safe if either reentrant (e.g., all variables stored on stack frame) or non-interruptible by signals
Posix guarantees 117 functions to be async-signal-safe
- Source: man 7 signal
- Popular functions on the list:
  - _exit, write, wait, waitpid, sleep, kill
- Popular functions that are not on the list:
  - printf, sprintf, malloc, exit
  - Unfortunate fact: write is the only async-signal-safe output function

Correct Signal Handling Example#

You can’t use signals to count events, such as children terminating.

1
#define N 5
2

3
int ccount = 0;
4

5
void child_handler(int sig) {
6
  int olderrno = errno;
7
  pid_t pid;
8

9
  if ((pid = wait(NULL)) < 0)
10
    Sio_error("wait error");
11

12
  ccount--;
13
  Sio_puts("Handler reaped child ");
14
  Sio_putl((long)pid);
15
  Sio_puts(" \n");
16
  sleep(1);
17
  errno = olderrno;
18
}
19

20
void fork14() {
21
  pid_t pid[N];
22
  int i;
23
  ccount = N;
24

25
  signal(SIGCHLD, child_handler);
26

27
  for (i = 0; i < N; i++) {
28
    if ((pid[i] = fork()) == 0) {
29
      sleep(1);
30
      exit(0); /* Child exits */
31
    }
32
  }
33

34
  while (ccount > 0) /* Parent spins */
35
    ;
36
}

1
whaleshark> ./forks 14
2
Handler reaped child 23240
3
Handler reaped child 23241

Must wait for all terminated child processes. Put wait in a loop to reap all terminated children.

1
void child_handler2(int sig) {
2
  int olderrno = errno;
3
  pid_t pid;
4

5
  while ((pid = wait(NULL)) > 0) {
6
    ccount--;
7
    Sio_puts("Handler reaped child ");
8
    Sio_putl((long)pid);
9
    Sio_puts(" \n");
10
  }
11
  if (errno != ECHILD)
12
    Sio_error("wait error");
13
  errno = olderrno;
14
}

1
whaleshark> ./forks 15
2
Handler reaped child 23246
3
Handler reaped child 23247
4
Handler reaped child 23248
5
Handler reaped child 23249
6
Handler reaped child 23250
7
whaleshark>

Portable Signal Handling#

Ugh ! Different versions of Unix can have different signal handling semantics
- Some older systems restore action to default after catching signal
- Some interrupted system calls can return with errno == EINTR
- Some systems don’t block signals of the type being handled
Solution: sigaction

1
handler_t *signal(int signum, handler_t *handler) {
2
  struct sigaction action, old_action;
3

4
  action.sa_handler = handler;
5
  sigemptyset(&action.sa_mask); /* Block sigs of type being handled */
6
  action.sa_flags = SA_RESTART; /* Restart syscalls if possible */
7

8
  if (sigaction(signum, &action, &old_action) < 0)
9
    unix_error("Signal error");
10
  return (old_action.sa_handler);
11
}

Synchronizing Flows to Avoid Races#

Simple shell with a subtle synchronization error because it assumes parent runs before child

1
int main(int argc, char **argv) {
2
  int pid;
3
  sigset_t mask_all, prev_all;
4

5
  sigfillset(&mask_all);
6
  signal(SIGCHLD, handler);
7
  initjobs(); /* Initialize the job list */
8

9
  while (1) {
10
    if ((pid = Fork()) == 0) { /* Child */
11
      execve("/bin/date", argv, NULL);
12
    }
13
    sigprocmask(SIG_BLOCK, &mask_all, &prev_all); /* Parent */
14
    addjob(pid); /* Add the child to the job list */
15
    sigprocmask(SIG_SETMASK, &prev_all, NULL);
16
  }
17
  exit(0);
18
}

1
void handler(int sig) {
2
  int olderrno = errno;
3
  sigset_t mask_all, prev_all;
4
  pid_t pid;
5

6
  sigfillset(&mask_all);
7
  while ((pid = waitpid(-1, NULL, 0)) > 0) { /* Reap child */
8
    sigprocmask(SIG_BLOCK, &mask_all, &prev_all);
9
    deletejob(pid); /* Delete the child from the job list */
10
    sigprocmask(SIG_SETMASK, &prev_all, NULL);
11
  }
12
  if (errno != ECHILD)
13
    Sio_error("waitpid error");
14
  errno = olderrno;
15
}

Corrected Shell Program without Race#

1
int main(int argc, char **argv) {
2
  int pid;
3
  sigset_t mask_all, mask_one, prev_one;
4

5
  sigfillset(&mask_all);
6
  sigemptyset(&mask_one);
7
  sigaddset(&mask_one, SIGCHLD);
8
  signal(SIGCHLD, handler);
9
  initjobs(); /* Initialize the job list */
10

11
  while (1) {
12
    sigprocmask(SIG_BLOCK, &mask_one, &prev_one); /* Block SIGCHLD */
13
    if ((pid = Fork()) == 0) { /* Child process */
14
      sigprocmask(SIG_SETMASK, &prev_one, NULL); /* Unblock SIGCHLD */
15
      execve("/bin/date", argv, NULL);
16
    }
17
    sigprocmask(SIG_BLOCK, &mask_all, NULL); /* Parent process */
18
    addjob(pid); /* Add the child to the job list */
19
    sigprocmask(SIG_SETMASK, &prev_one, NULL); /* Unblock SIGCHLD */
20
  }
21
  exit(0);
22
}

Explicitly Waiting for Signals#

Handlers for program explicitly waiting for SIGCHLD to arrive

1
volatile sig_atomic_t pid;
2

3
void sigchld_handler(int s) {
4
  int olderrno = errno;
5
  pid = waitpid(-1, NULL, 0); /* Main is waiting for nonzero pid */
6
  errno = olderrno;
7
}
8

9
void sigint_handler(int s) {}

Similar to a shell waiting for a foreground job to terminate:

1
int main(int argc, char **argv) {
2
  sigset_t mask, prev;
3

4
  signal(SIGCHLD, sigchld_handler);
5
  signal(SIGINT, sigint_handler);
6
  sigemptyset(&mask);
7
  sigaddset(&mask, SIGCHLD);
8

9
  while (1) {
10
    sigprocmask(SIG_BLOCK, &mask, &prev); /* Block SIGCHLD */
11
    if (fork() == 0) /* Child */
12
      exit(0);
13

14
    /* Parent */
15
    pid = 0;
16
    sigprocmask(SIG_SETMASK, &prev, NULL); /* Unblock SIGCHLD */
17

18
    /* Wait for SIGCHLD to be received (wasteful!) */
19
    while (!pid)
20
      ;
21

22
    /* Do some work after receiving SIGCHLD */
23
    printf(".");
24
  }
25
  exit(0);
26
}

Program is correct, but very wasteful

Other options:

1
while (!pid) /* Race! */
2
  pause();

1
while (!pid) /* Too slow! */
2
  sleep(1);

Solution: sigsuspend

Waiting for Signals with sigsuspend#

int sigsuspend(const sigset_t *mask)
- It’ll temporarily use const sigset_t *mask instead of currently signal block mask, then hang on program. Once catch signal, revert the signal block mask to original one

1
int main(int argc, char **argv) {
2
  sigset_t mask, prev;
3

4
  signal(SIGCHLD, sigchld_handler);
5
  signal(SIGINT, sigint_handler);
6
  sigemptyset(&mask);
7
  sigaddset(&mask, SIGCHLD);
8

9
  while (1) {
10
    sigprocmask(SIG_BLOCK, &mask, &prev); /* Block SIGCHLD */
11
    if (fork() == 0) /* Child */
12
      exit(0);
13

14
    /* Wait for SIGCHLD to be received */
15
    pid = 0;
16
    while (!pid)
17
      sigsuspend(&prev);
18

19
    /* Optionally unblock SIGCHLD */
20
    sigprocmask(SIG_SETMASK, &prev, NULL);
21
    /* Do some work after receiving SIGCHLD */
22
    printf(".");
23
  }
24
  exit(0);
25
}

Nonlocal Jumps#

Powerful (but dangerous) user-level mechanism for transferring control to an arbitrary location
- Controlled to way to break the procedure call / return discipline
- Useful for error recovery and signal handling
int setjmp(jmp_buf buf)
- Must be called before longjmp
- Identifies a return site for a subsequent longjmp
- Called once, returns one or more times
- Implementation:
  - Remember where you are by storing the current register context, stack pointer, and PC value in buf
  - Return 0
int sigsetjmp(sigjmp_buf buf, int save);
- Similar as setjmp, save indicates whether save signal block mask (can only be 0 (don’t save) or 1 (save))
void longjmp(jmp_buf buf, int val)
- Meaning:
  - return from the setjmp remembered by jump buffer buf again…
  - …this time returning val instead of 0 (if val is 0, force return 1)
- Called after setjmp
- Called once, but never returns
- Implementation:
  - Restore register context, stack pointer, base pointer, PC value from jump buffer buf
  - Set %eax to val
  - Jump to the location indicated by the PC stored in jump buf buf
void siglongjmp(sigjmp_buf buf, int val);
- Same as longjmp, just using with sigsetjmp

setjmp/longjmp Example#

Goal: return directly to original caller from a deeply-nested function

1
jmp_buf buf;
2

3
int error1 = 0;
4
int error2 = 1;
5

6
void foo(void), bar(void);
7

8
int main() {
9
  switch(setjmp(buf)) {
10
  case 0:
11
    foo();
12
    break;
13
  case 1:
14
    printf("Detected an error1 condition in foo\n");
15
    break;
16
  case 2:
17
    printf("Detected an error2 condition in foo\n");
18
    break;
19
  default:
20
    printf("Unknown error condition in foo\n");
21
  }
22
  exit(0);
23
}
24

25
/* Deeply nested function foo */
26
void foo(void) {
27
  if (error1)
28
    longjmp(buf, 1);
29
  bar();
30
}
31

32
void bar(void) {
33
  if (error2)
34
    longjmp(buf, 2);
35
}
36

37
// Output: Detected an error2 condition in foo

sigsetjmp/siglongjmp Example#

This program restarts itself when Ctrl-C’d:

1
sigjmp_buf buf;
2

3
void handler(int sig) {
4
  siglongjmp(buf, 1);
5
}
6

7
int main() {
8
  if (!sigsetjmp(buf, 1)) {
9
    signal(SIGINT, handler);
10
    Sio_puts("starting\n");
11
  }
12
  else
13
    Sio_puts("restarting\n");
14

15
  while(1) {
16
    sleep(1);
17
    Sio_puts("processing...\n");
18
  }
19
  exit(0); /* Control never reaches here */
20
}

Limitations of Nonlocal Jumps#

Works within stack discipline
- Can only long jump to environment of function that has been called but not yet completed

1
jmp_buf env;
2

3
P1() {
4
  if (setjmp(env)) {
5
    /* Long Jump to here */
6
  } else {
7
    P2();
8
  }
9
}
10
P2() { . . . P2(); . . . P3(); }
11
P3() {
12
  longjmp(env, 1);
13
}

1
jmp_buf env;
2

3
P1() {
4
  P2();
5
  P3();
6
}
7

8
P2() {
9
  if (setjmp(env)) {
10
    /* Long Jump to here */
11
  }
12
}
13

14
P3() {
15
  longjmp(env, 1);
16
}

Virtual Memory#

TODO

Dynamic Memory Allocation#

Assumptions for following examples:
- Memory is word addressed
- Words are int-sized

Basic Concepts#

Programmers use dynamic memory allocators (such as malloc) to acquire virtual memory at runtime
- For data structures whose size is only known at runtime
Dynamic memory allocators manage an area of process virtual memory known as the heap

Allocator maintains heap as collection of variable sized blocks, which are either allocated or free
Types of allocators
- Explicit allocator: application allocates and frees space
  - E.g., malloc and free in C
- Implicit allocator: application allocates, but does not free space
  - E.g. garbage collection in Java, ML, and Lisp

The malloc Package#

void *malloc(size_t size)
- Successful:
  - Returns a pointer to a memory block of at least size bytes aligned to an 8-byte (x86) or 16-byte (x86-64) boundary
  - If size == 0, returns NULL
- Unsuccessful: returns NULL (0) and sets errno
void free(void *p)
- Returns the block pointed at by p to pool of available memory
- p must come from a previous call to malloc or realloc
Other functions
- calloc: Version of malloc that initializes allocated block to zero
- realloc: Changes the size of a previously allocated block
- sbrk: Used internally by allocators to grow or shrink the heap

1
#include <stdio.h>
2
#include <stdlib.h>
3

4
void foo(int n) {
5
  int i, *p;
6

7
  /* Allocate a block of n ints */
8
  p = (int *)malloc(n * sizeof(int));
9
  if (p == NULL) {
10
    perror("malloc");
11
    exit(0);
12
  }
13

14
  /* Initialize allocated block */
15
  for (i = 0; i < n; i++)
16
    p[i] = i;
17

18

19
  /* Return allocated block to the heap */
20
  free(p);
21
}

Constrains#

Applications
- Can issue arbitrary sequence of malloc and free requests
- free request must be to a malloc’d block
Allocators
- Can’t control number or size of allocated blocks
- Must respond immediately to malloc requests
  - i.e., can’t reorder or buffer requests
- Must allocate blocks from free memory
  - i.e., can only place allocated blocks in free memory
- Must align blocks so they satisfy all alignment requirements
  - 8-byte (x86) or 16-byte (x86-64) alignment on Linux boxes
- Can manipulate and modify only free memory
- Can’t move the allocated blocks once they are malloc’d
  - i.e., compaction is not allowed

Performance Goal#

Goals: maximize throughput and peak memory utilization
- These goals are often conflicting

Throughput#

Given some sequence of malloc and free requests:
- $R_{0} ,R_{1} ,...,R_{k} ,...,R_{n-1}$
Throughput
- Number of completed requests per unit time
- Example:
  - 5,000 malloc calls and 5,000 free calls in 10 seconds
  - Throughput is 1,000 operations/second

Peak Memory Utilization#

Given some sequence of malloc and free requests:
- $R_{0} ,R_{1} ,...,R_{k} ,...,R_{n-1}$
Def: Aggregate payload $P_{k}$ $P_{k}$
- malloc(p) results in a block with a payload of p bytes
- After request $R_{k}$ has completed, the aggregate payload $P_{k}$ is the sum of currently allocated payloads
Def: Current heap size $H_{k}$ $H_{k}$
- Assume $H_{k}$ $H_{k}$ is monotonically nondecreasing
  - i.e., heap only grows when allocator uses sbrk
Def: Peak memory utilization after $k+1$ $k + 1$ requests
- $\displaystyle U_{k} =( max_{i\leqslant k} \ P_{i}) /H_{k}$

Fragmentation#

Poor memory utilization caused by fragmentation
- internal fragmentation
- external fragmentation

Internal Fragmentation#

For a given block, internal fragmentation occurs if payload is smaller than block size

Caused by
- Overhead of maintaining heap data structures
- Padding for alignment purposes
- Explicit policy decisions
  - E.g., to return a big block to satisfy a small request
Depends only on the pattern of previous requests
- Thus, easy to measure

External Fragmentation#

Occurs when there is enough aggregate heap memory, but no single free block is large enough

Depends on the pattern of future requests
- Thus, difficult to measure

Implementation Issues#

How do we know how much memory to free given just a pointer ?
How do we keep track of the free blocks ?
What do we do with the extra space when allocating a structure that is smaller than the free block it is placed in ?
How do we pick a block to use for allocation — many might fit ?
How do we reinsert freed block ?

Knowing How Much to Free#

Standard method
- Keep the length of a block in the word preceding the block
  - This word is often called the header field or header
- Requires an extra word for every allocated block

Keeping Track of Free Blocks#

Method 1: Implicit list using length-links all blocks

Method 2: Explicit list among the free blocks using pointers

Method 3: Segregated free list
- Different free lists for different size classes
Method 4: Blocks sorted by size
- Can use a balanced tree (e.g. Red-Black tree) with pointers within each free block, and the length used as a key

Implicit List#

For each block we need both size and allocation status
- Could store this information in two words: wasteful !
Standard trick
- If blocks are aligned, some low-order address bits are always 0
- Instead of storing an always-0 bit, use it as a allocated/free flag
- When reading size word, must mask out this bit

Detailed Implicit Free List Example#

Payload must be double-word aligned

We have an internal fragmentation in first allocated block because of payload is only 2 words but allocated 4 words

Finding a Free Block#

First fit
- Search list from beginning, choose first free block that fits:

1
p = start;
2
while ((p < end) &&  // not passed end
3
      ((*p & 1) ||   // already allocated
4
      (*p <= len)))  // too small
5
  p = p + (*p & -2); // goto next block (word addressed)

Can take linear time in total number of blocks (allocated and free)
In practice it can cause “splinters” at beginning of list
Next fit
- Like first fit, but search list starting where previous search finished
- Should often be faster than first fit: avoids re-scanning unhelpful blocks
- Some research suggests that fragmentation is worse
Best fit
- Search the list, choose the best free block: fits, with fewest bytes left over
- Keeps fragments small — usually improves memory utilization
- Will typically run slower than first fit

Allocating in Free Block#

Splitting
- Since allocated space might be smaller than free space, we might want to split the block

1
void addblock(ptr p, int len) {
2
  int newsize = ((len + 1) >> 1) << 1; // round up to even
3
  int oldsize = *p & -2;               // mask out low bit
4
  *p = newsize | 1;                    // set new length
5
  if (newsize < oldsize)
6
    *(p+newsize) = oldsize - newsize;  // set length in remaining
7
}

Freeing a Block#

Simplest implementation
- Need only clear the allocated flag
  - void free_block(ptr p) { *p = *p & -2 }
- But can lead to “false fragmentation”

There is enough free space, but the allocator won’t be able to find it

Coalescing#

Join (coalesce) with next/previous blocks, if they are free
- Coalescing with next block

1
void free_block(ptr p) {
2
  *p = *p & -2;         // clear allocated flag
3
  next = p + *p;        // find next block
4
  if ((*next & 1) == 0)
5
    *p = *p + *next;    // add to this block if
6
}                       // not allocated

But how do we coalesce with previous block ?

Bidirectional Coalescing#

Boundary tags [Knuth73]
- Replicate size/allocated word at “bottom” (end) of free blocks
- Allows us to traverse the “list” backwards, but requires extra space
- Important and general technique !

Constant Time Coalescing#

Case 1#

Case 2#

Case 3#

Case 4#

Disadvantages of Boundary Tags#

Internal fragmentation
malloc’d blocks don’t need the footer tag

Implicit Lists Summary#

Implementation: very simple
Allocate cost:
- Linear time worst case
Free cost:
- Constant time worst case
- Even with coalescing
Memory usage:
- Will depend on placement policy
- First-fit, next-fit or best-fit
Not used in practice for malloc/free because of linear-time allocation
- Used in many special purpose applications
However, the concepts of splitting and boundary tag coalescing are general to all allocators

Explicit List#

Explicit Free Lists#

Maintain list(s) of free blocks, not all blocks
- The “next” free block could be anywhere
  - So we need to store forward/back pointers, not just sizes
- Still need boundary tags for coalescing
- Luckily we track only free blocks, so we can use payload area
Logically

Physically
- Blocks can be in any order

Allocating From Explicit Free Lists#

Freeing With Explicit Free Lists#

Insertion policy
- LIFO (last-in-first-out) policy
  - Insert freed block at the beginning of the free list
  - Pro: Simple and constant time
  - Con: Studies suggest fragmentation is worse than address ordered
- Address-ordered policy
  - Insert freed blocks so that free list blocks are always in address order: addr(prev) < addr(curr) < addr(next)
  - Pro: Studies suggest fragmentation is lower than LIFO
  - Con: Requires search

Freeing With a LIFO Policy (Case 1)#

Insert the freed block at the root of the list

Freeing With a LIFO Policy (Case 2)#

Splice out successor block, coalesce both memory blocks and insert the new block at the root of the list

Freeing With a LIFO Policy (Case 3)#

Splice out predecessor block, coalesce both memory blocks, and insert the new block at the root of the list

Freeing With a LIFO Policy (Case 4)#

Splice out predecessor and successor blocks, coalesce all 3 memory blocks and insert the new block at the root of the list

Explicit List Summary#

Comparison to implicit list:
- Allocate is linear time in number of free blocks instead of all blocks
  - Much faster when most of the memory is full
- Slightly more complicated allocate and free since needs to splice blocks in and out of the list
- Some extra space for the links (2 extra words needed for each block)
Most common use of linked lists is in conjunction with segregated free lists
- Keep multiple linked lists of different size classes, or possibly for different types of objects

Segregated Free List#

Each size class of blocks has its own free list

Often have separate classes for each small size
For larger sizes: One class for each two-power size

Seglist Allocator#

Given an array of free lists, each one for some size class
To allocate a block of size n:
- Search appropriate free list for block of size m > n
- If an appropriate block is found:
  - Split block and place fragment on appropriate list (optional)
- If no block is found, try next larger class
- Repeat until block is found
If no block is found:
- Request additional heap memory from OS (using sbrk())
- Allocate block of n bytes from this new memory
- Place remainder as a single free block in largest size class
To free a block:
- Coalesce and place on appropriate list
Advantages of seglist allocators
- Higher throughput
  - log time for power-of-two size classes
- Better memory utilization
  - First-fit search of segregated free list approximates a best-fit search of entire heap
  - Extreme case: Giving each block its own size class is equivalent to best-fit

Summary of Key Allocator Policies#

Placement policy
- First-fit, next-fit, best-fit, etc
- Trades off lower throughput for less fragmentation
- Interesting observation: segregated free lists
  - Approximate a best fit placement policy without having to search entire free list
Splitting policy
- When do we go ahead and split free blocks ?
- How much internal fragmentation are we willing to tolerate ?
Coalescing policy
- Immediate coalescing: Coalesce each time free is called
- Deferred coalescing: Try to improve performance of free by deferring coalescing until needed. Examples:
  - Coalesce as you scan the free list for malloc
  - Coalesce when the amount of external fragmentation reaches some threshold

Garbage Collection#

Automatic reclamation of heap-allocated storage — application never has to free

1
void foo() {
2
  int *p = malloc(128);
3
  return; /* p block is now garbage */
4
}

Common in many dynamic languages:
- Python, Ruby, Java, Perl, ML, Lisp, Mathematica
Variants (“conservative” garbage collectors) exist for C and C++
- However, cannot necessarily collect all garbage
How does the memory manager know when memory can be freed ?
- In general we cannot know what is going to be used in the future since it depends on conditionals
- But we can tell that certain blocks cannot be used if there are no pointers to them
Must make certain assumptions about pointers
- Memory manager can distinguish pointers from non-pointers
- All pointers point to the start of a block
- Cannot hide pointers (e.g., by coercing them to an int, and then back again)

Classical GC Algorithms#

Mark-and-sweep collection (McCarthy, 1960)
- Does not move blocks (unless you also “compact”)
Reference counting (Collins, 1960)
- Does not move blocks
Copying collection (Minsky, 1963)
- Moves blocks
Generational Collectors (Lieberman and Hewitt, 1983)
- Collection based on lifetimes
  - Most allocations become garbage very soon
  - So focus reclamation work on zones of memory recently allocated

Memory as a Graph#

We view memory as a directed graph
- Each block is a node in the graph
- Each pointer is an edge in the graph
- Locations not in the heap that contain pointers into the heap are called root nodes (e.g. registers, locations on the stack, global variables)

A node (block) is reachable if there is a path from any root to that node
Non-reachable nodes are garbage (cannot be needed by the application)

Mark and Sweep Collecting#

Can build on top of malloc/free package
- Allocate using malloc until you “run out of space”
When out of space:
- Use extra mark bit in the head of each block
- Mark: Start at roots and set mark bit on each reachable block
- Sweep: Scan all blocks and free blocks that are not marked

Assumptions For a Simple Implementation#

Application
- new(n): returns pointer to new block with all locations cleared
- read(b, i): read location i of block b into register
- write(b, i, v): write v into location i of block b
Each block will have a header word
- Addressed as b[-1], for a block b
- Used for different purposes in different collectors
Instructions used by the Garbage Collector
- is_ptr(p): determines whether p is a pointer
- length(b): returns the length of block b, not including the header
- get_roots(): returns all the roots

Example#

Mark using depth-first traversal of the memory graph

1
ptr mark(ptr p) {
2
  if (!is_ptr(p)) return;       // do nothing if not pointer
3
  if (markBitSet(p)) return;    // check if already marked
4
  setMarkBit(p);                // set the mark bit
5
  for (i=0; i < length(p); i++) // call mark on all words
6
    mark(p[i]);                 // in the block
7
  return;
8
}

Sweep using lengths to find next block

1
ptr sweep(ptr p, ptr end) {
2
  while (p < end) {
3
    if markBitSet(p)
4
      clearMarkBit();
5
    else if (allocateBitSet(p))
6
      free(p);
7
    p += length(p);
8
}

Conservative Mark & Sweep in C#

A “conservative garbage collector” for C programs
- is_ptr() determines if a word is a pointer by checking if it points to an allocated block of memory
- But, in C pointers can point to the middle of a block

So how to find the beginning of the block ?
- Can use a balanced binary tree to keep track of all allocated blocks (key is start-of-block)
- Balanced-tree pointers can be stored in header (use two additional words)

System-Level I/O#

Unix I/O#

A Linux file is a sequence of m bytes
All I/O devices are represented as files
Even the kernel is represented as a file
- /boot/vmlinuz-3.13.0-55-generic: kernel image
- /proc: kernel data structures

File Types#

Each file has a type indicating its role in the system
- Regular file: Contains arbitrary data
- Directory: Index for a related group of files
- Socket: For communicating with a process on another machine
And more file types…
- Named pipes (FIFOs)
- Symbolic links
- Character and block devices
- …

Regular Files#

A regular file contains arbitrary data
Applications open distinguish between text files and binary files
- Text files are regular files with only ASCII or Unicode characters
- Binary files are everything else
  - e.g., object files, JPEG images
- Kernel doesn’t know the difference !
Text file is sequence of text lines
- Text line is sequence of chars terminated by newline char \n
  - Newline is 0x0a, same as ASCII line feed character LF
End Of Line (EOL) indicators in different systems:
- Linux and Mac OS: \n (0x0a)
  - Line Feed (LF)
- Windows and Internet protocols: \r\n (0x0d 0x0a)
  - Carriage return (CR) followed by line feed (LF)

Directories#

Directory consists of an array of links
- Each link maps a filename to a file
Each directory contains at least two entries
- . (dot) is a link to itself
- .. (dot dot) is a link to the parent directory

Directory Hierarchy#

All files are organized as a hierarchy anchored by root directory named / (slash)
Kernel maintains current working directory (cwd) for each process

Opening Files#

Opening a file informs the kernel that you are getting ready to access that file

1
int fd; /* file descriptor */
2

3
if ((fd = open("/etc/hosts", O_RDONLY)) < 0) {
4
  perror("open");
5
  exit(1);
6
}

Returns a small identifying integer file descriptor
- fd == -1 indicates that an error occurred
Each process created by a linux shell begins life with three open files associated with a terminal:
- 0: standard input (stdin)
- 1: standard output (stdout)
- 2: standard error (stderr)

Closing Files#

Closing a file informs the kernel that you are finished accessing that file

1
int fd;     /* file descriptor */
2
int retval; /* return value */
3

4
if ((retval = close(fd)) < 0) {
5
  perror("close");
6
  exit(1);
7
}

Closing an already closed file is a recipe for disaster in threaded programs
Moral: Always check return codes, even for seemingly benign functions such as close()

Reading Files#

Reading a file copies bytes from the current file position to memory, and then updates file position

1
char buf[512];
2
int fd;     /* file descriptor */
3
int nbytes; /* number of bytes read */
4

5
/* Open file fd ... */
6
/* Then read up to 512 bytes from file fd */
7
if ((nbytes = read(fd, buf, sizeof(buf))) < 0) {
8
  perror("read");
9
  exit(1);
10
}

Returns number of bytes read from file fd into buf

Writing Files#

Writing a file copies bytes from memory to the current file position, and then updates current file position

1
char buf[512];
2
int fd;     /* file descriptor */
3
int nbytes; /* number of bytes read */
4

5
/* Open the file fd ... */
6
/* Then write up to 512 bytes from buf to file fd */
7
if ((nbytes = write(fd, buf, sizeof(buf)) < 0) {
8
  perror("write");
9
  exit(1);
10
}

Returns number of bytes written from buf to file fd

Short Counts#

Short counts can occur in these situations:
- Encountering (end-of-file) EOF on reads
- Reading text lines from a terminal
- Reading and writing network sockets
Short counts never occur in these situations:
- Reading from disk files (except for EOF)
- Writing to disk files
Best practice is to always allow for short counts

1
ssize_t readn(int fd, void *buf, size_t size) {
2
  char *ptr = buf;
3
  size_t nleft = size;
4
  ssize_t nread;
5

6
  while (nleft > 0) {
7
    nread = read(fd, ptr, nleft);
8

9
    if (nread < 0)
10
      return -1; // error
11
    if (nread == 0)
12
      break; // EOF
13

14
    nleft -= nread;
15
    ptr += nread;
16
  }
17

18
  return (size - nleft);
19
}

Buffered I/O#

Motivation#

Applications open read/write one character at a time
- Read line of text one character at a time, stopping at newline
Implementing as Unix I/O calls expensive
- read and write require Unix kernel calls
  - $> 10000$ clock cycles
Solution: Buffered read
- Use Unix read to grab block of bytes
- User input functions take one byte at a time from buffer
  - Refill buffer when empty

Example#

Goal: Copying stdin to stdout, one byte at a time

Without builtin buffer#

1
int main() {
2
  char c;
3
  while (read(STDIN_FILENO, &c, sizeof(char)))
4
    write(STDOUT_FILENO, &c, sizeof(char));
5

6
  return 0;
7
}

To many read system calls, wasteful !

With builtin buffer#

1
#define BUFFER_SIZE 1024
2

3
ssize_t buffered_getchar(int fd, char *c) {
4
  static char buf[BUFFER_SIZE];
5
  static size_t size = 0;
6
  static size_t pos = 0;
7

8
  if (pos >= size) {
9
    size = read(fd, buf, BUFFER_SIZE);
10

11
    if (size <= 0) {
12
      return -1; // EOF or error
13
    }
14
    pos = 0;
15
  }
16
  *c = buf[pos++];
17

18
  return 1; // success got 1 character
19
}

Only one read system call.

File Metadata#

Metadata is data about data, in this case file data
Per-file metadata maintained by kernel
- Accessed by users with the stat and fstat functions

1
/* Metadata returned by the stat and fstat functions */
2
struct stat {
3
  dev_t st_dev;             /* Device */
4
  ino_t st_ino;             /* inode */
5
  mode_t st_mode;           /* Protection and file type */
6
  nlink_t st_nlink;         /* Number of hard links */
7
  uid_t st_uid;             /* User ID of owner */
8
  gid_t st_gid;             /* Group ID of owner */
9
  dev_t st_rdev;            /* Device type (if inode device) */
10
  off_t st_size;            /* Total size, in bytes */
11
  unsigned long st_blksize; /* Blocksize for filesystem I/O */
12
  unsigned long st_blocks;  /* Number of blocks allocated */
13
  time_t st_atime;          /* Time of last access */
14
  time_t st_mtime;          /* Time of last modification */
15
  time_t st_ctime;          /* Time of last change */
16
};

Example of Accessing Metadata#

1
int main(int argc, char **argv) {
2
  struct stat file_stat;
3
  char *type, *readok;
4

5
  stat(argv[1], &file_stat);
6

7
  if (S_ISREG(file_stat.st_mode))    /* Determine file type */
8
    type = "regular";
9
  else if (S_ISDIR(file_stat.st_mode))
10
    type = "directory";
11
  else
12
    type = "other";
13
  if ((file_stat.st_mode & S_IRUSR)) /* Check read access */
14
    readok = "yes";
15
  else
16
    readok = "no";
17

18
  printf("type: %s, read: %s\n", type, readok);
19
  exit(0);
20
}

Represents Open Files#

Two descriptors referencing two distinct open files. Descriptor 1 (stdout) points to terminal, and descriptor 4 points to open disk file

Two distinct descriptors sharing the same disk file through two distinct open file table entries
- E.g., Calling open twice with the same filename argument

A child process inherits its parent’s open files
- Note: situation unchanged by exec functions (use fcntl to change)

Before fork call:

After fork call:

Child’s table same as parent’s, and +1 to each refcnt

I/O Redirection#

dup2(oldfd, newfd)
- Copies (per-process) descriptor table entry oldfd to entry newfd

Example#

Open file to which stdout should be redirected
- Happens in child executing shell code, before exec

Call dup2(4,1)
- Cause fd=1 (stdout) to refer to disk file pointed at by fd=4

Standard I/O#

Standard I/O Streams#

Standard I/O models open files as streams
- Abstraction for a file descriptor and a buffer in memory
C programs begin life with three open streams (defined in stdio.h)
- stdin: standard input
- stdout: standard output
- stderr: standard error

1
#include <stdio.h>
2

3
extern FILE *stdin;  /* standard input (descriptor 0) */
4
extern FILE *stdout; /* standard output (descriptor 1) */
5
extern FILE *stderr; /* standard error (descriptor 2) */
6

7
int main() {
8
  fprintf(stdout, "Hello, world\n");
9
}

Buffering in Standard I/O#

Standard I/O functions use buffered I/O

1
int main() {
2
  printf("h");
3
  printf("e");
4
  printf("l");
5
  printf("l");
6
  printf("o");
7
  printf("\n");
8
  fflush(stdout);
9
  exit(0);
10
}

Buffer flushed to output fd on \n, call to fflush or exit, or return from main

Network Programming#

A Client-Server Transaction#

Most network applications are based on the client-server model:
- A server process and one or more client processes
- Server manages some resources
- Server provides service by manipulating resource for clients
- Server activated by request from client (vending machine analogy)

Hardware Organization of a Network Host#

Computer Networks#

A network is a hierarchical system of boxes and wires organized by geographical proximity
- SAN (System Area Network) spans cluster or machine room
  - Switched Ethernet, Quadrics QSW, …
- LAN (Local Area Network) spans a building or campus
  - Ethernet is most prominent example
- WAN (Wide Area Network) spans country or world
  - Typically high-speed point-to-point phone lines
An internetwork (internet) is an interconnected set of networks
- The Global IP Internet (uppercase “I”) is the most famous example of an internet (lowercase “i”)

Lowest Level: Ethernet Segment#

Ethernet segment consists of a collection of hosts connected by wires (twisted pairs) to a hub
Spans room or floor in a building
Operation
- Each Ethernet adapter has a unique 48-bit address (MAC address)
  - E.g., 00:16:ea:e3:54:e6
- Hosts send bits to any other host in chunks called frames
- Hub slavishly copies each bit from each port to every other port
  - Every host sees every bit
  - Note: Hubs are on their way out. Bridges (switches, routers) became cheap enough to replace them

Next Level: Bridged Ethernet Segment#

Spans building or campus
Bridges cleverly learn which hosts are reachable from which ports and then selectively copy frames from port to port

Conceptual View of LANs#

For simplicity, hubs, bridges, and wires are often shown as a collection of hosts attached to a single wire:

Next Level: internets (lower case)#

Multiple incompatible LANs can be physically connected by specialized computers called routers
The connected networks are called an internet (lower case)

Logical Structure of an internet#

Ad hoc interconnection of networks
- No particular topology
- Vastly different router & link capacities
Send packets from source to destination by hopping through networks
- Router forms bridge from one network to another
- Different packets may take different routes

The Notion of an internet Protocol#

How is it possible to send bits across incompatible LANs and WANs ?
Solution: protocol software running on each host and router
- Protocol is a set of rules that governs how hosts and routers should cooperate when they transfer data from network to network
- Smooths out the differences between the different networks

What Does an internet Protocol Do ?#

Provides a naming scheme
- An internet protocol defines a uniform format for host addresses
- Each host (and router) is assigned at least one of these internet addresses that uniquely identifies it
Provides a delivery mechanism
- An internet protocol defines a standard transfer unit (packet)
- Packet consists of header and payload
  - Header: contains info such as packet size, source and destination addresses
  - Payload: contains data bits sent from source host

Transferring internet Data Via Encapsulation#

Other Issues#

We are glossing over a number of important questions:
- What if different networks have different maximum frame sizes ? (segmentation)
- How do routers know where to forward frames ?
- How are routers informed when the network topology changes ?
- What if packets get lost ?
These (and other) questions are addressed by the area of systems known as computer networking

Global IP Internet (upper case)#

Most famous example of an internet
Based on the TCP/IP protocol family
- IP (Internet Protocol):
  - Provides basic naming scheme and unreliable delivery capability of packets (datagrams) from host-to-host
- UDP (Unreliable Datagram Protocol)
  - Uses IP to provide unreliable datagram delivery from process-to-process
- TCP (Transmission Control Protocol)
  - Uses IP to provide reliable byte streams from process-to-process over connections
Accessed via a mix of Unix file I/O and functions from the sockets interface

Hardware and Software Organization of an Internet Application#

A Programmer’s View of the Internet#

Hosts are mapped to a set of 32-bit IP addresses
The set of IP addresses is mapped to a set of identifiers called Internet domain names
A process on one Internet host can communicate with a process on another Internet host over a connection

Aside: IPv4 and IPv6#

The original Internet Protocol, with its 32-bit addresses, is known as Internet Protocol Version 4 (IPv4)
1996: Internet Engineering Task Force (IETF) introduced Internet Protocol Version 6 (IPv6) with 128-bit addresses
- Intended as the successor to IPv4

IP Addresses#

32-bit IP addresses are stored in an IP address struct
- IP addresses are always stored in memory in network byte order (big-endian byte order)
- True in general for any integer transferred in a packet header from one machine to another
  - E.g., the port number used to identify an Internet connection

1
/* Internet address structure */
2
struct in_addr {
3
  uint32_t s_addr; /* network byte order (big-endian) */
4
};

Dotted Decimal Notation#

By convention, each byte in a 32-bit IP address is represented by its decimal value and separated by a period
- IP address: 0x8002C2F2 = 128.2.194.242
Use getaddrinfo and getnameinfo functions to convert between IP addresses and dotted decimal format

IP Address Structure#

IP (V4) Address space divided into classes:

Network ID Written in form w.x.y.z/n
- n = number of bits in network address (Net ID)
- E.g., CMU written as 128.2.0.0/16
  - Class B address
Unrouted (private) IP addresses:
- 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16

Internet Domain Names#

Domain Naming System (DNS)#

The Internet maintains a mapping between IP addresses and domain names in a huge worldwide distributed database called DNS
Conceptually, programmers can view the DNS database as a collection of millions of host entries
- Each host entry defines the mapping between a set of domain names and IP addresses
- In a mathematical sense, a host entry is an equivalence class of domain names and IP addresses

Basic Internet Components#

Internet backbone
- Collection of routers (nationwide or worldwide) connected by high-speed point-to-point networks
Internet Exchange Points (IXP)
- Router that connects multiple backbones (often referred to as peers)
- Also called Network Access Points (NAP)
Regional networks
- Smaller backbones that cover smaller geographical areas (e.g., cities or states)
Point of presence (POP)
- Machine that is connected to the Internet
Internet Service Providers (ISPs)
- Provide dial-‐up or direct access to POPs

Internet Connection Hierarchy#

Internet Connections#

Clients and servers communicate by sending streams of bytes over connections. Each connection is:
- Point-to-point: connects a pair of processes
- Full-duplex: data can flow in both directions at the same time
- Reliable: stream of bytes sent by the source is eventually received by the destination in the same order it was sent
A socket is an endpoint of a connection
- Socket address is an IPaddress:Port pair
A port is a 16-bit integer that identifies a process:
- Ephemeral port: Assigned automatically by client kernel when client makes a connection request
- Well‐known port: Associated with some service provided by a server (e.g., port 80 is associated with Web servers)

Well-known Ports and Service Names#

Popular services have permanently assigned well-known ports and corresponding well-known service names:
- echo server: 7/echo
- ssh servers: 22/ssh
- email server: 25/smtp
- web servers: 80/http
Mappings between well-known ports and service names is contained in the file /etc/services on each Linux machine

Anatomy of a Connection#

A connection is uniquely identified by the socket addresses of its endpoints (socket pair)
- (cliaddr:cliport, servaddr:servport)

Using Ports to Identify Services#

Sockets#

What is a socket ?
- To the kernel, a socket is an endpoint of communication
- To an application, a socket is a file descriptor that lets the application read/write from/to the network
  - Remember: All Unix I/O devices, including networks, are modeled as files
Clients and servers communicate with each other by reading from and writing to socket descriptors

The main distinction between regular file I/O and socket I/O is how the application “opens” the socket descriptors

Socket Address Structures#

Generic socket address:
- For address arguments to connect, bind, and accept
- Necessary only because C did not have generic (void *) pointers when the sockets interface was designed
- For casting convenience, we adopt the Stevens convention: typedef struct sockaddr SA;

1
struct sockaddr {
2
  uint16_t sa_family; /* Protocol family */
3
  char sa_data[14];   /* Address data */
4
};

Internet-specific socket address:
- Must cast (struct sockaddr_in *) to (struct sockaddr *) for functions that take socket address arguments

1
struct sockaddr_in {
2
  uint16_t sin_family;       /* Protocol family (always AF_INET) */
3
  uint16_t sin_port;         /* Port num in network byte order */
4
  struct in_addr sin_addr;   /* IP addr in network byte order */
5
  unsigned char sin_zero[8]; /* Pad to sizeof(struct sockaddr) */
6
};

Sockets Interface#

Set of system-level functions used in conjunction with Unix I/O to build network applications
Created in the early 80’s as part of the original Berkeley distribution of Unix that contained an early version of the Internet protocols
Available on all modern systems

socket#

Clients and servers use the socket function to create a socket descriptor: int socket(int domain, int type, int protocol)
Example: int clientfd = socket(AF_INET, SOCK_STREAM, 0);

TIP
Protocol specific ! Best practice is to use getaddrinfo to generate the parameters automatically, so that code is protocol independent.

bind#

A server uses bind to ask the kernel to associate the server’s socket address with a socket descriptor: int bind(int sockfd, SA *addr, socklen_t addrlen);
The process can read bytes that arrive on the connection whose endpoint is addr by reading from descriptor sockfd
Similarly, writes to sockfd are transferred along connection whose endpoint is addr

TIP
Best practice is to use getaddrinfo to supply the arguments addr and addrlen.

listen#

By default, kernel assumes that descriptor from socket function is an active socket that will be on the client end of a connection
A server calls the listen function to tell the kernel that a descriptor will be used by a server rather than a client: int listen(int sockfd, int backlog);
Converts sockfd from an active socket to a listening (passive) socket that can accept connection requests from clients
backlog is a hint about the number of outstanding connection requests that the kernel should queue up before starting to refuse requests

accept#

Servers wait for connection requests from clients by calling accept: int accept(int listenfd, SA *addr, int *addrlen);
Waits for connection request to arrive on the connection bound to listenfd, then fills in client’s socket address in addr and size of the socket address in addrlen
Returns a connected descriptor that can be used to communicate with the client via Unix I/O routines

connect#

A client establishes a connection with a server by calling connect: int connect(int clientfd, SA *addr, socklen_t addrlen);
Attempts to establish a connection with server at socket address addr
- If successful, then clientfd is now ready for reading and writing
- Resulting connection is characterized by socket pair (x:y, addr.sin_addr:addr.sin_port)
  - x is client address
  - y is ephemeral port that uniquely identifies client process on client host

TIP
Best practice is to use getaddrinfo to supply the arguments addr and addrlen.

Connected vs. Listening Descriptors#

Listening descriptor
- End point for client connection requests
- Created once and exists for lifetime of the server
Connected descriptor
- End point of the connection between client and server
- A new descriptor is created each time the server accepts a connection request from a client
- Exists only as long as it takes to service client
Why the distinction ?
- Allows for concurrent servers that can communicate over many client connections simultaneously
  - E.g., Each time we receive a new request, we fork a child to handle the request

Host and Service Conversion#

getaddrinfo#

getaddrinfo is the modern way to convert string representations of hostnames, host addresses, ports, and service names to socket address structures
- Replaces obsolete gethostbyname and getservbyname functions
Advantages:
- Reentrant (can be safely used by threaded programs)
- Allows us to write portable protocol-independent code
  - Works with both IPv4 and IPv6
Disadvantages:
- Somewhat complex
- Fortunately, a small number of usage patterns suffice in most cases

1
int getaddrinfo(const char *host,             /* Hostname or address */
2
                const char *service,          /* Port or service name */
3
                const struct addrinfo *hints, /* Input parameters */
4
                struct addrinfo **result);    /* Output linked list */
5

6
void freeaddrinfo(struct addrinfo *result);   /* Free linked list */
7

8
const char *gai_strerror(int errcode);        /* Return error msg. */

Given host and service, getaddrinfo returns result that points to a linked list of addrinfo structs, each of which points to a corresponding socket address struct, and which contains arguments for the sockets interface functions
Helper functions:
- freeadderinfo frees the entire linked list
- gai_strerror converts error code to an error message

Linked List Returned by getaddrinfo#

Clients: walk this list, trying each socket address in turn, until the calls to socket and connect succeed
Servers: walk the list until calls to socket and bind succeed

addrinfo Struct#

1
struct addrinfo {
2
  int ai_flags;             /* Hints argument flags */
3
  int ai_family;            /* First arg to socket function */
4
  int ai_socktype;          /* Second arg to socket function */
5
  int ai_protocol;          /* Third arg to socket function */
6
  char *ai_canonname;       /* Canonical host name */
7
  size_t ai_addrlen;        /* Size of ai_addr struct */
8
  struct sockaddr *ai_addr; /* Ptr to socket address structure */
9
  struct addrinfo *ai_next; /* Ptr to next item in linked list */
10
};

Each addrinfo struct returned by getaddrinfo contains arguments that can be passed directly to socket function
Also points to a socket address struct that can be passed directly to connect and bind functions

getnameinfo#

getnameinfo is the inverse of getaddrinfo, converting a socket address to the corresponding host and service
- Replaces obsolete gethostbyaddr and getservbyport functions
- Reentrant and protocol independent

1
int getnameinfo(const SA *sa, socklen_t salen, /* In: socket addr */
2
                char *host, size_t hostlen,    /* Out: host */
3
                char *serv, size_t servlen,    /* Out: service */
4
                int flags);                    /* optional flags */

Conversion Example#

1
#include <netdb.h>
2
#include <stdio.h>
3
#include <stdlib.h>
4
#include <string.h>
5

6
#define MAXLINE NI_MAXHOST
7

8
typedef struct addrinfo SA;
9

10
int main(int argc, char **argv) {
11
  SA *p, *listp, hints;
12
  char buf[MAXLINE];
13
  int rc, flags;
14

15
  /* Get a list of addrinfo records */
16
  memset(&hints, 0, sizeof(SA));
17
  hints.ai_family = AF_INET;       /* IPv4 only */
18
  hints.ai_socktype = SOCK_STREAM; /* Connections only */
19

20
  if ((rc = getaddrinfo(argv[1], NULL, &hints, &listp)) != 0) {
21
    fprintf(stderr, "getaddrinfo error: %s\n", gai_strerror(rc));
22
    exit(1);
23
  }
24

25
  /* Walk the list and display each IP address */
26
  flags = NI_NUMERICHOST; /* Display address instead of name */
27
  for (p = listp; p; p = p->ai_next) {
28
    getnameinfo(p->ai_addr, p->ai_addrlen, buf, MAXLINE, NULL, 0, flags);
29
    printf("%s\n", buf);
30
  }
31

32
  /* Clean up */
33
  freeaddrinfo(listp);
34
  exit(0);
35
}

Echo Client Example#

open_clientfd#

Establish a connection with a server

1
int open_clientfd(char *hostname, char *port) {
2
  int clientfd;
3
  struct addrinfo hints, *listp, *p;
4

5
  /* Get a list of potential server addresses */
6
  memset(&hints, 0, sizeof(struct addrinfo));
7
  hints.ai_socktype = SOCK_STREAM; /* Open a connection */
8
  hints.ai_flags = AI_NUMERICSERV; /* Using numeric port arg. */
9
  hints.ai_flags |= AI_ADDRCONFIG; /* Recommended for connections */
10

11
  getaddrinfo(hostname, port, &hints, &listp);
12

13
  /* Walk the list for one that we can successfully connect to */
14
  for (p = listp; p; p = p->ai_next) {
15
    /* Create a socket descriptor */
16
    if ((clientfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol)) < 0)
17
      continue; /* Socket failed, try the next */
18

19
    /* Connect to the server */
20
    if (connect(clientfd, p->ai_addr, p->ai_addrlen) != -1)
21
      break; /* Success */
22
    close(clientfd); /* Connect failed, try another */
23
  }
24

25
  /* Clean up */
26
  freeaddrinfo(listp);
27

28
  if (!p) /* All connects failed */
29
    return -1;
30
  else    /* The last connect succeeded */
31
    return clientfd;
32
}

open_listenfd#

Create a listening descriptor that can be used to accept connection requests from clients

1
int open_listenfd(char *port) {
2
  struct addrinfo hints, *listp, *p;
3
  int listenfd, optval = 1;
4

5
  /* Get a list of potential server addresses */
6
  memset(&hints, 0, sizeof(struct addrinfo));
7
  hints.ai_socktype = SOCK_STREAM;             /* Accept connect. */
8
  hints.ai_flags = AI_PASSIVE | AI_ADDRCONFIG; /* On any IP addr */
9
  hints.ai_flags |= AI_NUMERICSERV;            /* Using port no. */
10

11
  getaddrinfo(NULL, port, &hints, &listp);
12

13
  /* Walk the list for one that we can bind to */
14
  for (p = listp; p; p = p->ai_next) {
15
    /* Create a socket descriptor */
16
    if ((listenfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol)) < 0)
17
      continue; /* Socket failed, try the next */
18

19
    /* Eliminates "Address already in use" error from bind */
20
    setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, (const void *)&optval , sizeof(int));
21

22
    /* Bind the descriptor to the address */
23
    if (bind(listenfd, p->ai_addr, p->ai_addrlen) == 0)
24
      break; /* Success */
25
    close(listenfd); /* Bind failed, try the next */
26
  }
27

28
  /* Clean up */
29
  freeaddrinfo(listp);
30

31
  if (!p) /* No address worked */
32
    return -1;
33

34
  /* Make it a listening socket ready to accept conn. requests */
35
  if (listen(listenfd, LISTENQ) < 0) {
36
    close(listenfd);
37
    return -1;
38
  }
39
  return listenfd;
40
}

NOTE
open_clientfd and open_listenfd are both independent of any particular version of IP.

Echo Client#

1
#include <csapp.h>
2

3
int main(int argc, char **argv) {
4
  int clientfd;
5
  char *host, *port, buf[MAXLINE];
6
  rio_t rio;
7

8
  host = argv[1];
9
  port = argv[2];
10

11
  clientfd = open_clientfd(host, port);
12
  rio_readinitb(&rio, clientfd);
13

14
  while (fgets(buf, MAXLINE, stdin) != NULL) {
15
    rio_writen(clientfd, buf, strlen(buf));
16
    rio_readlineb(&rio, buf, MAXLINE);
17
    fputs(buf, stdout);
18
  }
19
  close(clientfd);
20
  exit(0);
21
}

Iterative Echo Server#

1
#include <csapp.h>
2

3
void echo(int connfd);
4

5
int main(int argc, char **argv) {
6
  int listenfd, connfd;
7
  socklen_t clientlen;
8
  struct sockaddr_storage clientaddr; /* Enough room for any addr */
9
  char client_hostname[MAXLINE], client_port[MAXLINE];
10

11
  listenfd = open_listenfd(argv[1]);
12
  while (1) {
13
    clientlen = sizeof(struct sockaddr_storage); /* Important! */
14
    connfd = accept(listenfd, (SA *)&clientaddr, &clientlen);
15
    getnameinfo((SA *)&clientaddr, clientlen, client_hostname, MAXLINE, client_port, MAXLINE, 0);
16
    printf("Connected to (%s, %s)\n", client_hostname, client_port);
17
    echo(connfd);
18
    close(connfd);
19
  }
20
  exit(0);
21
}

echo#

The server uses RIO to read and echo text lines until EOF (end-of-file) condition is encountered
- EOF condition caused by client calling close(clientfd)

1
void echo(int connfd) {
2
  size_t n;
3
  char buf[MAXLINE];
4
  rio_t rio;
5

6
  rio_readinitb(&rio, connfd);
7
  while((n = rio_readlineb(&rio, buf, MAXLINE)) != 0) {
8
    printf("server received %d bytes\n", (int)n);
9
    rio_writen(connfd, buf, n);
10
  }
11
}

Web Server Basics#

Clients and servers communicate using the Hyper Text Transfer Protocol (HTTP)
- Client and server establish TCP connection
- Client requests content
- Server responds with requested content
- Client and server close connection (eventually)
HTTP/1.1 RFC 2616, June, 1999
- https://www.w3.org/Protocols/rfc2616/rfc2616.html

HTTP Versions#

Major differences between HTTP/1.1 and HTTP/1.0
- HTTP/1.0 uses a new connection for each transaction
- HTTP/1.1 also supports persistent connections
  - Multiple transactions over the same connection
  - Connection: Keep-Alive
- HTTP/1.1 requires HOST header
  - Host: www.cmu.edu
  - Makes it possible to host multiple websites at single Internet host
- HTTP/1.1 supports chunked encoding
  - Transfer-Encoding: chunked
- HTTP/1.1 adds additional support for caching

Web Content#

Web servers return content to clients
- content: a sequence of bytes with an associated MIME (Multipurpose Internet Mail Extensions) type
Example MIME types
- text/html — HTML document
- text/plain — Unformatted text
- image/gif — Binary image encoded in GIF format
- image/png — Binary image encoded in PNG format
- image/jpeg — Binary image encoded in JPEG format
The complete list of MIME types: https://www.iana.org/assignments/media-types/media-types.xhtml

Static and Dynamic Content#

The content returned in HTTP responses can be either static or dynamic
- Static content: content stored in files and retrieved in response to an HTTP request
  - E.g., HTML files, images, audio clips
  - Request identifies which content file
- Dynamic content: content produced on-the-fly in response to an HTTP request
  - E.g., content produced by a program executed by the server on behalf of the client
  - Request identifies file containing executable code
Bottom line: Web content is associated with a file that is managed by the server

URLs#

Unique name for a file: URL (Universal Resource Locator)
- Example URL: http://www.cmu.edu:80/index.html
Clients use prefix (http://www.cmu.edu:80) to infer:
- What kind (protocol) of server to contact (HTTP)
- Where the server is (www.cmu.edu)
- What port it is listening on (80)
Servers use suffix (/index.html) to:
- Determine if request is for static or dynamic content
  - No hard and fast rules for this
  - One convention: executables reside in cgi-bin directory
- Find file on file system
  - Initial / in suffix denotes home directory for requested content
  - Minimal suffix is /, which server expands to configured default filename (usually, index.html)

HTTP Requests#

HTTP request is a request line, followed by zero or more request headers
Request line: <method> <uri> <version>
- <method> is one of GET, POST, OPTIONS, HEAD, PUT, DELETE, or TRACE
- <uri> is typically URL for proxies, URL suffix for servers
  - A URL is a type of URI (Uniform Resource Identifier)
  - https://www.ietf.org/rfc/rfc2396.txt
- <version> is HTTP version of request (HTTP/1.0 or HTTP/1.1)
Request headers: <header name>: <header data>
- Provide additional information to the server

HTTP Responses#

HTTP response is a response line followed by zero or more response headers, possibly followed by content, with blank line (\r\n) separating headers from content
Response line: <version> <status code> <status msg>
- <version> is HTTP version of the response
- <status code> is numeric status
- <status msg> is corresponding English text
  - 200 OK — Request was handled without error
  - 301 Moved — Provide alternate URL
  - 404 Not found — Server couldn’t find the file
- Response headers: <header name>: <header data>
  - Provide additional information about response
  - Content-Type: MIME type of content in response body
  - Content-Length: Length of content in response body

Data Transfer Mechanisms#

Standard
- Specify total length with content-length
- Requires that program buffer entire message
Chunked
- Break into blocks
- Prefix each block with number of bytes (Hex coded)

Chunked Encoding Example#

HTTP/1.1 200 OK\n
Date: Sun, 31 Oct 2010 20:47:48 GMT\n
Server: Apache/1.3.41 (Unix)\n
Keep-Alive: timeout=15, max=100\n
Connection: Keep-Alive\n
Transfer-Encoding: chunked\n
Content-Type: text/html\n
\r\n
d75\r\n
<html>
<head>
.<link href="http://www.cs.cmu.edu/style/calendar.css" rel="stylesheet" type="text/css">
</head>
<body id="calendar_body">
<div id='calendar'><table width='100%' border='0' cellpadding='0' cellspacing='1' id='cal'>
...
</body>
</html>
\r\n
0\r\n
\r\n

d75\r\n — First Chunk: 0xd75 = 3445 bytes
0\r\n — Second Chunk: 0 bytes (indicates last chunk)

Example HTTP Transaction#

whaleshark> telnet www.cmu.edu 80             Client: open connection to server
Trying 128.2.42.52...                         Telnet prints 3 lines to terminal
Connected to WWW-CMU-PROD-VIP.ANDREW.cmu.edu.
Escape character is '^]'.
GET / HTTP/1.1                                Client: request line
Host: www.cmu.edu                             Client: required HTTP/1.1 header
                                              Client: empty line terminates headers
HTTP/1.1 301 Moved Permanently                Server: response line
Date: Wed, 05 Nov 2014 17:05:11 GMT           Server: followed by 5 response headers
Server: Apache/1.3.42 (Unix)                  Server: this is an Apache server
Location: http://www.cmu.edu/index.shtml      Server: page has moved here
Transfer-Encoding: chunked                    Server: response body will be chunked
Content-Type: text/html; charset=...          Server: expect HTML in response body
                                              Server: empty line terminates headers
15c                                           Server: first line in response body
<HTML><HEAD>                                  Server: start of HTML content
...
</BODY></HTML>                                Server: end of HTML content
0                                             Server: last line in response body
Connection closed by foreign host.            Server: closes connection

HTTP standard requires that each text line end with \r\n
Blank line (\r\n) terminates request and response headers

Tiny Web Server#

Tiny Web server described in text
- Tiny is a sequential Web server
- Serves static and dynamic content to real browsers
- 239 lines of commented C code
- Not as complete or robust as a real Web server
  - You can break it with poorly-formed HTTP requests (e.g., terminate lines with \n instead of \r\n)

Operation#

Accept connection from client
Read request from client (via connected socket)
Split into <method> <uri> <version>
- If method not GET, then return error
If URI contains cgi-bin then serve dynamic content
- Would do wrong thing if had file abcgi-bingo.html
- Fork process to execute program
Otherwise serve static content
- Copy file to output

Serving Static Content#

1
void serve_static(int fd, char *filename, int filesize) {
2
  int srcfd;
3
  char *srcp, filetype[MAXLINE], buf[MAXBUF];
4

5
  /* Send response headers to client */
6
  get_filetype(filename, filetype);
7
  sprintf(buf, "HTTP/1.0 200 OK\r\n");
8
  sprintf(buf, "%sServer: Tiny Web Server\r\n", buf);
9
  sprintf(buf, "%sConnection: close\r\n", buf);
10
  sprintf(buf, "%sContent-length: %d\r\n", buf, filesize);
11
  sprintf(buf, "%sContent-type: %s\r\n\r\n", buf, filetype);
12
  rio_writen(fd, buf, strlen(buf));
13

14
  /* Send response body to client */
15
  srcfd = open(filename, O_RDONLY, 0);
16
  srcp = mmap(0, filesize, PROT_READ, MAP_PRIVATE, srcfd, 0);
17
  close(srcfd);
18
  rio_writen(fd, srcp, filesize);
19
  munmap(srcp, filesize);
20
}

Serving Dynamic Content#

Client sends request to server
If request URI contains the string /cgi-bin, the Tiny server assumes that the request is for dynamic content

The server creates a child process and runs the program identified by the URI in that process

The child runs and generates the dynamic content
The server captures the content of the child and forwards it without modification to the client

Issues in Serving Dynamic Content#

How does the client pass program arguments to the server ?
How does the server pass these arguments to the child ?
How does the server pass other info relevant to the request to the child ?
How does the server capture the content produced by the child ?
These issues are addressed by the Common Gateway Interface (CGI) specification

CGI#

Because the children are written according to the CGI spec, they are often called CGI programs
However, CGI really defines a simple standard for transferring information between the client (browser), the server, and the child process
CGI is the original standard for generating dynamic content. Has been largely replaced by other, faster techniques:
- E.g., fastCGI, Apache modules, Java servlets, Rails controllers
- Avoid having to create process on the fly (expensive and slow)

Serving Dynamic Content With GET#

Client pass arguments to the server by appended the arguments to the URI
Can be encoded directly in a URL typed to a browser or a URL in an HTML link
- http://add.com/cgi-bin/adder?15213&18213
- adder is the CGI program on the server that will do the addition
- Argument list starts with ?
- Arguments separated by &
- Spaces represented by + or %20
URL suffix: cgi-bin/adder?15213&18213
Result displayed on browser:

Welcome to add.com: THE Internet addition portal.

The answer is: 15213 + 18213 = 33426

Thanks for visiting!

The server pass these arguments to the child by environment variable in QUERY_STRING
- A single string containing everything after the ?
- For add: QUERY_STRING = 15213&18213

1
/* Extract the two arguments */
2
if ((buf = getenv("QUERY_STRING")) != NULL) {
3
  p = strchr(buf, '&');
4
  *p = '\0';
5
  strcpy(arg1, buf);
6
  strcpy(arg2, p+1);
7
  n1 = atoi(arg1);
8
  n2 = atoi(arg2);
9
}

The child generates its output on stdout. Server uses dup2 to redirect stdout to its connected socket. So the server capture the content produced by the child

1
void serve_dynamic(int fd, char *filename, char *cgiargs) {
2
  char buf[MAXLINE], *emptylist[] = { NULL };
3

4
  /* Return first part of HTTP response */
5
  sprintf(buf, "HTTP/1.0 200 OK\r\n");
6
  rio_writen(fd, buf, strlen(buf));
7
  sprintf(buf, "Server: Tiny Web Server\r\n");
8
  rio_writen(fd, buf, strlen(buf));
9

10
  if (fork() == 0) { /* Child */
11
    /* Real server would set all CGI vars here */
12
    setenv("QUERY_STRING", cgiargs, 1);
13
    dup2(fd, STDOUT_FILENO);              /* Redirect stdout to client */
14
    execve(filename, emptylist, environ); /* Run CGI program */
15
  }
16
  wait(NULL); /* Parent waits for and reaps child */
17
}

Notice that only the CGI child process knows the content type and length, so it must generate those headers

1
/* Make the response body */
2
sprintf(content, "Welcome to add.com: ");
3
sprintf(content, "%sTHE Internet addition portal.\r\n<p>", content);
4
sprintf(content, "%sThe answer is: %d + %d = %d\r\n<p>", content, n1, n2, n1 + n2);
5
sprintf(content, "%sThanks for visiting!\r\n", content);
6

7
/* Generate the HTTP response */
8
printf("Content-length: %d\r\n", (int)strlen(content));
9
printf("Content-type: text/html\r\n\r\n");
10
printf("%s", content);
11
fflush(stdout);
12
exit(0);

Proxies#

A proxy is an intermediary between a client and an origin server
- To the client, the proxy acts like a server
- To the server, the proxy acts like a client

Why Proxies ?#

Can perform useful functions as requests and responses pass by
- Examples: Caching, logging, anonymization, filtering, transcoding

Concurrent Programming#

Concurrent Programming is Hard#

The human mind tends to be sequential
The notion of time is often misleading
Thinking about all possible sequences of events in a computer system is at least error prone and frequently impossible
Classical problem classes of concurrent programs:
- Races: outcome depends on arbitrary scheduling decisions elsewhere in the system
  - Example: who gets the last seat on the airplane ?
- Deadlock: improper resource allocation prevents forward progress
  - Example: traffic gridlock
- Livelock / Starvation / Fairness: external events and/or system scheduling decisions can prevent sub-task progress
  - Example: people always jump in front of you in line

Iterative Servers#

Iterative servers process one request at a time

Second client attempts to connect to iterative server
Call to connect returns
- Even though connection not yet accepted
- Server side TCP manager queues request
- Feature known as “TCP listen backlog”
Call to rio_writen returns
- Server side TCP manager buffers input data
Call to rio_readlineb blocks
- Server hasn’t written anything for it to read yet

Solution: use concurrent servers instead
- Concurrent servers use multiple concurrent flows to serve multiple clients at the same time

Approaches for Writing Concurrent Servers#

Process-based
- Kernel automatically interleaves multiple logical flows
- Each flow has its own private address space
Event-based
- Programmer manually interleaves multiple logical flows
- All flows share the same address space
- Uses technique called I/O multiplexing
Thread-based
- Kernel automatically interleaves multiple logical flows
- Each flow shares the same address space
- Hybrid of of process-based and event-based

Process-based Servers#

Spawn separate process for each client

Process-Based Concurrent Echo Server#

1
void sigchld_handler(int sig) {
2
  /* Reap all zombie children */
3
  while (waitpid(-1, 0, WNOHANG) > 0)
4
    ;
5
  return;
6
}
7

8
int main(int argc, char **argv) {
9
  int listenfd, connfd;
10
  socklen_t clientlen;
11
  struct sockaddr_storage clientaddr;
12

13
  signal(SIGCHLD, sigchld_handler);
14
  listenfd = open_listenfd(argv[1]);
15
  while (1) {
16
    clientlen = sizeof(struct sockaddr_storage);
17
    connfd = accept(listenfd, (SA *)&clientaddr, &clientlen);
18

19
    if (fork() == 0) {
20
      close(listenfd); /* Child closes its listening socket */
21
      echo(connfd);    /* Child services client */
22
      close(connfd);   /* Child closes connection with client */
23
      exit(0);         /* Child exits */
24
    }
25
    close(connfd); /* Parent closes connected socket (important!) */
26
  }
27
}

Process-based Server Execution Model#

Each client handled by independent child process
No shared state between them
Both parent & child have copies of listenfd and connfd
- Parent must close connfd
- Child should close listenfd

Issues with Process-based Servers#

Listening server process must reap zombie children to avoid fatal memory leak
Parent process must close its copy of connfd
- Kernel keeps reference count for each socket/open file
- After fork, refcnt(connfd) = 2
- Connection will not be closed until refcnt(connfd) = 0

Pros and Cons of Process-based Servers#

Handle multiple connections concurrently
Clean sharing model
- descriptors (no)
- file tables (yes)
- global variables (no)
Simple and straightforward
Additional overhead for process control
Nontrivial to share data between processes
- Requires IPC (interprocess communication) mechanisms
  - FIFO’s (named pipes), System V shared memory and semaphores

Event-based Servers#

Server maintains set of active connections
- Array of connfd’s
Repeat:
- Determine which descriptors (connfd’s or listenfd) have pending inputs
  - e.g., using select or epoll functions
  - arrival of pending input is an event
- If listenfd has input, then accept connection and add new connfd to array
- Service all connfd’s with pending inputs

I/O Multiplexed Event Processing#

Pros and Cons of Event-based Servers#

One logical control flow and address space
Can single-step with a debugger
No process or thread control overhead
- Design of choice for high-performance Web servers and search engines
  - E.g., Node.js, Nginx, Tornado
Significantly more complex to code than process-based or thread-based designs
Hard to provide fine-grained concurrency
- E.g., How to deal with partial HTTP request headers
Cannot take advantage of multi-core
- Single thread of control

Thread-based Servers#

Very similar to process-based approach
- …but using threads instead of processes

Traditional View of a Process#

Process = Process Context + Code, Data, and Stack

Alternate View of a Process#

Process = Thread + Code, Data, and Kernel Context

A Process With Multiple Threads#

Multiple threads can be associated with a process
- Each thread has its own logical control flow
- Each thread shares the same code, data, and kernel context
- Each thread has its own stack for local variables
  - but not protected from other threads
- Each thread has its own Thread ID (TID)

Logical View of Threads#

Threads associated with process form a pool of peers
- Unlike processes which form a tree hierarchy

Concurrent Threads#

Two threads are concurrent if their flows overlap in time
Otherwise, they are sequential
Examples:
- Concurrent: A & B, A & C
- Sequential: B & C

Concurrent Thread Execution#

Single Core Processor
- Simulate parallelism by time slicing
Multi Core Processor
- Can have true parallelism

Threads vs. Processes#

How threads and processes are similar ?
- Each has its own logical control flow
- Each can run concurrently with others (possibly on different cores)
- Each is context switched
How threads and processes are different ?
- Threads share all code and data (except local stacks)
  - Processes (typically) do not
- Threads are somewhat less expensive than processes
  - Process control (creating and reaping) twice as expensive as thread control
  - Linux numbers:
    - ~20K cycles to create and reap a process
    - ~10K cycles (or less) to create and reap a thread

Posix Threads (Pthreads) Interface#

Pthreads: Standard interface for ~60 functions that manipulate threads from C programs
- Creating and reaping threads
  - pthread_create()
  - pthread_join()
- Determining your thread ID
  - pthread_self()
- Terminating threads
  - pthread_cancel()
  - pthread_exit()
  - exit() terminates all threads , ret terminates current thread
- Synchronizing access to shared variables
  - pthread_mutex_init
  - pthread_mutex_[un]lock

The Pthreads “hello, world” Program#

1
void *thread(void *vargp);
2

3
int main() {
4
  pthread_t tid;
5

6
  pthread_create(&tid, NULL, thread, NULL);
7
  pthread_join(tid, NULL);
8
  exit(0);
9
}
10

11
void *thread(void *vargp) {
12
  printf("Hello, world!\n");
13
  return NULL;
14
}

Thread-Based Concurrent Echo Server#

1
void *thread(void *vargp);
2

3
int main(int argc, char **argv) {
4
  int listenfd, *connfdp;
5
  socklen_t clientlen;
6
  struct sockaddr_storage clientaddr;
7
  pthread_t tid;
8

9
  listenfd = open_listenfd(argv[1]);
10
  while (1) {
11
    clientlen = sizeof(struct sockaddr_storage);
12
    connfdp = malloc(sizeof(int));
13
    *connfdp = accept(listenfd, (SA *)&clientaddr, &clientlen);
14
    pthread_create(&tid, NULL, thread, connfdp);
15
  }
16
}
17

18
void *thread(void *vargp) {
19
  int connfd = *((int *)vargp);
20

21
  pthread_detach(pthread_self());
22
  free(vargp);
23
  echo(connfd);
24
  close(connfd);
25
  return NULL;
26
}

malloc of connected descriptor necessary to avoid deadly race (but still have subtle problem)
Run thread in detached mode
- Runs independently of other threads
- Reaped automatically (by kernel) when it terminates
Free storage allocated to hold connfd
Close connfd (important!)

Thread-based Server Execution Model#

Each client handled by individual peer thread
Threads share all process state except TID
Each thread has a separate stack for local variables

Issues With Thread-Based Servers#

Must run “detached” to avoid memory leak
- At any point in time, a thread is either joinable or detached
- Joinable thread can be reaped and killed by other threads
  - must be reaped (with pthread_join) to free memory resources
- Detached thread cannot be reaped or killed by other threads
  - resources are automatically reaped on termination
- Default state is joinable
  - use pthread_detach(pthread_self()) to make detached
Must be careful to avoid unintended sharing
- For example, passing pointer to main thread’s stack
  - pthread_create(&tid, NULL, thread, (void *)&connfd);
- All functions called by a thread must be thread-safe

Pros and Cons of Thread-Based Designs#

Easy to share data structures between threads
- e.g., logging information, file cache
Threads are more efficient than processes
Unintentional sharing can introduce subtle and hard-to-reproduce errors !
- The ease with which data can be shared is both the greatest strength and the greatest weakness of threads
- Hard to know which data shared & which private
- Hard to detect by testing
  - Probability of bad race outcome very low
  - But nonzero !

Synchronization#

Shared Variables in Threaded C Programs#

Question: Which variables in a threaded C program are shared ?
- The answer is not as simple as “global variables are shared” and “stack variables are private”
Def: A variable x is shared if and only if multiple threads reference some instance of x
Requires answers to the following questions:
- What is the memory model for threads ?
- How are instances of variables mapped to memory ?
- How many threads might reference each of these instances ?

Threads Memory Model#

Conceptual model:
- Multiple threads run within the context of a single process
- Each thread has its own separate thread context
  - Thread ID, stack, stack pointer, PC, condition codes, and GP registers
- All threads share the remaining process context
  - Code, data, heap, and shared library segments of the process virtual address space
  - Open files and installed handlers
Operationally, this model is not strictly enforced:
- Register values are truly separate and protected, but…
- Any thread can read and write the stack of any other thread

CAUTION
The mismatch between the conceptual and operation model is a source of confusion and errors.

1
void *thread(void *vargp);
2

3
char **ptr; /* global var */
4

5
int main() {
6
  long i;
7
  pthread_t tid;
8
  char *msgs[2] = {
9
    "Hello from foo",
10
    "Hello from bar"
11
  };
12

13
  ptr = msgs;
14
  for (i = 0; i < 2; i++)
15
    pthread_create(&tid, NULL, thread, (void *)i);
16
  pthread_exit(NULL);
17
}
18

19
void *thread(void *vargp) {
20
  long myid = (long)vargp;
21
  static int cnt = 0;
22

23
  printf("[%ld]: %s (cnt=%d)\n", myid, ptr[myid], ++cnt);
24
  return NULL;
25
}

Peer threads reference main thread’s stack indirectly through global ptr variable
ptr, cnt, and msgs are shared, i and myid are not shared

Example for Improper Synchronizing Threads#

1
void *thread(void *vargp);
2

3
/* Global shared variable */
4
volatile long cnt = 0;
5

6
int main(int argc, char **argv) {
7
  long niters;
8
  pthread_t tid1, tid2;
9

10
  niters = atoi(argv[1]);
11
  pthread_create(&tid1, NULL, thread, &niters);
12
  pthread_create(&tid2, NULL, thread, &niters);
13
  pthread_join(tid1, NULL);
14
  pthread_join(tid2, NULL);
15

16
  /* Check result */
17
  if (cnt != (2 * niters))
18
    printf("BOOM! cnt=%ld\n", cnt);
19
  else
20
    printf("OK cnt=%ld\n", cnt);
21
  exit(0);
22
}
23

24
void *thread(void *vargp) {
25
  long i, niters = *((long *)vargp);
26
  for (i = 0; i < niters; i++)
27
    cnt++;
28
  return NULL;
29
}

1
linux> ./badcnt 10000
2
OK cnt=20000
3
linux> ./badcnt 10000
4
BOOM! cnt=13051
5
linux>

cnt should equal 20,000. What went wrong ?

Assembly Code for Counter Loop#

Concurrent Execution#

Key idea: In general, any sequentially consistent interleaving is possible, but some give an unexpected result !
- $I_{i}$ denotes that thread $i$ executes instruction $I$
- $\%rdx_{i}$ is the content of $\%rdx$ in thread $i$ ‘s context

Incorrect ordering: two threads increment the counter, but the result is 1 instead of 2

And the following ordering is still wrong

We can analyze the behavior using a progress graph

Progress Graphs#

A progress graph depicts the discrete execution state space of concurrent threads
Each axis corresponds to the sequential order of instructions in a thread
Each point corresponds to a possible execution state $( I_{1} ,I_{2})$ $(I_{1}, I_{2})$
- E.g., $( L_{1} ,S_{2})$ denotes state where thread 1 has completed $L_{1}$ and thread 2 has completed $S_{2}$

Trajectories in Progress Graphs#

A trajectory is a sequence of legal state transitions that describes one possible concurrent execution of the threads

Critical Sections and Unsafe Regions#

$L$ , $U$ , and $S$ form a critical section with respect to the shared variable cnt
Instructions in critical sections (write some shared variable) should not be interleaved
Sets of states where such interleaving occurs form unsafe regions
Def: A trajectory is safe iff it does not enter any unsafe region
Claim: A trajectory is correct (write cnt) iff it is safe

Enforcing Mutual Exclusion#

Question: How can we guarantee a safe trajectory ?
Answer: We must synchronize the execution of the threads so that they can never have an unsafe trajectory
- i.e., need to guarantee mutually exclusive access for each critical section
Classic solution:
- Semaphores (Edsger Dijkstra)
Other approaches:
- Mutex and condition variables (Pthreads)
- Monitors (Java)

Semaphores#

Semaphore: non-negative global integer synchronization variable. Manipulated by P and V operations (P and V correspond to the dutch words Proberen and Verhogen respectively)
P(s)
- If s is nonzero, then decrement s by 1 and return immediately
  - Test and decrement operations occur atomically (indivisibly)
- If s is zero, then suspend thread until s becomes nonzero and the thread is restarted by a V operation
- After restarting, the P operation decrements s and returns control to the caller
V(s)
- Increment s by 1
  - Increment operation occurs atomically
- If there are any threads blocked in a P operation waiting for s to become non-zero, then restart exactly one of those threads, which then completes its P operation by decrementing s
Semaphore invariant: $s\geqslant 0$

Using Semaphores for Mutual Exclusion#

1
#include <semaphore.h>
2

3
int sem_init(sem_t *s, 0, unsigned int val); /* s = val */
4
int sem_wait(sem_t *s); /* P(s) */
5
int sem_post(sem_t *s); /* V(s) */

Basic idea:
- Associate a unique semaphore mutex, initially 1, with each shared variable (or related set of shared variables)
- Surround corresponding critical sections with P(mutex) and V(mutex) operations
Terminology:
- Binary semaphore: semaphore whose value is always 0 or 1
- Mutex: binary semaphore used for mutual exclusion
  - P operation: “locking” the mutex
  - V operation: “unlocking” or “releasing” the mutex
  - “Holding” a mutex: locked and not yet unlocked
- Counting semaphore: used as a counter for set of available resources

Fix for Improper Synchronizing Threads#

Define and initialize a mutex for the shared variable cnt:

1
volatile long cnt = 0;  /* Counter */
2
sem_t mutex;            /* Semaphore that protects cnt */
3
sem_init(&mutex, 0, 1); /* mutex = 1 */

Surround critical section with P and V:

1
for (i = 0; i < niters; i++) {
2
  sem_wait(&mutex);
3
  cnt++;
4
  sem_post(&mutex);
5
}

WARNING
Its orders of magnitude slower than improper one.

Why Mutexes Work#

Provide mutually exclusive access to shared variable by surrounding critical section with P and V operations on semaphore s (initially set to 1)
Semaphore invariant creates a forbidden region that encloses unsafe region and that cannot be entered by any trajectory

Using Semaphores to Coordinate Access to Shared Resources#

Basic idea: Thread uses a semaphore operation to notify another thread that some condition has become true
- Use counting semaphores to keep track of resource state and to notify other threads
- Use mutex to protect access to resource
Two classic examples:
- The Producer-Consumer Problem
- The Readers-Writers Problem

Producer-Consumer Problem#

Common synchronization pattern:
- Producer waits for empty slot, inserts item in buffer, and notifies consumer
- Consumer waits for item, removes it from buffer, and notifies producer
Examples:
- Multimedia processing:
  - Producer creates MPEG video frames, consumer renders them
- Event-driven graphical user interfaces
  - Producer detects mouse clicks, mouse movements, and keyboard hits and inserts corresponding events in buffer
  - Consumer retrieves events from buffer and paints the display

Producer-Consumer on an n-element Buffer#

Requires a mutex and two counting semaphores:
- mutex: enforces mutually exclusive access to the buffer
- slots: counts the available slots in the buffer
- items: counts the available items in the buffer
Implemented using a shared circular (ring) buffer package called sbuf

sbuf Package#

1
typedef struct {
2
  int *buf;    /* Buffer array */
3
  int n;       /* Maximum number of slots */
4
  int front;   /* buf[(front+1)%n] is first item */
5
  int rear;    /* buf[rear%n] is last item */
6
  sem_t mutex; /* Protects accesses to buf */
7
  sem_t slots; /* Counts available slots */
8
  sem_t items; /* Counts available items */
9
} sbuf_t;
10

11
void sbuf_init(sbuf_t *sp, int n);
12
void sbuf_deinit(sbuf_t *sp);
13
void sbuf_insert(sbuf_t *sp, int item);
14
int sbuf_remove(sbuf_t *sp);

1
/* Create an empty, bounded, shared FIFO buffer with n slots */
2
void sbuf_init(sbuf_t *sp, int n) {
3
  sp->buf = calloc(n, sizeof(int));
4
  sp->n = n;                  /* Buffer holds max of n items */
5
  sp->front = sp->rear = 0;   /* Empty buffer iff front == rear */
6
  sem_init(&sp->mutex, 0, 1); /* Binary semaphore for locking */
7
  sem_init(&sp->slots, 0, n); /* Initially, buf has n empty slots */
8
  sem_init(&sp->items, 0, 0); /* Initially, buf has 0 items */
9
}
10

11
/* Clean up buffer sp */
12
void sbuf_deinit(sbuf_t *sp) {
13
  free(sp->buf);
14
}
15

16
/* Insert item onto the rear of shared buffer sp */
17
void sbuf_insert(sbuf_t *sp, int item) {
18
  P(&sp->slots);                        /* Wait for available slot */
19
  P(&sp->mutex);                        /* Lock the buffer */
20
  sp->buf[(++sp->rear)%(sp->n)] = item; /* Insert the item */
21
  V(&sp->mutex);                        /* Unlock the buffer */
22
  V(&sp->items);                        /* Announce available item */
23
}
24

25
/* Remove and return the first item from buffer sp */
26
int sbuf_remove(sbuf_t *sp) {
27
  int item;
28
  P(&sp->items);                         /* Wait for available item */
29
  P(&sp->mutex);                         /* Lock the buffer */
30
  item = sp->buf[(++sp->front)%(sp->n)]; /* Remove the item */
31
  V(&sp->mutex);                         /* Unlock the buffer */
32
  V(&sp->slots);                         /* Announce available slot */
33
  return item;
34
}

Readers-Writers Problem#

Generalization of the mutual exclusion problem
Problem statement:
- Reader threads only read the object
- Writer threads modify the object
- Writers must have exclusive access to the object
- Unlimited number of readers can access the object
Occurs frequently in real systems:
- Online airline reservation system
- Multithreaded caching Web proxy

Variants of Readers-Writers#

First readers-writers problem (favors readers)
- No reader should be kept waiting unless a writer has already been granted permission to use the object
- A reader that arrives after a waiting writer gets priority over the writer
Second readers-writers problem (favors writers)
- Once a writer is ready to write, it performs its write as soon as possible
- A reader that arrives after a writer must wait, even if the writer is also waiting
Starvation (where a thread waits indefinitely) is possible in both cases

Solution to First Readers-Writers Problem#

1
int readcnt; /* Initially = 0 */
2
sem_t mutex, w; /* Initially = 1 */
3

4
void reader(void) {
5
  while (1) {
6
    P(&mutex);
7
    readcnt++;
8
    if (readcnt == 1) /* First in */
9
      P(&w);
10
    V(&mutex);
11

12
    /* Critical section */
13
    /* Reading happens */
14

15
    P(&mutex);
16
    readcnt--;
17
    if (readcnt == 0) /* Last out */
18
      V(&w);
19
    V(&mutex);
20
  }
21
}
22

23
void writer(void) {
24
  while (1) {
25
    P(&w);
26

27
    /* Critical section */
28
    /* Writing happens */
29

30
    V(&w);
31
  }
32
}

Putting It All Together: Prethreaded Concurrent Server#

Prethreaded Concurrent Server#

1
void *thread(void *vargp);
2

3
sbuf_t sbuf; /* Shared buffer of connected descriptors */
4

5
int main(int argc, char **argv) {
6
  int i, listenfd, connfd;
7
  socklen_t clientlen;
8
  struct sockaddr_storage clientaddr;
9
  pthread_t tid;
10

11
  listenfd = open_listenfd(argv[1]);
12
  sbuf_init(&sbuf, SBUFSIZE);
13
  for (i = 0; i < NTHREADS; i++) /* Create worker threads */
14
    pthread_create(&tid, NULL, thread, NULL);
15
  while (1) {
16
    clientlen = sizeof(struct sockaddr_storage);
17
    connfd = accept(listenfd, (SA *)&clientaddr, &clientlen);
18
    sbuf_insert(&sbuf, connfd); /* Insert connfd in buffer */
19
  }
20
}
21

22
void *thread(void *vargp) {
23
  pthread_detach(pthread_self());
24
  while (1) {
25
    int connfd = sbuf_remove(&sbuf); /* Remove connfd from buf */
26
    echo_cnt(connfd);                /* Service client */
27
    close(connfd);
28
  }
29
}

1
/* echo_cnt initialization routine */
2
static int byte_cnt; /* Byte counter */
3
static sem_t mutex;  /* and the mutex that protects it */
4

5
static void init_echo_cnt(void) {
6
  sem_init(&mutex, 0, 1);
7
  byte_cnt = 0;
8
}
9

10
/* Worker thread service routine */
11
void echo_cnt(int connfd) {
12
  int n;
13
  char buf[MAXLINE];
14
  rio_t rio;
15
  static pthread_once_t once = PTHREAD_ONCE_INIT;
16

17
  pthread_once(&once, init_echo_cnt);
18
  rio_readinitb(&rio, connfd);
19
  while((n = rio_readlineb(&rio, buf, MAXLINE)) != 0) {
20
    P(&mutex);
21
    byte_cnt += n;
22
    printf("thread %d received %d (%d total) bytes on fd %d\n", (int)pthread_self(), n, byte_cnt, connfd);
23
    V(&mutex);
24
    rio_writen(connfd, buf, n);
25
  }
26
}

Crucial concept: Thread Safety#

Functions called from a thread must be thread‐safe
Def: A function is thread-safe iff it will always produce correct results when called repeatedly from multiple concurrent threads
Classes of thread-unsafe functions:
- Class 1: Functions that do not protect shared variables
- Class 2: Functions that keep state across multiple invocations
- Class 3: Functions that return a pointer to a static variable
- Class 4: Functions that call thread-unsafe functions

Thread-Unsafe Functions (Class 1)#

Failing to protect shared variables
- Fix: Use P and V semaphore operations
- Issue: Synchronization operations will slow down code

Thread-Unsafe Functions (Class 2)#

Relying on persistent state across multiple function invocations
- Example: Random number generator that relies on static state

1
static unsigned int next = 1;
2

3
/* rand: return pseudo-random integer on 0..32767 */
4
int rand(void) {
5
  next = next*1103515245 + 12345;
6
  return (unsigned int)(next/65536) % 32768;
7
}
8

9
/* srand: set seed for rand() */
10
void srand(unsigned int seed) {
11
  next = seed;
12
}

Thread-Safe Random Number Generator#

Pass state as part of argument
- and, thereby, eliminate global state

1
/* rand_r - return pseudo-random integer on 0..32767 */
2

3
int rand_r(int *nextp) {
4
  *nextp = *nextp * 1103515245 + 12345;
5
  return (unsigned int)(*nextp/65536) % 32768;
6
}

Consequence: programmer using rand_r must maintain seed

Thread-Unsafe Functions (Class 3)#

Returning a pointer to a static variable
Fix 1. Rewrite function so caller passes address of variable to store result
- Requires changes in caller and callee
Fix 2. Lock-and‐copy
- Requires simple changes in caller (and none in callee)
- However, caller must free memory

1
/* lock-and-copy version */
2
char *ctime_ts(const time_t *timep, char *privatep) {
3
  char *sharedp;
4

5
  P(&mutex);
6
  sharedp = ctime(timep);
7
  strcpy(privatep, sharedp);
8
  V(&mutex);
9
  return privatep;
10
}

Thread-Unsafe Functions (Class 4)#

Calling thread-unsafe functions
- Calling one thread-unsafe function makes the entire function that calls it thread-unsafe
- Fix: Modify the function so it calls only thread-safe functions

Reentrant Functions#

Def: A function is reentrant iff it accesses no shared variables when called by multiple threads
- Important subset of thread-safe functions
  - Require no synchronization operations
  - Only way to make a Class 2 function thread-safe is to make it reetnrant (e.g., rand_r)

Thread-Safe Library Functions#

All functions in the Standard C Library (at the back of your K&R text) are thread-safe
- Examples: malloc, free, printf, scanf
Most Unix system calls are thread-safe, with a few exceptions:

Thread-unsafe function	Class	Reentrant version
`asctime`	3	`asctime_r`
`ctime`	3	`ctime_r`
`gethostbyaddr`	3	`gethostbyaddr_r`
`gethostbyname`	3	`gethostbyname_r`
`inet_ntoa`	3	(none)
`localtime`	3	`localtime_r`
`rand`	2	`rand_r`

Races#

A race occurs when correctness of the program depends on one thread reaching point x before another thread reaches point y

1
void *thread(void *vargp);
2

3
/* A threaded program with a race */
4
int main() {
5
  pthread_t tid[N];
6
  int i;
7

8
  for (i = 0; i < N; i++)
9
    pthread_create(&tid[i], NULL, thread, &i);
10
  for (i = 0; i < N; i++)
11
    pthread_join(tid[i], NULL);
12
  exit(0);
13
}
14

15
/* Thread routine */
16
void *thread(void *vargp) {
17
  int myid = *((int *)vargp);
18
  printf("Hello from thread %d\n", myid);
19
  return NULL;
20
}

Race Illustration#

1
for (i = 0; i < N; i++)
2
  pthread_create(&tid[i], NULL, thread, &i);

Race between increment of i in main thread and deref of vargp in peer thread:
- If deref happens while i = 0, then OK
- Otherwise, peer thread gets wrong id value

Could this race really occur ?#

Main thread:

1
int i;
2
for (i = 0; i < 100; i++) {
3
  pthread_create(&tid, NULL, thread, &i);
4
}

Peer thread:

1
void *thread(void *vargp) {
2
  pthread_detach(pthread_self());
3
  int i = *((int *)vargp);
4
  save_value(i);
5
  return NULL;
6
}

Race Test
- If no race, then each thread would get different value of i
- Set of saved values would consist of one copy each of 0 through 99

Experimental Results#

Race Elimination#

1
/* Threaded program without the race */
2
int main() {
3
  pthread_t tid[N];
4
  int i, *ptr;
5

6
  for (i = 0; i < N; i++) {
7
    ptr = malloc(sizeof(int));
8
    *ptr = i;
9
    pthread_create(&tid[i], NULL, thread, ptr);
10
  }
11
  for (i = 0; i < N; i++)
12
    pthread_join(tid[i], NULL);
13
  exit(0);
14
}
15

16
/* Thread routine */
17
void *thread(void *vargp) {
18
  int myid = *((int *)vargp);
19
  free(vargp);
20
  printf("Hello from thread %d\n", myid);
21
  return NULL;
22
}

Deadlock#

Def: A process is deadlocked iff it is waiting for a condition that will never be true
Typical Scenario:
- Processes 1 and 2 needs two resources (A and B) to proceed
- Process 1 acquires A, waits for B
- Process 2 acquires B, waits for A
- Both will wait forever !

Deadlocking With Semaphores#

1
void *count(void *vargp);
2

3
int main() {
4
  pthread_t tid[2];
5
  sem_init(&mutex[0], 0, 1); /* mutex[0] = 1 */
6
  sem_init(&mutex[1], 0, 1); /* mutex[1] = 1 */
7
  pthread_create(&tid[0], NULL, count, (void*)0);
8
  pthread_create(&tid[1], NULL, count, (void*)1);
9
  pthread_join(tid[0], NULL);
10
  pthread_join(tid[1], NULL);
11
  printf("cnt=%d\n", cnt);
12
  exit(0);
13
}
14

15
void *count(void *vargp) {
16
  int i;
17
  int id = (int)vargp;
18
  for (i = 0; i < NITERS; i++) {
19
    P(&mutex[id]); P(&mutex[1-id]);
20
    cnt++;
21
    V(&mutex[id]); V(&mutex[1-id]);
22
  }
23
  return NULL;
24
}

Deadlock Visualized in Progress Graph#

Locking introduces the potential for deadlock
Any trajectory that enters the deadlock region will eventually reach the deadlock state, waiting for either $S_{0}$ or $S_{1}$ to become nonzero
Other trajectories luck out and skirt the deadlock region
Unfortunate fact: deadlock is often nondeterministic (race)

Avoiding Deadlock: Acquire shared resources in same order#

1
int main() {
2
  pthread_t tid[2];
3
  sem_init(&mutex[0], 0, 1); /* mutex[0] = 1 */
4
  sem_init(&mutex[1], 0, 1); /* mutex[1] = 1 */
5
  pthread_create(&tid[0], NULL, count, (void*) 0);
6
  pthread_create(&tid[1], NULL, count, (void*) 1);
7
  pthread_join(tid[0], NULL);
8
  pthread_join(tid[1], NULL);
9
  printf("cnt=%d\n", cnt);
10
  exit(0);
11
}
12

13
void *count(void *vargp) {
14
  int i;
15
  int id = (int) vargp;
16
  for (i = 0; i < NITERS; i++) {
17
    P(&mutex[0]); P(&mutex[1]);
18
    cnt++;
19
    V(&mutex[id]); V(&mutex[1-id]);
20
  }
21
  return NULL;
22
}

Avoided Deadlock in Progress Graph#

No way for trajectory to get stuck
Processes acquire locks in same order
Order in which locks released immaterial

Thread-Level Parallelism#

Parallel Computing Hardware
- Multicore
  - Multiple separate processors on single chip
- Hyperthreading
  - Efficient execution of multiple threads on single core

Typical Multicore Processor#

Multiple processors operating with coherent view of memory

Out-of-Order Processor Structure#

Instruction control dynamically converts program into stream of operations
Operations mapped onto functional units to execute in parallel

Hyperthreading Implementation#

Replicate enough instruction control to process K instruction streams
K copies of all registers
Share functional units

Example: Parallel Summation#

Sum numbers $0,...,n-1$ $0, ..., n - 1$
- Should add up to $\displaystyle \frac{n( n-1)}{2}$
Partition values $0,...,n-1$ $0, ..., n - 1$ into t ranges
- $\lfloor n/t\rfloor$ values in each range
- Each of t threads processes 1 range
- For simplicity, assume n is a multiple of t

First attempt: psum-mutex#

Simplest approach: Threads sum into a global variable protected by a semaphore mutex

1
void *sum_mutex(void *vargp);
2

3
long gsum = 0;          /* Global sum */
4
long nelems_per_thread; /* Number of elements to sum */
5
sem_t mutex;            /* Mutex to protect global sum */
6

7
int main(int argc, char **argv) {
8
  long i, nelems, log_nelems, nthreads, myid[MAXTHREADS];
9
  pthread_t tid[MAXTHREADS];
10
  /* Get input arguments */
11
  nthreads = atoi(argv[1]);
12
  log_nelems = atoi(argv[2]);
13
  nelems = (1L << log_nelems);
14
  nelems_per_thread = nelems / nthreads;
15
  sem_init(&mutex, 0, 1);
16

17
  /* Create peer threads and wait for them to finish */
18
  for (i = 0; i < nthreads; i++) {
19
    myid[i] = i;
20
    pthread_create(&tid[i], NULL, sum_mutex, &myid[i]);
21
  }
22

23
  for (i = 0; i < nthreads; i++)
24
    pthread_join(tid[i], NULL);
25

26
  /* Check final answer */
27
  if (gsum != (nelems * (nelems-1))/2)
28
    printf("Error: result=%ld\n", gsum);
29
  exit(0);
30
}
31

32
void *sum_mutex(void *vargp) {
33
  long myid = *((long *)vargp);          /* Extract thread ID */
34
  long start = myid * nelems_per_thread; /* Start element index */
35
  long end = start + nelems_per_thread;  /* End element index */
36
  long i;
37

38
  for (i = start; i < end; i++) {
39
    P(&mutex);
40
    gsum += i;
41
    V(&mutex);
42
  }
43
  return NULL;
44
}

psum-mutex Performance#

Shark machine with 8 cores, $n=2^{31}$

Nasty surprise:
- Single thread is very slow
- Gets slower as we use more cores

Next Attempt: psum-array#

Peer thread i sums into global array element psum[i]
Main waits for threads to finish, then sums elements of psum
Eliminates need for mutex synchronization

1
void *sum_array(void *vargp) {
2
  long myid = *((long *)vargp);          /* Extract thread ID */
3
  long start = myid * nelems_per_thread; /* Start element index */
4
  long end = start + nelems_per_thread;  /* End element index */
5
  long i;
6

7
  for (i = start; i < end; i++) {
8
    psum[myid] += i;
9
  }
10
  return NULL;
11
}

psum-array Performance#

Orders of magnitude faster than psum-mutex

Next Attempt: psum-local#

Reduce memory references by having peer thread i sum into a local variable (register)

1
void *sum_local(void *vargp) {
2
  long myid = *((long *)vargp);          /* Extract thread ID */
3
  long start = myid * nelems_per_thread; /* Start element index */
4
  long end = start + nelems_per_thread;  /* End element index */
5
  long i, sum = 0;
6

7
  for (i = start; i < end; i++) {
8
    sum += i;
9
  }
10
  psum[myid] = sum;
11
  return NULL;
12
}

psum-local Performance#

Significantly faster than psum-array

Characterizing Parallel Program Performance#

$p$ processor cores, $T_{k}$ is the running time using $k$ cores
Def. Speedup: $\displaystyle S_{p} =\frac{T_{1}}{T_{p}}$ $S_{p} = \frac{T _{1}}{T _{p}}$
- $S_{p}$ is relative speedup if $T_{1}$ is running time of parallel version of the code running on 1 core
- $S_{p}$ is absolute speedup if $T_{1}$ is running time of sequential version of code running on 1 core
- Absolute speedup is a much truer measure of the benefits of parallelism
Def. Efficiency: $\displaystyle E_{p} =\frac{S_{p}}{p} =\frac{T_{1}}{pT_{p}}$ $E_{p} = \frac{S _{p}}{p} = \frac{T _{1}}{p T _{p}}$
- Reported as a percentage in the range $( 0,100]$
- Measures the overhead due to parallelization

Performance of psum-local#

Efficiencies OK, not great
Our example is easily parallelizable
Real codes are often much harder to parallelize

Amdahl’s Law#

Gene Amdahl (Nov. 16, 1922 – Nov. 10, 2015)
Captures the difficulty of using parallelism to speed things up
Overall problem
- $T$ - Total sequential time required
- $p$ - Fraction of total that can be sped up ( $0\leqslant p\leqslant 1$ )
- $k$ - Speedup factor
Resulting Performance
- $\displaystyle T_{k} =\frac{pT}{k} +( 1-p) \cdot T$ $T_{k} = \frac{pT}{k} + (1 - p) \cdot T$
  - Portion which can be sped up runs $k$ times faster
  - Portion which cannot be sped up stays the same
- Least possible running time:
  - $k=\infty$
  - $T_{\infty } =( 1-p) \cdot T$

Amdahl’s Law Example#

Overall problem
- $T=10$
- $p=0.9$
- $k=9$
Resulting Performance
- $\displaystyle T_{9} =0.9\cdot \frac{10}{9} +0.1\cdot 10=1.0+1.0=2.0$
- Least possible running time:
  - $T_{\infty } =0.1\cdot 10.0=1.0$

A More Substantial Example: Sort#

Sort set of $N$ random numbers
Multiple possible algorithms
- Use parallel version of quicksort
Sequential quicksort of set of values $X$ $X$
- Choose “pivot” $p$ from $X$
- Rearrange $X$ $X$ into
  - $L$ : Values $\leqslant p$
  - $R$ : Values $\geqslant p$
- Recursively sort $L$ to get $\displaystyle L^{\prime }$
- Recursively sort $R$ to get $\displaystyle R^{\prime }$
- Return $\displaystyle L^{\prime } :p:R^{\prime }$

Sequential Quicksort Visualized#

Sequential Quicksort Code#

1
void qsort_serial(data_t *base, size_t nele) {
2
  if (nele <= 1)
3
    return;
4
  if (nele == 2) {
5
    if (base[0] > base[1])
6
      swap(base, base+1);
7
    return;
8
  }
9

10
  /* Partition returns index of pivot */
11
  size_t m = partition(base, nele);
12
  if (m > 1)
13
    qsort_serial(base, m);
14
  if (nele-1 > m+1)
15
    qsort_serial(base+m+1, nele-m-1);
16
}

Sort nele elements starting at base
- Recursively sort $L$ or $R$ if has more than one element

Parallel Quicksort#

Parallel quicksort of set of values $X$ $X$
- If $N\leqslant Nthresh$ , do sequential quicksort
- Else
  - Choose “pivot” $p$ from $X$
  - Rearrange $X$ $X$ into
    - $L$ : Values $\leqslant p$
    - $R$ : Values $\geqslant p$
  - Recursively spawn separate threads
    - Sort $L$ to get $\displaystyle L^{\prime }$
    - Sort $R$ to get $\displaystyle R^{\prime }$
- Return $\displaystyle L^{\prime } :p:R^{\prime }$

Parallel Quicksort Visualized#

Thread Structure: Sorting Tasks#

Task: Sort subrange of data
- Specify as:
  - base: Starting address
  - nele: Number of elements in subrange
Run as separate thread

Small Sort Task Operation#

Sort subrange using serial quicksort

Large Sort Task Operation#

Top-Level Function (Simplified)#

1
void tqsort(data_t *base, size_t nele) {
2
  init_task(nele);
3
  global_base = base;
4
  global_end = global_base + nele - 1;
5
  task_queue_ptr tq = new_task_queue();
6
  tqsort_helper(base, nele, tq);
7
  join_tasks(tq);
8
  free_task_queue(tq);
9
}

Sets up data structures
Calls recursive sort routine
Keeps joining threads until none left
Frees data structures

Recursive sort routine (Simplified)#

1
/* Multi-threaded quicksort */
2
static void tqsort_helper(data_t *base, size_t nele, task_queue_ptr tq) {
3
  if (nele <= nele_max_sort_serial) {
4
    /* Use sequential sort */
5
    qsort_serial(base, nele);
6
    return;
7
  }
8
  sort_task_t *t = new_task(base, nele, tq);
9
  spawn_task(tq, sort_thread, (void *) t);
10
}

Small partition: Sort serially
Large partition: Spawn new sort task

Sort task thread (Simplified)#

1
/* Thread routine for many-threaded quicksort */
2
static void *sort_thread(void *vargp) {
3
  sort_task_t *t = (sort_task_t *) vargp;
4
  data_t *base = t->base;
5
  size_t nele = t->nele;
6
  task_queue_ptr tq = t->tq;
7
  free(vargp);
8
  size_t m = partition(base, nele);
9
  if (m > 1)
10
    tqsort_helper(base, m, tq);
11
  if (nele-1 > m+1)
12
    tqsort_helper(base+m+1, nele-m-1, tq);
13
  return NULL;
14
}

Get task parameters
Perform partitioning step
Call recursive sort routine on each partition

Parallel Quicksort Performance#

Serial fraction: Fraction of input at which do serial sort
Sort $2^{27}$ (134,217,728) random values
Best speedup = 6.84X
Good performance over wide range of fraction values
- F too small: Not enough parallelism
- F too large: Thread overhead + run out of thread memory

Amdahl’s Law & Parallel Quicksort#

Sequential bottleneck
- Top-level partition: No speedup
- Second level: $\leqslant 2X$ speedup
- $k^{th}$ level: $\leqslant 2^{k-1} X$ speedup
Implications
- Good performance for small-scale parallelism
- Would need to parallelize partitioning step to get large-scale parallelism
  - Parallel Sorting by Regular Sampling (H. Shi & J. Schaeffer, J. Parallel & Distributed Computing, 1992)

Parallelizing Partitioning Step#

Experience with Parallel Partitioning#

Could not obtain speedup
Speculate: Too much data copying
- Could not do everything within source array
- Set up temporary space for reassembling partition

Lessons Learned#

Must have parallelization strategy
- Partition into k independent parts
- Divide-and-conquer
Inner loops must be synchronization free
- Synchronization operations very expensive
Beware of Amdahl’s Law
- Serial code can become bottleneck

Memory Consistency#

What are the possible values printed ?
- Depends on memory consistency model
- Abstract model of how hardware handles concurrent accesses
Sequential consistency
- Overall effect consistent with each individual thread
- Otherwise, arbitrary interleaving

Sequential Consistency Example#

Impossible outputs
- 100, 1 and 1, 100
- Would require reaching both Ra and Rb before Wa and Wb

Non-Coherent Cache Scenario#

Write-back caches, without coordination between them

Snoopy Caches#

Tag each cache block with state
- Invalid - Cannot use value
- Shared - Readable copy
- Exclusive - Writeable copy

When cache sees request for one of its E-tagged blocks
- Supply value from cache
- Set tag to S

前言#

Bits, Bytes, and Integers#

Representation information as bits#

Everything is bits#

Each bit is 0 or 1#

By encoding/interpreting sets of bits in various ways#

Why bits ? Electronic Implementation#

Bit-level manipulations#

Representing & Manipulating Sets#

Representing Sets#

Sets Operations#

Shift Operations#

Left Shift: x << y#

Right Shift: x >> y#

Undefined Behaviour#

Integers#

Representation: unsigned and signed#

Unsigned#

Two’s Complement#

Invert mappings#

Numeric Ranges#

Observations#

Difference between Unsigned & Signed Numeric Values#

Conversion, casting#

Constants#

Casting#

Expression Evaluation#

Expanding, truncating#

Sign Extension#

Expanding (e.g., short int to int)#

Truncating (e.g., unsigned to unsigned short)#

Addition, negation, multiplication, shifting#

Unsigned Addition#

Visualizing (Mathematical) Integer Addition#

Visualizing Unsigned Addition#

Two’s Complement Addition#

TAdd Overflow#

Visualizing Two’s Complement Addition#

Multiplication#

Unsigned Multiplication in C#

Signed Multiplication in C#

Power-of-2 Multiply with Shift#

Unsigned Power-of-2 Divide with Shift#

Signed Power-of-2 Divide with Shift#

Case 1: No rounding#

Case 2: Rounding#

Handy tricks#

When Should I Use Unsigned ?#

Representation in memory, pointers, strings#

Byte-Oriented Memory Organization#

Machine Words#

Word‐Oriented Memory Organization#

Byte Ordering#

Representation Strings#

Floating Point#

Background: Fractional binary numbers#

Representing#

Observations#

Limitation 1#

Limitation 2#

Definition: IEEE Floating Point Standard#

IEEE Standard 754#

Floating Point Representation#

Encoding#

Normalized Values#

Normalized Encoding Example#

Denormalized Values#

Special Values#

Visualization: Floating Point Encodings#

Example and properties#

Tiny Floating Point Example#

Dynamic Range (Positive Only)#

Distribution of Values#

Special Properties of the IEEE Encoding#

Rounding, addition, multiplication#

Floating Point Operations: Basic Idea#

Basic idea#

Rounding#

Closer Look at Round-To-Even#

Default Rounding Mode#

Left Shift: `x << y`#

Right Shift: `x >> y`#

General “While” Translation 1 (“Jump‐to-middle” translation, compiled with `-Og`)#

General “While” Translation 2 (Compiled with `-O1`)#