LuaJIT Source Code Primer

This is a compilation of key topics and concepts that I had to learn or revisit as I studied the internals of LuaJIT to understand how it sets itself apart. If you’re as rusty as I was, you might appreciate this recap.

DISCLAIMER: This article was generated by an LLM.

Chapter 1: Foundations Refreshed

1.1 The Role of a Compiler and JIT

Before diving into LuaJIT specifics, it’s useful to revisit what a compiler is and the distinctions between compilation strategies:

LuaJIT is a tracing JIT, a specific kind of JIT compiler that observes the runtime behavior of a program to identify hot paths—typically loops or frequently executed branches. These paths are recorded as linear sequences of operations called traces, which are then compiled to optimized machine code.

This approach yields very fast performance for dynamic languages, especially when the code inside loops stabilizes to predictable types and control flow.

1.2 Runtime Representation of Values

One of the key challenges in implementing dynamic languages like Lua is how to represent a variety of types in memory efficiently. LuaJIT solves this through NaN-tagging and pointer tagging.

NaN Tagging

NaN tagging leverages the IEEE-754 floating-point format. A double (64-bit floating-point number) is laid out as follows:

sign (1 bit) | exponent (11 bits) | mantissa (52 bits)

An exponent field of all 1s (2047) indicates either infinity (mantissa zero) or a NaN (mantissa nonzero). Hardware floating-point operations only ever produce a single canonical NaN bit pattern, so the vast majority of NaN mantissa patterns are never generated by arithmetic and are free to carry other information.

LuaJIT exploits this: values that are not floating-point numbers are encoded using these otherwise-unused NaN bit patterns. For example, nil, true, false, and references to GC objects such as tables, strings, and functions each get a distinct tag in the high mantissa bits, with any payload (such as an object reference) stored in the remaining bits.

This means that every TValue (tagged Lua value) is a 64-bit word, and the type information is embedded directly within it, saving both time and space compared to storing types and values separately.

Example Encoding

Here’s a simplified, illustrative encoding layout; the exact tag assignments in LuaJIT differ by version and platform (see lj_obj.h):
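
A minimal sketch of the idea in C, assuming hypothetical tag values (real LuaJIT packs its tags differently):

#include <stdint.h>
#include <string.h>

/* Hypothetical NaN-boxing layout: anything >= TAG_NIL is a tagged value,
 * anything below is the bit pattern of an ordinary double. */
#define TAG_NIL    0xFFF9000000000000ULL
#define TAG_FALSE  0xFFFA000000000000ULL
#define TAG_TRUE   0xFFFB000000000000ULL
#define TAG_GCOBJ  0xFFFC000000000000ULL  /* low 48 bits: object address */

typedef uint64_t TValue;

static TValue box_number(double d) {
  TValue v; memcpy(&v, &d, sizeof v);  /* doubles are stored unboxed, as-is */
  return v;
}

static int tv_is_number(TValue v) { return v < TAG_NIL; }

static double unbox_number(TValue v) {
  double d; memcpy(&d, &v, sizeof d);
  return d;
}

static void *unbox_gcobj(TValue v) {
  return (void *)(uintptr_t)(v & 0x0000FFFFFFFFFFFFULL);
}

Note that the type test is a single unsigned comparison, which is exactly why this representation is cheap.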

Tagged Object References

GC-tracked Lua objects (like tables, strings, and functions) live in heap-allocated memory, and references to them must carry type information. The examples below use low-bit pointer tagging to illustrate the technique in its general form:

Tagged Pointer Memory Layout

Let’s say we have a table pointer 0x7ff8a0, which is 8-byte aligned, so its low 3 bits are guaranteed to be zero. With a hypothetical TAG_TABLE value of 5, the tagged pointer would be:

0x7ff8a0 | TAG_TABLE   →   0x7ff8a5   (low 3 bits encode the type)

To retrieve the raw pointer:

raw_ptr = tagged_ptr & ~0x7

To retrieve the type:

type_tag = tagged_ptr & 0x7
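
These operations as C helpers, a compact sketch assuming 8-byte-aligned allocations and hypothetical tag values:

#include <stdint.h>
#include <assert.h>

enum { TAG_TABLE = 5, TAG_MASK = 0x7 };  /* illustrative tag assignments */

static uintptr_t tag_ptr(void *p, unsigned tag) {
  assert(((uintptr_t)p & TAG_MASK) == 0);  /* alignment frees the low bits */
  return (uintptr_t)p | tag;
}

static void *untag_ptr(uintptr_t t) {
  return (void *)(t & ~(uintptr_t)TAG_MASK);
}

static unsigned ptr_tag(uintptr_t t) {
  return (unsigned)(t & TAG_MASK);
}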

Pointer Tagging (More Generally)

Pointer tagging is not just about type discrimination. Other auxiliary data can be embedded, such as GC mark or color bits, a flag distinguishing small immediate integers from real heap references, or generation hints for the collector.

Pointer tagging is only viable when allocations are aligned strongly enough that the low bits of every pointer are known to be zero, and when all code that dereferences a tagged pointer reliably masks the tag off first.

Unboxed Floating Point Numbers

In many dynamic language VMs, floating-point values must be heap-allocated (boxed) if they need to coexist with pointers in generic containers (like an array of TValue). This causes major performance problems due to extra indirection and memory pressure.

LuaJIT avoids this by not boxing floating-point numbers: a double is stored directly in the 64-bit TValue itself, and only non-number types are encoded as NaN payloads. Numeric code therefore never allocates or dereferences anything just to touch a number.

This optimization works because of the NaN-tagging scheme—types are known from bits alone, not separate metadata.

1.3 Stack Slots vs Table Slots

Stack Slots

In LuaJIT, stack slots are used for function arguments, locals, and temporary values during bytecode execution.

This slot-based architecture supports efficient interpretation and JIT tracing, since register allocation can map slots directly to physical registers or known stack offsets.

Table Slots

Lua tables in LuaJIT are implemented with two parts:

  1. Array Part: a contiguous C array holding the values for consecutive integer keys; indexing it is a bounds check plus a direct load.
  2. Hash Part: a node array for all other keys, organized as a chained scatter table with Brent’s variation; colliding keys are linked into short lookup chains walked from a key’s main position (see Chapter 5.3).

This organization supports Lua’s dynamic, flexible table semantics while allowing optimizations like compiling array accesses down to direct indexed loads and caching the resolved hash slot for constant keys, as sketched below.
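
A simplified sketch of the resulting lookup path; the real structures live in lj_obj.h and lj_tab.c, and everything below (names, hash function) is illustrative:

#include <stdint.h>
#include <stddef.h>

typedef struct Node { uint64_t key, val; struct Node *next; } Node;

typedef struct Table {
  uint64_t *array;  /* array part: values for small integer keys */
  uint32_t  asize;
  Node     *hash;   /* hash part: chained scatter table */
  uint32_t  hmask;  /* number of hash slots minus 1 (power of two) */
} Table;

static uint32_t hash_key(uint64_t k) { return (uint32_t)(k ^ (k >> 32)); }

static uint64_t *table_get(Table *t, uint64_t key, int is_small_int) {
  if (is_small_int && (uint32_t)key < t->asize)
    return &t->array[key];  /* fast path: direct indexed load */
  /* Slow path: walk the collision chain from the key's main position. */
  for (Node *n = &t->hash[hash_key(key) & t->hmask]; n != NULL; n = n->next)
    if (n->key == key)
      return &n->val;
  return NULL;  /* key absent */
}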


In the next chapter, we’ll cover call frames, function invocation internals, and how LuaJIT handles execution state across nested calls and VM boundaries.

For a deeper dive into tagged value layouts and the NaN-tagging implementation in C, the relevant structs are defined in lj_obj.h in the LuaJIT source.


Chapter 2: The Call Stack and Execution Frames

2.1 Call Frames in LuaJIT

When a Lua function is invoked, the virtual machine needs to set up an activation record or call frame to store the function being called, a link back to the caller’s frame and its saved program counter, and the base of the slot range that will hold the callee’s arguments and locals.

LuaJIT does not use the C call stack for each Lua function. Instead, it manages its own virtual stack of TValue slots in a contiguous, growable memory block per coroutine. This design gives it flexibility (cheap coroutines, easy stack inspection) and eliminates many of the costs of native calls.

Components of a LuaJIT Call Frame

A typical call frame in LuaJIT contains a reference to the executing function object, packed frame-link information (the caller’s frame and return address), and the slot range for arguments and locals; the metadata lives in the value stack itself, right next to the slots it governs.

This layout is maintained in the lua_State (thread) structure and manipulated using internal APIs.

Frame Linkage

Call frames are chained together to support nested calls. Each frame stores a pointer to the previous one. This structure allows for unwinding the stack during return or error handling.

This also plays a role in trace stitching, where a trace ends in a function call and another trace resumes when the called function returns.
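
A deliberately simplified picture of that linkage, using a plain struct (in real LuaJIT the link is packed into stack slots; see lj_frame.h):

typedef struct Frame {
  struct Frame *prev;       /* caller's frame */
  const void  *return_pc;   /* bytecode address to resume in the caller */
  void        *func;        /* function object running in this frame */
} Frame;

/* Unwinding during return or error handling is just a walk up the chain. */
static void unwind_to(Frame **cur, Frame *target) {
  while (*cur != target)
    *cur = (*cur)->prev;
}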

Fast Function Calls

LuaJIT goes further: many function calls (especially monomorphic ones) are inlined or lowered into extremely fast jumps or tail calls when traced. In some cases no frame exists at runtime at all: the callee’s body is recorded straight into the caller’s trace and the call vanishes from the compiled code.


2.2 Slot-Based Execution Model

LuaJIT’s VM executes bytecode using a slot-based execution model, somewhat like a register machine.

This is significantly faster than a stack-based VM that constantly pushes and pops values. (Standard Lua itself switched from a stack-based to a register-based VM in version 5.0; LuaJIT keeps that register-like slot model and pushes it further.)
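
As a rough illustration, a statement like local c = a + b becomes a single slot-addressed instruction; the listing below is approximate, and the real bytecode for any chunk can be inspected with luajit -bl:

-- local c = a + b    (a in slot 0, b in slot 1, c in slot 2)
ADDVV 2 0 1           -- slot[2] = slot[0] + slot[1]

-- versus a stack machine, which needs something like:
-- PUSH a; PUSH b; ADD; STORE c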

Advantages of Slots

Slots behave like virtual registers: operands are addressed directly by index, intermediate values incur no push/pop traffic, and the JIT can map a slot one-to-one onto a physical register or a fixed stack offset.

Slot Lifetime and Reuse

LuaJIT performs liveness analysis to determine how long a slot is needed. Slots can be reused aggressively once values go dead. This helps keep the active part of the stack small and cache-resident, and it lowers register pressure when slots are mapped onto machine registers.


2.3 Call Frames and the JIT Compiler

When tracing, LuaJIT captures the semantics of what the Lua VM is doing—this includes function calls.

Trace Entry and Exit

A trace begins at a hot loop or a frequently executed function entry point. The trace records VM operations across function calls and captures the values in the relevant slots.

If the trace exits (e.g., a type mismatch or function call not matching the expected shape), control jumps back to the interpreter.

Inline Function Calls

The JIT compiler aggressively inlines small, predictable functions during trace generation. This means no frame setup, no argument copying, and the callee’s operations being optimized together with the caller’s, exactly as if the code had been written inline.

This design lets LuaJIT erase most of the abstraction cost of function calls on hot paths, while guards and trace exits preserve correct behavior whenever a call site stops being predictable.


2.4 Vararg Functions and C Functions

Lua allows functions to take a variable number of arguments (...), and this has to be handled in call frame setup: fixed parameters are copied into their expected slots, while the extra arguments are kept behind a special vararg frame so that ... can be expanded on demand.

For C functions (i.e., Lua functions implemented in C via the C API), LuaJIT builds a special call frame that hands over execution to the C runtime. These are not traced, but the VM ensures the correct stack layout and GC state is preserved.
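
For reference, here is what such a C function looks like on the other side of the boundary; this uses the standard Lua C API, which LuaJIT implements:

#include <lua.h>
#include <lauxlib.h>

/* A C function callable from Lua: takes two numbers, returns their sum.
 * Arguments arrive on the Lua stack; the return value is the result count. */
static int l_add(lua_State *L) {
  lua_Number a = luaL_checknumber(L, 1);
  lua_Number b = luaL_checknumber(L, 2);
  lua_pushnumber(L, a + b);
  return 1;  /* one result left on the stack */
}

/* Registered from host code with:
 *   lua_pushcfunction(L, l_add);
 *   lua_setglobal(L, "add");
 * After that, Lua code can call add(1, 2). */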


2.5 Return Values and Frame Rewinding

Returning from a function involves moving the result values down into the slots the caller expects, restoring the caller’s frame base and saved program counter from the frame link, and adjusting the stack top for the declared number of results.

In traced code, LuaJIT knows ahead of time how many values are returned and to which slots. This allows elision of many dynamic checks or memory movement operations during trace execution.


In Chapter 3, we’ll dive into intermediate representations and how LuaJIT uses a compact SSA-based, orthogonal IR to drive its optimization pipeline.

Chapter 3: Intermediate Representation in LuaJIT

3.1 What is an Intermediate Representation (IR)?

An intermediate representation (IR) is a data structure or code format used internally by a compiler to represent source code in a way that facilitates analysis and transformation. It serves as the bridge between the front end (parsing and bytecode interpretation) and the back end (code generation and optimization).

LuaJIT’s JIT compiler translates hot bytecode traces into a highly compact IR format that is SSA-based, orthogonal, and pointer-free; the next three sections take these properties in turn.

Each IR instruction is a 64-bit structure representing a low-level operation.

3.2 SSA-Based IR

Static Single Assignment (SSA) means that each variable is assigned exactly once. Every new value gets a new unique name (or register).

This has several advantages:

In LuaJIT, the IR is generated in SSA form from the very beginning of trace recording.

Example (simplified IR in SSA)

i1 = add i0, 1
i2 = mul i1, 2

Each value (i1, i2) is computed once and never overwritten. You can always track a variable’s source unambiguously.

Phi Nodes

When control flow branches (e.g., in if or loops), SSA needs phi nodes to merge values:

if (x) then
    a = 1
else
    a = 2
end
b = a + 3

Becomes:

a1 = 1
...
a2 = 2
a3 = phi(a1, a2)
b = add a3, 3

LuaJIT’s IR includes phi instructions, but because traces are linear, merges like the one above never arise inside a single trace; phis are mainly needed for loop-carried values at a trace’s back edge.

3.3 Orthogonal IR Design

An IR is orthogonal when the operations it defines are independent of operand types and instruction variants. This reduces the number of opcodes needed and simplifies optimization passes.

LuaJIT’s IR follows this principle: a small set of generic opcodes (arithmetic, loads, stores, guards) is reused across operand types, with the result type carried in a field of the instruction rather than multiplied into the opcode space.

This allows for small, composable transformations and compact in-memory representation.

Each IR instruction contains:

3.4 Pointer-Free IR

LuaJIT’s IR avoids using direct memory pointers between instructions.

Instead, instructions are stored in a linear array and referenced by 16-bit index.

Benefits

References cost two bytes instead of a machine pointer, the IR is trivially relocatable and serializable, and optimization passes scan it as one dense sequential array, which is exactly the access pattern caches and prefetchers reward.
3.5 The FOLD Engine

FOLD is the name of LuaJIT’s constant folding and simplification engine. It is invoked during trace recording to simplify expressions on the fly.

FOLD operates on IR as it is being built: before a new instruction is appended, its opcode and operands are matched against a table of rewrite rules, which may return a constant, an already-existing instruction, or a simpler replacement instead of emitting anything new.

Examples: add(x, 0) folds to x; arithmetic on two constants folds to a constant; comparisons with statically known outcomes disappear; and algebraic identities and strength reductions fire whenever operands allow.

The FOLD engine avoids creating unnecessary IR, reducing trace size and speeding up compilation.

In many ways, it is a mini abstract interpreter, executing parts of the program at trace time to simplify them.
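
A toy version of the mechanism, with a deliberately tiny opcode set and hypothetical names (the real rule engine lives in lj_opt_fold.c and dispatches on opcode/operand patterns):

#include <stdint.h>

typedef uint16_t IRRef;
enum { IR_KNUM, IR_ADD };  /* miniature opcode set for the sketch */
typedef struct Ins { uint8_t op; double k; IRRef op1, op2; } Ins;

/* Append-only IR buffer, as in a trace recorder. */
static IRRef emit(Ins *ir, IRRef *top, Ins i) { ir[*top] = i; return (*top)++; }

/* FOLD-style entry point: simplify an ADD before it reaches the buffer. */
static IRRef emit_add(Ins *ir, IRRef *top, IRRef a, IRRef b) {
  if (ir[a].op == IR_KNUM && ir[b].op == IR_KNUM)       /* const + const */
    return emit(ir, top, (Ins){IR_KNUM, ir[a].k + ir[b].k, 0, 0});
  if (ir[b].op == IR_KNUM && ir[b].k == 0.0)
    return a;                                            /* x + 0  ==>  x */
  return emit(ir, top, (Ins){IR_ADD, 0.0, a, b});        /* no rule applied */
}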


In the next chapter, we will explore how LuaJIT selects what code to trace, how it handles loops and branches, and the concepts of root traces, side traces, and natural loop first heuristics.

Chapter 4: Tracing and Region Compilation

4.1 What is Tracing Compilation?

Tracing JIT compilers do not compile whole functions or modules up front. Instead, they monitor code as it runs and identify hot paths—typically tight loops or frequently taken branches.

Once a hot path is detected, the VM:

  1. Starts recording a linear sequence of operations (a trace) as it executes bytecode.
  2. Transforms this trace into IR (Intermediate Representation).
  3. Applies optimizations and simplifications (e.g., constant folding, CSE).
  4. Emits native machine code.

The resulting native code is inserted into the runtime so that next time the same path is reached, it jumps directly to compiled code.

LuaJIT’s traces can span multiple function calls (when inlined), loop iterations, and conditions.

4.2 Root Traces and Side Traces

Root Trace

A root trace starts at a top-level hot loop or conditional branch. This is the first trace compiled for a region.

Example:

for i = 1, 1000 do
  sum = sum + arr[i]
end

If this loop becomes hot, LuaJIT begins a root trace starting at the loop header.

Side Trace

A side trace branches off from a guard failure in an existing trace.

Suppose the trace expects arr[i] to be a number, but at iteration 500, it becomes a string. The current trace exits, and LuaJIT starts recording a new trace starting from the failure point. This side trace handles the “exceptional” case.

Guards are inserted throughout traces to verify assumptions made during tracing (types, control flow, etc.). If any guard fails, execution leaves the machine code through that guard’s exit, the interpreter’s state is reconstructed from a snapshot taken at recording time, and interpretation resumes; exits that fire often become the anchor points for side traces.

This trace tree architecture allows LuaJIT to specialize paths and recover gracefully from dynamic behavior.

4.3 Trace Selection: Natural Loop First (NLF)

LuaJIT uses a natural loop first (NLF) strategy to guide what to trace.

Natural Loops

A natural loop is a region of code with a single entry point (the loop header) that dominates every block inside the loop, plus a back edge returning to that header.

These loops are ideal tracing candidates because they run many iterations with stable types and control flow, so the one-time cost of recording and compiling is amortized almost immediately.

NLF ensures tracing begins at the most profitable regions first—inner loops and stable conditional blocks.

Trace Thresholds

The VM keeps counters at loop headers. When a counter exceeds a hotness threshold (56 iterations for loops by default), tracing is initiated.

You can tweak these thresholds with LuaJIT’s jit.opt module, e.g., jit.opt.start("hotloop=50").

Outer Traces and Tail Calls

NLF also helps avoid tracing large, monolithic functions. By breaking execution into traces focused on loops, LuaJIT can apply aggressive optimizations locally without overgeneralizing.

Tail-recursive functions are often turned into self-recursive traces without growing the trace tree.

4.4 Guards and Control Flow

Guards are conditions inserted into the compiled trace to check assumptions. Examples: a slot still holds a number; a table’s metatable is unchanged; a branch goes the same direction it went during recording; the function about to be called is the same closure seen while tracing.

When a guard fails, LuaJIT exits the trace and resumes interpretation or jumps to a side trace.

This design is crucial: guards are what make aggressive specialization safe. Each one compiles to a cheap compare-and-branch on the fast path, while any violated assumption routes execution back to fully general code; a sketch of an emitted type guard follows.
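
Expressed as C, reusing the NaN-boxing helpers from the Chapter 1 sketch (all names illustrative):

#include <stdint.h>

extern int tv_is_number(uint64_t v);     /* from the NaN-boxing sketch */
extern double unbox_number(uint64_t v);
extern void exit_stub(int exit_no);      /* hands control back to the VM */

/* One loop step as a trace would compile it: guard, then unguarded math. */
double traced_step(uint64_t *slot, double sum) {
  if (!tv_is_number(slot[2])) {  /* guard: slot 2 must still hold a number */
    exit_stub(3);                /* guard failed: take trace exit #3 */
    return sum;
  }
  return sum + unbox_number(slot[2]);  /* fast path, no further checks */
}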

4.5 Trace Linking and Stitching

Once multiple traces are compiled, LuaJIT can link them at runtime: a side trace’s final branch jumps directly into its parent or into another compiled trace, and loop traces branch back to their own headers, so control passes from trace to trace without re-entering the interpreter.

Trace stitching allows the program to run entirely in compiled mode across complex control flow.

Benefits

Fewer interpreter round-trips, guard costs amortized across linked traces, and compiled code that covers whole regions of the program rather than isolated loops.


In Chapter 5, we’ll study how LuaJIT performs optimization, including CSE, DSE, alias analysis, and the subtle trade-offs involved in a trace compiler.

Chapter 5: Optimization Strategies

One of LuaJIT’s standout features is its highly optimized trace compiler. The IR traces recorded during runtime are subjected to a suite of optimization passes to reduce runtime overhead, eliminate redundancy, and exploit predictable behavior. This chapter dives into some of the core strategies involved.

5.1 Classical Compiler Optimizations in LuaJIT

Common Subexpression Elimination (CSE)

CSE removes redundant computations by identifying when an expression has already been evaluated and reusing its result.

Example:

x = a + b
...
y = a + b

Becomes:

x = a + b
y = x

In IR terms, if two instructions have the same opcode and operands, and neither operand has changed, the second can reuse the first’s result. LuaJIT’s FOLD engine performs this early during IR construction.

Benefits: fewer instructions to execute, fewer live values competing for registers, and smaller traces that compile and run faster. The prev chain described in Chapter 3.3 is what makes this lookup cheap, as the sketch below shows.
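
A sketch of that lookup, reusing the IRInsSketch type from Chapter 3.3 (the chain array and all names are illustrative):

/* chain[op] holds the most recent instruction with opcode op; each
 * instruction's prev link points to the previous one with that opcode. */
static IRRef cse_lookup(IRInsSketch *ir, IRRef *chain,
                        uint8_t op, IRRef a, IRRef b) {
  for (IRRef ref = chain[op]; ref != 0; ref = ir[ref].prev)
    if (ir[ref].op1 == a && ir[ref].op2 == b)
      return ref;  /* identical computation found: reuse its result */
  return 0;        /* miss: the caller emits a new instruction */
}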

Dead Store Elimination (DSE)

DSE removes writes to memory or stack slots that are later overwritten without being read.

Example:

x = 1
x = 2  -- the previous write is dead

LuaJIT can eliminate the first assignment entirely, recognizing that its value is never observed.

Dead stores arise from:

Constant Folding and Propagation

This is handled by the FOLD engine: expressions whose operands are all constants are evaluated at trace-recording time, and the resulting constants are propagated into every instruction that uses them.

Benefits: branches on constant conditions disappear, address arithmetic collapses into fixed offsets, and every later pass sees shorter, simpler IR.

5.2 Alias Analysis

Lua is a dynamic language with mutable tables, closures, and upvalues. To optimize memory access safely, the compiler needs to know whether two variables may refer to the same memory. This is the job of alias analysis.

Challenges in Lua

Tables can be reached through any number of references at once, closures share mutable upvalues, and metatable operations can run arbitrary code, so almost nothing about memory can be assumed statically.

LuaJIT performs conservative alias analysis: two memory operations are assumed to potentially alias unless the IR can prove otherwise, for instance because they touch different parts of a table (array vs. hash), use distinct constant keys, or reference an object allocated inside the trace itself.

For example, two different table fields might alias if table identity isn’t known, so LuaJIT may insert guards or fall back to the interpreter.

To avoid pessimism, LuaJIT specializes traces: loads and stores are tied to the specific table identities and key types observed during recording, with guards keeping those assumptions honest at runtime.

This lets safe, fast paths dominate while preserving correctness.

5.3 Hash Part Lookup Chains

Lua tables are backed by both an array part and a hash part. LuaJIT implements the hash part as a chained scatter table with Brent’s variation: every key hashes to a main position in the node array, and colliding keys are linked into short lookup chains.

Advantages over open-addressing schemes such as linear probing: a lookup only visits keys that actually share a chain, Brent’s variation keeps chains short by relocating colliding entries, and unrelated keys never contaminate each other’s probe sequences.

LuaJIT specializes traces on hot table accesses by guarding on the table and key observed during recording and then addressing the resolved node directly (the IR has a dedicated constant-key hash reference, HREFK, for this).

This lets common field accesses (like obj.x) compile into single load instructions after guards.
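
Schematically, the specialized access replaces the whole chain walk with one guarded load (names and types illustrative, in the spirit of HREFK):

#include <stdint.h>

typedef struct Node { uint64_t key, val; struct Node *next; } Node;

/* During recording, the key "x" resolved to a specific node of this table's
 * hash part. The compiled trace re-checks only that fact, then loads. */
static int load_obj_x(Node *resolved, uint64_t key_x, uint64_t *out) {
  if (resolved->key != key_x)  /* guard: node still holds the key "x" */
    return 0;                  /* guard failure: take a trace exit */
  *out = resolved->val;        /* single load: no hashing, no chain walk */
  return 1;
}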

5.4 Hoisting and Sinking

LuaJIT may hoist loop-invariant work out of a loop: loads that no store inside the loop can alias, guards that only need to hold once per loop entry, and pure computations whose operands never change across iterations.

Sinking pushes work toward the exits that actually need it, and it is especially useful for memory allocations: an object created inside a loop that never escapes the trace is not allocated on the fast path at all; it is materialized only if a trace exit that can observe it is actually taken.

5.5 Trace Specialization

Every trace is specialized based on the runtime state during recording: the types seen in each slot, the identities of the functions and upvalues involved, and the direction every branch took.

This allows highly optimized code generation but introduces fragility—if any assumption breaks, guards force a trace exit.

In practice, traces are very stable for numerical loops, pure functions, and object method calls.


In Chapter 6, we’ll explore how these design decisions impact hardware-level performance, including data cache behavior, code layout, and memory access patterns.

Chapter 6: Hardware and Cache Awareness

LuaJIT’s performance isn’t solely due to clever compilation strategies and IR design—it also excels by being cache-friendly and minimizing memory pressure. This chapter focuses on how LuaJIT interacts with the underlying hardware, particularly the data cache (D-cache), instruction cache (I-cache), and memory locality.

6.1 D-Cache Impact

Modern CPUs access memory through multiple levels of cache. The data cache (D-cache) stores recently accessed data, and cache misses can cost hundreds of CPU cycles.

LuaJIT is optimized to keep its working set small, lay data out contiguously, and touch memory in predictable, sequential patterns.

Key Techniques

Values are single 8-byte NaN-tagged words, so type and value share one cache access; the virtual stack is one contiguous slot array; and the IR is a dense linear buffer addressed by 16-bit indices.

This design reduces the chance of cache thrashing and makes speculative loads (like from traces) more likely to hit.

6.2 Code Layout and I-Cache

The instruction cache (I-cache) stores recently executed machine code. If JIT-compiled code is large, complex, or fragmented, I-cache misses can degrade performance.

LuaJIT mitigates this by keeping traces short and specialized, emitting them contiguously into a dedicated machine-code area, and moving rarely executed paths out of line.

Cold Path Separation

Rarely executed error checks, metatable fallbacks, or guards that almost never fail are emitted to cold sections of the trace.

This keeps the frequently executed instructions densely packed, so the hot path occupies fewer I-cache lines and the common case runs as straight-line code.

6.3 Allocation and Memory Pressure

Memory allocation has a direct impact on performance and cache usage.

LuaJIT bundles its own memory allocator, tuned for the small, short-lived objects dynamic code produces: allocation fast paths are only a handful of instructions, and objects allocated together tend to end up near each other in memory.

The garbage collector (GC) is tuned for low overhead: it is an incremental mark-and-sweep collector that runs in small steps interleaved with program execution, avoiding long pauses that would evict the caches.

This GC-awareness ensures that allocations do not break cache locality or evict valuable data during hot path execution.

6.4 IR and Cache Behavior

The IR is stored as a linear array of 64-bit instructions. Since each instruction is referenced by a numeric ID (not a pointer), the memory layout is dense and friendly to the D-cache.

Benefits: optimization passes scan the IR sequentially, which hardware prefetchers reward; 16-bit references pack more of a trace into each cache line; and there are no pointer-chasing dependencies to stall loads.

6.5 Trace Execution and CPU Pipelines

JIT-compiled traces are optimized for modern CPU execution pipelines: they are mostly straight-line code, guard branches are arranged so the fall-through is the hot case, and specialization removes the data-dependent branching that defeats prediction.

Example:

A simple numeric loop in LuaJIT compiles to a short sequence that loads operands from contiguous slots, does its arithmetic entirely in registers, and ends in a single predictable backward branch.

Such a trace can often run entirely out of the CPU’s L1 caches with very few pipeline stalls.

6.6 Summary of Hardware-Aware Design

Design Element          Hardware Benefit
Slot-based VM           Linear access is D-cache friendly
SSA-based IR            Reduced memory pressure
Guard separation        Smaller hot paths win in the I-cache
Dense IR layout         Prefetch-friendly, no pointer chasing
Trace specialization    Fewer branches keep the pipeline full
Cold path separation    Less I-cache pollution

In the Appendix, we’ll provide a glossary of terms and references to specific LuaJIT source files and further reading.

Appendix: Glossary and Further Reading

A. Glossary of Terms

Trace: a recorded linear sequence of operations covering one hot path through the program.
Root trace: the first trace compiled for a region, starting at a hot loop header or function entry.
Side trace: a trace compiled from a frequently taken guard exit of an existing trace.
Guard: a runtime check compiled into a trace that verifies an assumption made during recording.
TValue: LuaJIT’s 64-bit tagged value, holding either an unboxed double or a NaN-encoded tagged value.
NaN tagging: encoding non-number types in the otherwise-unused payload bits of IEEE-754 NaNs.
SSA: static single assignment form, in which every value is defined exactly once.
FOLD: LuaJIT’s on-the-fly constant-folding and simplification engine.
Trace linking: connecting compiled traces so control flows between them without the interpreter.

B. Key Source Files

These files are critical to understanding LuaJIT’s internals (all under src/ in the LuaJIT repository):

lj_obj.h: core object and TValue definitions, including the tagging scheme
lj_ir.h: IR instruction format and opcode definitions
lj_record.c: the trace recorder that turns executing bytecode into IR
lj_opt_fold.c: the FOLD engine and its rewrite rules
lj_trace.c: trace management, hotness counters, and linking
lj_asm.c: the IR-to-machine-code backend
lj_tab.c: the table implementation (array and hash parts)
lj_frame.h: call frame layout and linkage

C. Further Reading and References

Papers and Articles: the LuaJIT project site (luajit.org) and the LuaJIT wiki collect Mike Pall’s design notes, including his mailing-list discussions of NaN tagging and the IR; for broader background on tracing JITs, see Gal et al., "Trace-based Just-in-Time Type Specialization for Dynamic Languages" (PLDI 2009).

Tools for Exploration: luajit -bl dumps bytecode for a chunk; the bundled jit.v and jit.dump modules (luajit -jv, luajit -jdump) report trace starts and aborts and show the recorded IR and generated machine code.
