LuaJIT Source Code Primer

This is a compilation of key topics and concepts that I had to learn or revisit as I studied the internals of LuaJIT to understand how it sets itself apart. If you’re as rusty as I was, you might appreciate this recap.

DISCLAIMER: This article was generated by an LLM.

Chapter 1: Foundations Refreshed

1.1 The Role of a Compiler and JIT

Before diving into LuaJIT specifics, it’s useful to revisit what a compiler is and the distinctions between compilation strategies:

LuaJIT is a tracing JIT, a specific kind of JIT compiler that observes the runtime behavior of a program to identify hot paths—typically loops or frequently executed branches. These paths are recorded as linear sequences of operations called traces, which are then compiled to optimized machine code.

This approach yields very fast performance for dynamic languages, especially when the code inside loops stabilizes to predictable types and control flow.

1.2 Runtime Representation of Values

One of the key challenges in implementing dynamic languages like Lua is how to represent a variety of types in memory efficiently. LuaJIT solves this through NaN-tagging and pointer tagging.

NaN Tagging

NaN tagging leverages the IEEE-754 floating-point format. A double (64-bit floating-point number) is laid out as follows:

sign (1 bit) | exponent (11 bits) | mantissa (52 bits)

An exponent field of all 1s (2047) indicates either infinity (mantissa zero) or a NaN (mantissa nonzero). Hardware floating-point operations only ever produce a single canonical NaN bit pattern, so the vast majority of NaN mantissa patterns are never generated by arithmetic and are free to carry other information.

LuaJIT exploits this: values that are not floating-point numbers are encoded using these otherwise-unused NaN bit patterns. For example, nil, true, false, and references to GC objects such as tables, strings, and functions each get a distinct tag in the high mantissa bits, with any payload (such as an object reference) stored in the remaining bits.

This means that every TValue (tagged Lua value) is a 64-bit word, and the type information is embedded directly within it, saving both time and space compared to storing types and values separately.

Example Encoding

Here’s a simplified, illustrative encoding layout; the exact tag assignments in LuaJIT differ by version and platform (see lj_obj.h):
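
A minimal sketch of the idea in C, assuming hypothetical tag values (real LuaJIT packs its tags differently):

#include <stdint.h>
#include <string.h>

/* Hypothetical NaN-boxing layout: anything >= TAG_NIL is a tagged value,
 * anything below is the bit pattern of an ordinary double. */
#define TAG_NIL    0xFFF9000000000000ULL
#define TAG_FALSE  0xFFFA000000000000ULL
#define TAG_TRUE   0xFFFB000000000000ULL
#define TAG_GCOBJ  0xFFFC000000000000ULL  /* low 48 bits: object address */

typedef uint64_t TValue;

static TValue box_number(double d) {
  TValue v; memcpy(&v, &d, sizeof v);  /* doubles are stored unboxed, as-is */
  return v;
}

static int tv_is_number(TValue v) { return v < TAG_NIL; }

static double unbox_number(TValue v) {
  double d; memcpy(&d, &v, sizeof d);
  return d;
}

static void *unbox_gcobj(TValue v) {
  return (void *)(uintptr_t)(v & 0x0000FFFFFFFFFFFFULL);
}

Note that the type test is a single unsigned comparison, which is exactly why this representation is cheap.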

Tagged Object References

GC-tracked Lua objects (like tables, strings, and functions) live in heap-allocated memory, and references to them must carry type information. The examples below use low-bit pointer tagging to illustrate the technique in its general form:

Tagged Pointer Memory Layout

Let’s say we have a table pointer 0x7ff8a0, which is 8-byte aligned, so its low 3 bits are guaranteed to be zero. With a hypothetical TAG_TABLE value of 5, the tagged pointer would be:

0x7ff8a0 | TAG_TABLE   →   0x7ff8a5   (low 3 bits encode the type)

To retrieve the raw pointer:

raw_ptr = tagged_ptr & ~0x7

To retrieve the type:

type_tag = tagged_ptr & 0x7
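
These operations as C helpers, a compact sketch assuming 8-byte-aligned allocations and hypothetical tag values:

#include <stdint.h>
#include <assert.h>

enum { TAG_TABLE = 5, TAG_MASK = 0x7 };  /* illustrative tag assignments */

static uintptr_t tag_ptr(void *p, unsigned tag) {
  assert(((uintptr_t)p & TAG_MASK) == 0);  /* alignment frees the low bits */
  return (uintptr_t)p | tag;
}

static void *untag_ptr(uintptr_t t) {
  return (void *)(t & ~(uintptr_t)TAG_MASK);
}

static unsigned ptr_tag(uintptr_t t) {
  return (unsigned)(t & TAG_MASK);
}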

Pointer Tagging (More Generally)

Pointer tagging is not just about type discrimination. Other auxiliary data can be embedded, such as GC mark or color bits, a flag distinguishing small immediate integers from real heap references, or generation hints for the collector.

Pointer tagging is only viable when allocations are aligned strongly enough that the low bits of every pointer are known to be zero, and when all code that dereferences a tagged pointer reliably masks the tag off first.

Unboxed Floating Point Numbers

In many dynamic language VMs, floating-point values must be heap-allocated (boxed) if they need to coexist with pointers in generic containers (like an array of TValue). This causes major performance problems due to extra indirection and memory pressure.

LuaJIT avoids this by not boxing floating-point numbers: a double is stored directly in the 64-bit TValue itself, and only non-number types are encoded as NaN payloads. Numeric code therefore never allocates or dereferences anything just to touch a number.

This optimization works because of the NaN-tagging scheme—types are known from bits alone, not separate metadata.

1.3 Stack Slots vs Table Slots

Stack Slots

In LuaJIT, stack slots are used for function arguments, locals, and temporary values during bytecode execution.

This slot-based architecture supports efficient interpretation and JIT tracing, since register allocation can map slots directly to physical registers or known stack offsets.

Table Slots

Lua tables in LuaJIT are implemented with two parts:

  1. Array Part: a contiguous C array holding the values for consecutive integer keys; indexing it is a bounds check plus a direct load.
  2. Hash Part: a node array for all other keys, organized as a chained scatter table with Brent’s variation; colliding keys are linked into short lookup chains walked from a key’s main position (see Chapter 5.3).

This organization supports Lua’s dynamic, flexible table semantics while allowing optimizations like compiling array accesses down to direct indexed loads and caching the resolved hash slot for constant keys, as sketched below.
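
A simplified sketch of the resulting lookup path; the real structures live in lj_obj.h and lj_tab.c, and everything below (names, hash function) is illustrative:

#include <stdint.h>
#include <stddef.h>

typedef struct Node { uint64_t key, val; struct Node *next; } Node;

typedef struct Table {
  uint64_t *array;  /* array part: values for small integer keys */
  uint32_t  asize;
  Node     *hash;   /* hash part: chained scatter table */
  uint32_t  hmask;  /* number of hash slots minus 1 (power of two) */
} Table;

static uint32_t hash_key(uint64_t k) { return (uint32_t)(k ^ (k >> 32)); }

static uint64_t *table_get(Table *t, uint64_t key, int is_small_int) {
  if (is_small_int && (uint32_t)key < t->asize)
    return &t->array[key];  /* fast path: direct indexed load */
  /* Slow path: walk the collision chain from the key's main position. */
  for (Node *n = &t->hash[hash_key(key) & t->hmask]; n != NULL; n = n->next)
    if (n->key == key)
      return &n->val;
  return NULL;  /* key absent */
}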


In the next chapter, we’ll cover call frames, function invocation internals, and how LuaJIT handles execution state across nested calls and VM boundaries.

For a deeper dive into tagged value layouts and the NaN-tagging implementation in C, the relevant structs are defined in lj_obj.h in the LuaJIT source.


Chapter 2: The Call Stack and Execution Frames

2.1 Call Frames in LuaJIT

When a Lua function is invoked, the virtual machine needs to set up an activation record or call frame to store the function being called, a link back to the caller’s frame and its saved program counter, and the base of the slot range that will hold the callee’s arguments and locals.

LuaJIT does not use the C call stack for each Lua function. Instead, it manages its own virtual stack of TValue slots in a contiguous, growable memory block per coroutine. This design gives it flexibility (cheap coroutines, easy stack inspection) and eliminates many of the costs of native calls.

Components of a LuaJIT Call Frame

A typical call frame in LuaJIT contains a reference to the executing function object, packed frame-link information (the caller’s frame and return address), and the slot range for arguments and locals; the metadata lives in the value stack itself, right next to the slots it governs.

This layout is maintained in the lua_State (thread) structure and manipulated using internal APIs.

Frame Linkage

Call frames are chained together to support nested calls. Each frame stores a pointer to the previous one. This structure allows for unwinding the stack during return or error handling.

This also plays a role in trace stitching, where a trace ends in a function call and another trace resumes when the called function returns.
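
A deliberately simplified picture of that linkage, using a plain struct (in real LuaJIT the link is packed into stack slots; see lj_frame.h):

typedef struct Frame {
  struct Frame *prev;       /* caller's frame */
  const void  *return_pc;   /* bytecode address to resume in the caller */
  void        *func;        /* function object running in this frame */
} Frame;

/* Unwinding during return or error handling is just a walk up the chain. */
static void unwind_to(Frame **cur, Frame *target) {
  while (*cur != target)
    *cur = (*cur)->prev;
}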

Fast Function Calls

LuaJIT goes further: many function calls (especially monomorphic ones) are inlined or lowered into extremely fast jumps or tail calls when traced. In some cases no frame exists at runtime at all: the callee’s body is recorded straight into the caller’s trace and the call vanishes from the compiled code.


2.2 Slot-Based Execution Model

LuaJIT’s VM executes bytecode using a slot-based execution model, somewhat like a register machine.

This is significantly faster than a stack-based VM that constantly pushes and pops values. (Standard Lua itself switched from a stack-based to a register-based VM in version 5.0; LuaJIT keeps that register-like slot model and pushes it further.)
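
As a rough illustration, a statement like local c = a + b becomes a single slot-addressed instruction; the listing below is approximate, and the real bytecode for any chunk can be inspected with luajit -bl:

-- local c = a + b    (a in slot 0, b in slot 1, c in slot 2)
ADDVV 2 0 1           -- slot[2] = slot[0] + slot[1]

-- versus a stack machine, which needs something like:
-- PUSH a; PUSH b; ADD; STORE c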

Advantages of Slots

Slots behave like virtual registers: operands are addressed directly by index, intermediate values incur no push/pop traffic, and the JIT can map a slot one-to-one onto a physical register or a fixed stack offset.

Slot Lifetime and Reuse

LuaJIT performs liveness analysis to determine how long a slot is needed. Slots can be reused aggressively once values go dead. This helps keep the active part of the stack small and cache-resident, and it lowers register pressure when slots are mapped onto machine registers.


2.3 Call Frames and the JIT Compiler

When tracing, LuaJIT captures the semantics of what the Lua VM is doing—this includes function calls.

Trace Entry and Exit

A trace begins at a hot loop or a frequently executed function entry point. The trace records VM operations across function calls and captures the values in the relevant slots.

If the trace exits (e.g., a type mismatch or function call not matching the expected shape), control jumps back to the interpreter.

Inline Function Calls

The JIT compiler aggressively inlines small, predictable functions during trace generation. This means no frame setup, no argument copying, and the callee’s operations being optimized together with the caller’s, exactly as if the code had been written inline.

This design lets LuaJIT erase most of the abstraction cost of function calls on hot paths, while guards and trace exits preserve correct behavior whenever a call site stops being predictable.


2.4 Vararg Functions and C Functions

Lua allows functions to take a variable number of arguments (...), and this has to be handled in call frame setup: fixed parameters are copied into their expected slots, while the extra arguments are kept behind a special vararg frame so that ... can be expanded on demand.

For C functions (i.e., Lua functions implemented in C via the C API), LuaJIT builds a special call frame that hands over execution to the C runtime. These are not traced, but the VM ensures the correct stack layout and GC state is preserved.
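
For reference, here is what such a C function looks like on the other side of the boundary; this uses the standard Lua C API, which LuaJIT implements:

#include <lua.h>
#include <lauxlib.h>

/* A C function callable from Lua: takes two numbers, returns their sum.
 * Arguments arrive on the Lua stack; the return value is the result count. */
static int l_add(lua_State *L) {
  lua_Number a = luaL_checknumber(L, 1);
  lua_Number b = luaL_checknumber(L, 2);
  lua_pushnumber(L, a + b);
  return 1;  /* one result left on the stack */
}

/* Registered from host code with:
 *   lua_pushcfunction(L, l_add);
 *   lua_setglobal(L, "add");
 * After that, Lua code can call add(1, 2). */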


2.5 Return Values and Frame Rewinding

Returning from a function involves moving the result values down into the slots the caller expects, restoring the caller’s frame base and saved program counter from the frame link, and adjusting the stack top for the declared number of results.

In traced code, LuaJIT knows ahead of time how many values are returned and to which slots. This allows elision of many dynamic checks or memory movement operations during trace execution.


In Chapter 3, we’ll dive into intermediate representations and how LuaJIT uses a compact SSA-based, orthogonal IR to drive its optimization pipeline.

Chapter 3: Intermediate Representation in LuaJIT

3.1 What is an Intermediate Representation (IR)?

An intermediate representation (IR) is a data structure or code format used internally by a compiler to represent source code in a way that facilitates analysis and transformation. It serves as the bridge between the front end (parsing and bytecode interpretation) and the back end (code generation and optimization).

LuaJIT’s JIT compiler translates hot bytecode traces into a highly compact IR format that is SSA-based, orthogonal, and pointer-free; the next three sections take these properties in turn.

Each IR instruction is a 64-bit structure representing a low-level operation.

3.2 SSA-Based IR

Static Single Assignment (SSA) means that each variable is assigned exactly once. Every new value gets a new unique name (or register).

This has several advantages:

In LuaJIT, the IR is generated in SSA form from the very beginning of trace recording.

Example (simplified IR in SSA)

i1 = add i0, 1
i2 = mul i1, 2

Each value (i1, i2) is computed once and never overwritten. You can always track a variable’s source unambiguously.

Phi Nodes

When control flow branches (e.g., in if or loops), SSA needs phi nodes to merge values:

if (x) then
    a = 1
else
    a = 2
end
b = a + 3

Becomes:

a1 = 1
...
a2 = 2
a3 = phi(a1, a2)
b = add a3, 3

LuaJIT’s IR includes phi instructions, but because traces are linear, merges like the one above never arise inside a single trace; phis are mainly needed for loop-carried values at a trace’s back edge.

3.3 Orthogonal IR Design

An IR is orthogonal when the operations it defines are independent of operand types and instruction variants. This reduces the number of opcodes needed and simplifies optimization passes.

LuaJIT’s IR follows this principle: a small set of generic opcodes (arithmetic, loads, stores, guards) is reused across operand types, with the result type carried in a field of the instruction rather than multiplied into the opcode space.

This allows for small, composable transformations and compact in-memory representation.

Each IR instruction contains:

3.4 Pointer-Free IR

LuaJIT’s IR avoids using direct memory pointers between instructions.

Instead, instructions are stored in a linear array and referenced by 16-bit index.

Benefits

References cost two bytes instead of a machine pointer, the IR is trivially relocatable and serializable, and optimization passes scan it as one dense sequential array, which is exactly the access pattern caches and prefetchers reward.
3.5 The FOLD Engine

FOLD is the name of LuaJIT’s constant folding and simplification engine. It is invoked during trace recording to simplify expressions on the fly.

FOLD operates on IR as it is being built: before a new instruction is appended, its opcode and operands are matched against a table of rewrite rules, which may return a constant, an already-existing instruction, or a simpler replacement instead of emitting anything new.

Examples: add(x, 0) folds to x; arithmetic on two constants folds to a constant; comparisons with statically known outcomes disappear; and algebraic identities and strength reductions fire whenever operands allow.

The FOLD engine avoids creating unnecessary IR, reducing trace size and speeding up compilation.

In many ways, it is a mini abstract interpreter, executing parts of the program at trace time to simplify them.
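
A toy version of the mechanism, with a deliberately tiny opcode set and hypothetical names (the real rule engine lives in lj_opt_fold.c and dispatches on opcode/operand patterns):

#include <stdint.h>

typedef uint16_t IRRef;
enum { IR_KNUM, IR_ADD };  /* miniature opcode set for the sketch */
typedef struct Ins { uint8_t op; double k; IRRef op1, op2; } Ins;

/* Append-only IR buffer, as in a trace recorder. */
static IRRef emit(Ins *ir, IRRef *top, Ins i) { ir[*top] = i; return (*top)++; }

/* FOLD-style entry point: simplify an ADD before it reaches the buffer. */
static IRRef emit_add(Ins *ir, IRRef *top, IRRef a, IRRef b) {
  if (ir[a].op == IR_KNUM && ir[b].op == IR_KNUM)       /* const + const */
    return emit(ir, top, (Ins){IR_KNUM, ir[a].k + ir[b].k, 0, 0});
  if (ir[b].op == IR_KNUM && ir[b].k == 0.0)
    return a;                                            /* x + 0  ==>  x */
  return emit(ir, top, (Ins){IR_ADD, 0.0, a, b});        /* no rule applied */
}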


In the next chapter, we will explore how LuaJIT selects what code to trace, how it handles loops and branches, and the concepts of root traces, side traces, and natural loop first heuristics.

Chapter 4: Tracing and Region Compilation

4.1 What is Tracing Compilation?

Tracing JIT compilers do not compile whole functions or modules up front. Instead, they monitor code as it runs and identify hot paths—typically tight loops or frequently taken branches.

Once a hot path is detected, the VM:

  1. Starts recording a linear sequence of operations (a trace) as it executes bytecode.
  2. Transforms this trace into IR (Intermediate Representation).
  3. Applies optimizations and simplifications (e.g., constant folding, CSE).
  4. Emits native machine code.

The resulting native code is inserted into the runtime so that next time the same path is reached, it jumps directly to compiled code.

LuaJIT’s traces can span multiple function calls (when inlined), loop iterations, and conditions.

4.2 Root Traces and Side Traces

Root Trace

A root trace starts at a top-level hot loop or conditional branch. This is the first trace compiled for a region.

Example:

for i = 1, 1000 do
  sum = sum + arr[i]
end

If this loop becomes hot, LuaJIT begins a root trace starting at the loop header.

Side Trace

A side trace branches off from a guard failure in an existing trace.

Suppose the trace expects arr[i] to be a number, but at iteration 500, it becomes a string. The current trace exits, and LuaJIT starts recording a new trace starting from the failure point. This side trace handles the “exceptional” case.

Guards are inserted throughout traces to verify assumptions made during tracing (types, control flow, etc.). If any guard fails, execution leaves the machine code through that guard’s exit, the interpreter’s state is reconstructed from a snapshot taken at recording time, and interpretation resumes; exits that fire often become the anchor points for side traces.

This trace tree architecture allows LuaJIT to specialize paths and recover gracefully from dynamic behavior.

4.3 Trace Selection: Natural Loop First (NLF)

LuaJIT uses a natural loop first (NLF) strategy to guide what to trace.

Natural Loops

A natural loop is a region of code with a single entry point (the loop header) that dominates every block inside the loop, plus a back edge returning to that header.

These loops are ideal tracing candidates because they run many iterations with stable types and control flow, so the one-time cost of recording and compiling is amortized almost immediately.

NLF ensures tracing begins at the most profitable regions first—inner loops and stable conditional blocks.

Trace Thresholds

The VM keeps counters at loop headers. When a counter exceeds a hotness threshold (56 iterations for loops by default), tracing is initiated.

You can tweak these thresholds with LuaJIT’s jit.opt module, e.g., jit.opt.start("hotloop=50").

Outer Traces and Tail Calls

NLF also helps avoid tracing large, monolithic functions. By breaking execution into traces focused on loops, LuaJIT can apply aggressive optimizations locally without overgeneralizing.

Tail-recursive functions are often turned into self-recursive traces without growing the trace tree.

4.4 Guards and Control Flow

Guards are conditions inserted into the compiled trace to check assumptions. Examples: a slot still holds a number; a table’s metatable is unchanged; a branch goes the same direction it went during recording; the function about to be called is the same closure seen while tracing.

When a guard fails, LuaJIT exits the trace and resumes interpretation or jumps to a side trace.

This design is crucial: guards are what make aggressive specialization safe. Each one compiles to a cheap compare-and-branch on the fast path, while any violated assumption routes execution back to fully general code; a sketch of an emitted type guard follows.
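
Expressed as C, reusing the NaN-boxing helpers from the Chapter 1 sketch (all names illustrative):

#include <stdint.h>

extern int tv_is_number(uint64_t v);     /* from the NaN-boxing sketch */
extern double unbox_number(uint64_t v);
extern void exit_stub(int exit_no);      /* hands control back to the VM */

/* One loop step as a trace would compile it: guard, then unguarded math. */
double traced_step(uint64_t *slot, double sum) {
  if (!tv_is_number(slot[2])) {  /* guard: slot 2 must still hold a number */
    exit_stub(3);                /* guard failed: take trace exit #3 */
    return sum;
  }
  return sum + unbox_number(slot[2]);  /* fast path, no further checks */
}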

4.5 Trace Linking and Stitching

Once multiple traces are compiled, LuaJIT can link them at runtime: a side trace’s final branch jumps directly into its parent or into another compiled trace, and loop traces branch back to their own headers, so control passes from trace to trace without re-entering the interpreter.

Trace stitching allows the program to run entirely in compiled mode across complex control flow.

Benefits

Fewer interpreter round-trips, guard costs amortized across linked traces, and compiled code that covers whole regions of the program rather than isolated loops.


In Chapter 5, we’ll study how LuaJIT performs optimization, including CSE, DSE, alias analysis, and the subtle trade-offs involved in a trace compiler.

Chapter 5: Optimization Strategies

One of LuaJIT’s standout features is its highly optimized trace compiler. The IR traces recorded during runtime are subjected to a suite of optimization passes to reduce runtime overhead, eliminate redundancy, and exploit predictable behavior. This chapter dives into some of the core strategies involved.

5.1 Classical Compiler Optimizations in LuaJIT

Common Subexpression Elimination (CSE)

CSE removes redundant computations by identifying when an expression has already been evaluated and reusing its result.

Example:

x = a + b
...
y = a + b

Becomes:

x = a + b
y = x

In IR terms, if two instructions have the same opcode and operands, and neither operand has changed, the second can reuse the first’s result. LuaJIT’s FOLD engine performs this early during IR construction.

Benefits: fewer instructions to execute, fewer live values competing for registers, and smaller traces that compile and run faster. The prev chain described in Chapter 3.3 is what makes this lookup cheap, as the sketch below shows.
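
A sketch of that lookup, reusing the IRInsSketch type from Chapter 3.3 (the chain array and all names are illustrative):

/* chain[op] holds the most recent instruction with opcode op; each
 * instruction's prev link points to the previous one with that opcode. */
static IRRef cse_lookup(IRInsSketch *ir, IRRef *chain,
                        uint8_t op, IRRef a, IRRef b) {
  for (IRRef ref = chain[op]; ref != 0; ref = ir[ref].prev)
    if (ir[ref].op1 == a && ir[ref].op2 == b)
      return ref;  /* identical computation found: reuse its result */
  return 0;        /* miss: the caller emits a new instruction */
}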

Dead Store Elimination (DSE)

DSE removes writes to memory or stack slots that are later overwritten without being read.

Example:

x = 1
x = 2  -- the previous write is dead

LuaJIT can eliminate the first assignment entirely, recognizing that its value is never observed.

Dead stores arise from:

Constant Folding and Propagation

This is handled by the FOLD engine: expressions whose operands are all constants are evaluated at trace-recording time, and the resulting constants are propagated into every instruction that uses them.

Benefits: branches on constant conditions disappear, address arithmetic collapses into fixed offsets, and every later pass sees shorter, simpler IR.

5.2 Alias Analysis

Lua is a dynamic language with mutable tables, closures, and upvalues. To optimize memory access safely, the compiler needs to know whether two variables may refer to the same memory. This is the job of alias analysis.

Challenges in Lua

Tables can be reached through any number of references at once, closures share mutable upvalues, and metatable operations can run arbitrary code, so almost nothing about memory can be assumed statically.

LuaJIT performs conservative alias analysis: two memory operations are assumed to potentially alias unless the IR can prove otherwise, for instance because they touch different parts of a table (array vs. hash), use distinct constant keys, or reference an object allocated inside the trace itself.

For example, two different table fields might alias if table identity isn’t known, so LuaJIT may insert guards or fall back to the interpreter.

To avoid pessimism, LuaJIT specializes traces: loads and stores are tied to the specific table identities and key types observed during recording, with guards keeping those assumptions honest at runtime.

This lets safe, fast paths dominate while preserving correctness.

5.3 Hash Part Lookup Chains

Lua tables are backed by both an array part and a hash part. LuaJIT implements the hash part as a chained scatter table with Brent’s variation: every key hashes to a main position in the node array, and colliding keys are linked into short lookup chains.

Advantages over open-addressing schemes such as linear probing: a lookup only visits keys that actually share a chain, Brent’s variation keeps chains short by relocating colliding entries, and unrelated keys never contaminate each other’s probe sequences.

LuaJIT specializes traces on hot table accesses by guarding on the table and key observed during recording and then addressing the resolved node directly (the IR has a dedicated constant-key hash reference, HREFK, for this).

This lets common field accesses (like obj.x) compile into single load instructions after guards.
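
Schematically, the specialized access replaces the whole chain walk with one guarded load (names and types illustrative, in the spirit of HREFK):

#include <stdint.h>

typedef struct Node { uint64_t key, val; struct Node *next; } Node;

/* During recording, the key "x" resolved to a specific node of this table's
 * hash part. The compiled trace re-checks only that fact, then loads. */
static int load_obj_x(Node *resolved, uint64_t key_x, uint64_t *out) {
  if (resolved->key != key_x)  /* guard: node still holds the key "x" */
    return 0;                  /* guard failure: take a trace exit */
  *out = resolved->val;        /* single load: no hashing, no chain walk */
  return 1;
}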

5.4 Hoisting and Sinking

LuaJIT may hoist loop-invariant work out of a loop: loads that no store inside the loop can alias, guards that only need to hold once per loop entry, and pure computations whose operands never change across iterations.

Sinking pushes work toward the exits that actually need it, and it is especially useful for memory allocations: an object created inside a loop that never escapes the trace is not allocated on the fast path at all; it is materialized only if a trace exit that can observe it is actually taken.

5.5 Trace Specialization

Every trace is specialized based on the runtime state during recording: the types seen in each slot, the identities of the functions and upvalues involved, and the direction every branch took.

This allows highly optimized code generation but introduces fragility—if any assumption breaks, guards force a trace exit.

In practice, traces are very stable for numerical loops, pure functions, and object method calls.


In Chapter 6, we’ll explore how these design decisions impact hardware-level performance, including data cache behavior, code layout, and memory access patterns.

Chapter 6: Hardware and Cache Awareness

LuaJIT’s performance isn’t solely due to clever compilation strategies and IR design—it also excels by being cache-friendly and minimizing memory pressure. This chapter focuses on how LuaJIT interacts with the underlying hardware, particularly the data cache (D-cache), instruction cache (I-cache), and memory locality.

6.1 D-Cache Impact

Modern CPUs access memory through multiple levels of cache. The data cache (D-cache) stores recently accessed data, and cache misses can cost hundreds of CPU cycles.

LuaJIT is optimized to keep its working set small, lay data out contiguously, and touch memory in predictable, sequential patterns.

Key Techniques

Values are single 8-byte NaN-tagged words, so type and value share one cache access; the virtual stack is one contiguous slot array; and the IR is a dense linear buffer addressed by 16-bit indices.

This design reduces the chance of cache thrashing and makes speculative loads (like from traces) more likely to hit.

6.2 Code Layout and I-Cache

The instruction cache (I-cache) stores recently executed machine code. If JIT-compiled code is large, complex, or fragmented, I-cache misses can degrade performance.

LuaJIT mitigates this by keeping traces short and specialized, emitting them contiguously into a dedicated machine-code area, and moving rarely executed paths out of line.

Cold Path Separation

Rarely executed error checks, metatable fallbacks, or guards that almost never fail are emitted to cold sections of the trace.

This keeps the frequently executed instructions densely packed, so the hot path occupies fewer I-cache lines and the common case runs as straight-line code.

6.3 Allocation and Memory Pressure

Memory allocation has a direct impact on performance and cache usage.

LuaJIT bundles its own memory allocator, tuned for the small, short-lived objects dynamic code produces: allocation fast paths are only a handful of instructions, and objects allocated together tend to end up near each other in memory.

The garbage collector (GC) is tuned for low overhead: it is an incremental mark-and-sweep collector that runs in small steps interleaved with program execution, avoiding long pauses that would evict the caches.

This GC-awareness ensures that allocations do not break cache locality or evict valuable data during hot path execution.

6.4 IR and Cache Behavior

The IR is stored as a linear array of 64-bit instructions. Since each instruction is referenced by a numeric ID (not a pointer), the memory layout is dense and friendly to the D-cache.

Benefits: optimization passes scan the IR sequentially, which hardware prefetchers reward; 16-bit references pack more of a trace into each cache line; and there are no pointer-chasing dependencies to stall loads.

6.5 Trace Execution and CPU Pipelines

JIT-compiled traces are optimized for modern CPU execution pipelines: they are mostly straight-line code, guard branches are arranged so the fall-through is the hot case, and specialization removes the data-dependent branching that defeats prediction.

Example:

A simple numeric loop in LuaJIT compiles to a short sequence that loads operands from contiguous slots, does its arithmetic entirely in registers, and ends in a single predictable backward branch.

Such a trace can often run entirely out of the CPU’s L1 caches with very few pipeline stalls.

6.6 Summary of Hardware-Aware Design

Design Element          Hardware Benefit
Slot-based VM           Linear access is D-cache friendly
SSA-based IR            Reduced memory pressure
Guard separation        Smaller hot paths win in the I-cache
Dense IR layout         Prefetch-friendly, no pointer chasing
Trace specialization    Fewer branches keep the pipeline full
Cold path separation    Less I-cache pollution

In the Appendix, we’ll provide a glossary of terms and references to specific LuaJIT source files and further reading.

Appendix: Glossary and Further Reading

A. Glossary of Terms

Trace: a recorded linear sequence of operations covering one hot path through the program.
Root trace: the first trace compiled for a region, starting at a hot loop header or function entry.
Side trace: a trace compiled from a frequently taken guard exit of an existing trace.
Guard: a runtime check compiled into a trace that verifies an assumption made during recording.
TValue: LuaJIT’s 64-bit tagged value, holding either an unboxed double or a NaN-encoded tagged value.
NaN tagging: encoding non-number types in the otherwise-unused payload bits of IEEE-754 NaNs.
SSA: static single assignment form, in which every value is defined exactly once.
FOLD: LuaJIT’s on-the-fly constant-folding and simplification engine.
Trace linking: connecting compiled traces so control flows between them without the interpreter.

B. Key Source Files

These files are critical to understanding LuaJIT’s internals (all under src/ in the LuaJIT repository):

lj_obj.h: core object and TValue definitions, including the tagging scheme
lj_ir.h: IR instruction format and opcode definitions
lj_record.c: the trace recorder that turns executing bytecode into IR
lj_opt_fold.c: the FOLD engine and its rewrite rules
lj_trace.c: trace management, hotness counters, and linking
lj_asm.c: the IR-to-machine-code backend
lj_tab.c: the table implementation (array and hash parts)
lj_frame.h: call frame layout and linkage

C. Further Reading and References

Papers and Articles: the LuaJIT project site (luajit.org) and the LuaJIT wiki collect Mike Pall’s design notes, including his mailing-list discussions of NaN tagging and the IR; for broader background on tracing JITs, see Gal et al., "Trace-based Just-in-Time Type Specialization for Dynamic Languages" (PLDI 2009).

Tools for Exploration: luajit -bl dumps bytecode for a chunk; the bundled jit.v and jit.dump modules (luajit -jv, luajit -jdump) report trace starts and aborts and show the recorded IR and generated machine code.
