Design & Format Specification
Architecture, binary layout, and internal data structures of Mosaic v1.
File Format Layout
A Mosaic file consists of four sections, written sequentially:
Reading starts from the footer (last 32 bytes), which provides absolute offsets to locate the schema block and row group index.
Columnar-Bucket Hybrid
Mosaic is a columnar-bucket hybrid format. Columns are sorted by name and evenly distributed into buckets using range-based assignment:
bucket_id = sorted_position * num_buckets / num_columns
Within each bucket, data is stored column-oriented and independently compressed. This design enables efficient projection pushdown at bucket granularity — reading 10 columns out of 10,000 only decompresses the buckets that contain those 10 columns.
Range-based assignment ensures that columns with similar name prefixes
(e.g., sensor_temp_1, sensor_temp_2) land in the same bucket,
improving both compression ratio and projection locality.
The default is 100 buckets, automatically clamped to min(num_columns, 100).
The bucket assignment is deterministic and derived from the sorted column order —
it is not stored in the file.
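As a minimal sketch of the range-based assignment above, the following assumes floor (integer) division in the formula and plain code-point ordering for the name sort. The function name `assign_buckets` and its parameters are illustrative and not part of Mosaic's API.

```python
def assign_buckets(column_names, requested_buckets=100):
    """Map each column name to a bucket id using range-based assignment."""
    num_columns = len(column_names)
    if num_columns == 0:
        return {}
    # Clamp the bucket count to min(num_columns, requested_buckets).
    num_buckets = min(num_columns, requested_buckets)
    # Assumption: names are ordered by plain code-point comparison.
    sorted_names = sorted(column_names)
    # Assumption: the division in the formula is integer (floor) division.
    return {
        name: position * num_buckets // num_columns
        for position, name in enumerate(sorted_names)
    }


example = assign_buckets(
    ["zone", "sensor_temp_2", "alpha", "sensor_temp_1", "beta", "sensor_temp_3"],
    requested_buckets=3,
)
```

Because the mapping depends only on the sorted names and the bucket count, a reader can recompute it from the schema block without any stored assignment table.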
Encoding Strategy
Each column within a bucket is independently encoded. The writer selects the most compact encoding for each column:
| Encoding | Tag | When Used | Storage |
|---|---|---|---|
| PLAIN | 0 | Fallback for everything else | Raw values (fixed-width or varint-prefixed) + null bitmap |
| CONST | 1 | All non-null values are identical | One value + null bitmap |
| DICT | 2 | Number of distinct values ≤ 255 and total dict size ≤ 32 KB | Dictionary + bit-packed indices + null bitmap |
| ALL_NULL | 3 | Every value in the column is null | Zero bytes (no data, no bitmap) |
Column Encoding Selection
The encoding for each column is chosen automatically during writing based on value distribution and cost:
- ALL_NULL: 0 non-null values
- CONST: exactly 1 distinct non-null value (any number of nulls allowed)
- DICT: 2–255 distinct non-null values, and the dictionary-encoded size is smaller than plain. The writer compares varint(numEntries) + sum(entryBytes) + ceil(nonNullCount * bitWidth / 8) against the raw value buffer size.
- PLAIN: 256+ distinct values, dict tracking was abandoned, or dict encoding would be larger than plain
CONST detection is independent of dictionary tracking — it uses a lightweight byte comparison against the first non-null value, so it works for all types and value sizes (including long strings).
Dictionary encoding works for all data types including variable-width types (VARCHAR, VARBINARY, DECIMAL). Variable-width dictionary tracking is bounded by a configurable cumulative byte budget (default 32 KB) and abandoned when cardinality exceeds 255 or total dictionary entry bytes exceed the budget.
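The selection rules can be sketched roughly as follows. This is a simplified illustration, not the writer's actual code: it works on already-serialized cells (`bytes`, or `None` for null), collects distinct values with a set instead of tracking them incrementally, and assumes `entryBytes` is the raw length of each dictionary entry. The names `choose_encoding` and `varint_len` are hypothetical.

```python
PLAIN, CONST, DICT, ALL_NULL = 0, 1, 2, 3  # tags from the encoding table


def varint_len(n):
    """Number of bytes LEB128 needs for a non-negative integer."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def choose_encoding(values, max_entries=255, dict_budget=32 * 1024):
    """Pick an encoding tag for a list of serialized cells (bytes or None)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return ALL_NULL
    distinct = set(non_null)
    if len(distinct) == 1:
        return CONST
    dict_bytes = sum(len(v) for v in distinct)
    # Dictionary tracking is abandoned past the cardinality or byte budget.
    if len(distinct) > max_entries or dict_bytes > dict_budget:
        return PLAIN
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2.
    bit_width = (len(distinct) - 1).bit_length()
    index_bytes = (len(non_null) * bit_width + 7) // 8
    dict_cost = varint_len(len(distinct)) + dict_bytes + index_bytes
    plain_cost = sum(len(v) for v in non_null)
    return DICT if dict_cost < plain_cost else PLAIN
```

A column of 200 distinct single-byte values illustrates the cost check: the dictionary plus 8-bit indices would be larger than the raw buffer, so the sketch falls back to PLAIN even though the cardinality is under 255.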
Bit-packed Dictionary Indices
Dictionary indices are bit-packed using bitWidth = ceil(log2(numEntries)) bits per
non-null cell, packed LSB-first within each byte. The reader derives bitWidth from
numEntries (already stored in dict metadata).
Examples: 2 distinct values → 1 bit/cell, 4 → 2 bits, 16 → 4 bits, 255 (the DICT maximum) → 8 bits.
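A small sketch of LSB-first bit-packing follows. It assumes that the first cell occupies the least significant bits of the first byte and that indices may straddle byte boundaries; the text does not spell out either detail, so treat both as assumptions. `pack_indices` and `unpack_indices` are illustrative names.

```python
def pack_indices(indices, bit_width):
    """Pack dictionary indices LSB-first, bit_width bits per cell."""
    out = bytearray((len(indices) * bit_width + 7) // 8)
    bit_pos = 0
    for index in indices:
        for i in range(bit_width):
            if (index >> i) & 1:
                out[bit_pos >> 3] |= 1 << (bit_pos & 7)
            bit_pos += 1
    return bytes(out)


def unpack_indices(data, count, bit_width):
    """Inverse of pack_indices: read `count` cells of bit_width bits each."""
    result = []
    bit_pos = 0
    for _ in range(count):
        value = 0
        for i in range(bit_width):
            if (data[bit_pos >> 3] >> (bit_pos & 7)) & 1:
                value |= 1 << i
            bit_pos += 1
        result.append(value)
    return result


packed = pack_indices([1, 0, 3, 2], bit_width=2)
```

With four entries in the dictionary, four cells fit in a single byte, and `unpack_indices(packed, 4, 2)` restores the original indices.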
Bucket Internal Structure
Each bucket stores column data in one of two modes, chosen automatically based on the uncompressed data size. The mode determines how compression is applied.
Monolithic Mode
When the average column page size is smaller than 32 KB (configurable via
page_size_threshold), the entire bucket is compressed as a single zstd block.
Individual column pages that are too small yield poor zstd compression ratios,
so monolithic compression is more efficient in this case.
Paged Mode
When the average column page size is ≥ 32 KB, the bucket switches to paged mode.
The bucket begins with a fixed-length page directory followed by self-describing,
independently compressed column slots. The directory size is deterministic from the schema
(num_columns_in_bucket × 4 bytes), enabling projection queries to read only
the target columns' data with as few as 2 range-read operations on remote storage (one for the directory, one for the coalesced slots).
Page Directory
The directory is an array of num_columns_in_bucket entries, each a 4-byte u32
(little-endian) representing the total on-disk slot size for that column. A value of 0
means the column is ALL_NULL and has no on-disk data. The directory size is deterministic:
num_columns_in_bucket × 4 bytes, computable from the schema alone.
Column Slot Format
Each non-ALL_NULL column has a slot on disk immediately after the directory:
On-disk slot:
uncompressed_size (varint, uncompressed prefix)
compressed_data (zstd compressed page_content)
page_content (after decompression):
encoding (1 byte: PLAIN=0, CONST=1, DICT=2)
flags (1 byte: bit 0 = has_nulls)
[meta] (encoding-specific, see below)
[data] (null bitmap if has_nulls, then column data)
Page Content by Encoding
| Encoding | On-Disk Slot? | page_content layout |
|---|---|---|
| ALL_NULL | No (size=0) | — |
| CONST (no nulls) | Yes (tiny) | encoding + flags + const_value |
| CONST (has nulls) | Yes | encoding + flags + const_value + null_bitmap |
| DICT | Yes | encoding + flags + dict_table + [null_bitmap] + bit-packed indices |
| PLAIN | Yes | encoding + flags + [null_bitmap] + raw column data |
Projected Read Path
- Compute dir_size = num_columns_in_bucket × 4 (known from schema)
- Range-read the directory from bucket_offset
- For each projected column, compute slot offset via prefix-sum of directory entries
- Range-read only the projected columns' slots (merge adjacent slots into a single IO)
- For each slot: parse uncompressed_size varint, then zstd::decompress
- Parse page_content: encoding, flags, meta, data → build column reader
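The offset computation and IO merging in the steps above can be sketched as below. The sketch covers only the planning part (prefix sums and range coalescing), not decompression, and it ignores the child header that ARRAY columns add in front of the directory. `plan_slot_reads` is a hypothetical helper; the returned ranges are relative to bucket_offset.

```python
import struct


def plan_slot_reads(directory_bytes, projected):
    """Return merged (start, end) byte ranges covering the projected slots.

    directory_bytes: the raw page directory (u32 little-endian slot sizes).
    projected: indices of the wanted columns within this bucket.
    """
    num_columns = len(directory_bytes) // 4
    sizes = struct.unpack("<%dI" % num_columns, directory_bytes)
    # Slots start right after the directory; offsets are a prefix sum.
    offsets = []
    running = num_columns * 4
    for size in sizes:
        offsets.append(running)
        running += size
    ranges = []
    for col in sorted(set(projected)):
        if sizes[col] == 0:
            continue  # ALL_NULL column: nothing on disk
        start, end = offsets[col], offsets[col] + sizes[col]
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)  # adjacent: extend previous IO
        else:
            ranges.append((start, end))
    return ranges


directory = struct.pack("<4I", 10, 0, 20, 5)
```

Note that a zero-size ALL_NULL entry between two projected columns does not break adjacency, since it occupies no bytes on disk.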
Monolithic vs Paged Signaling
Each bucket in the row group index is described by a pair
(compressed_size, bulk_decompress_size).
This pair encodes three layout variants with zero additional bytes:
| Condition | Layout | Meaning |
|---|---|---|
| compressed_size == 0 | Empty | No data on disk for this bucket; skip entirely. |
| compressed_size > 0 && bulk_decompress_size > 0 | Monolithic | The on-disk blob is a single compressed block. bulk_decompress_size is the decompressed size (used to allocate the output buffer before decompression). |
| compressed_size > 0 && bulk_decompress_size == 0 | Paged | The on-disk content is [directory (num_cols × u32le slot sizes)] followed by per-column compressed slots. Each slot is independently decompressible. |
This encoding is unambiguous: a non-empty monolithic bucket always has
bulk_decompress_size > 0 because a decompressed payload
cannot be zero bytes. The combination
compressed_size == 0 && bulk_decompress_size != 0
is invalid and must be rejected by the reader.
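The three-way decision can be written as a short classifier. This is a sketch of the rule as stated here; the function name and the string labels are illustrative, and `ValueError` merely stands in for whatever error type a real reader uses.

```python
def classify_bucket(compressed_size, bulk_decompress_size):
    """Classify a bucket entry from the row group index."""
    if compressed_size == 0:
        if bulk_decompress_size != 0:
            # Invalid combination: a reader must reject it.
            raise ValueError("empty bucket with non-zero bulk_decompress_size")
        return "empty"
    if bulk_decompress_size > 0:
        return "monolithic"
    return "paged"
```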
Validation Invariants
- Paged buckets require compression == ZSTD.
- The paged directory size is num_cols × 4 bytes; dir_size + sum(slot_sizes) == compressed_size must hold exactly.
- All varint-encoded sizes (compressed_size, bulk_decompress_size) and u32 LE slot sizes must fit in u32; values exceeding u32::MAX are rejected at write time.
Compression
Both bucket data and the schema block support compression:
| ID | Name | Description |
|---|---|---|
| 0 | None | No compression |
| 1 | Zstd | Zstandard compression (default level 1) |
In monolithic mode, compression is applied to the entire bucket as one block. In paged mode, the page directory is uncompressed (fixed-length, enabling direct offset computation), while each column slot is independently zstd-compressed. Paged mode is only used when the compression method is Zstd.
Row Groups
Large files are split into row groups to bound memory usage during writing.
Each row group contains up to row_group_max_size bytes of uncompressed bucket data
(default: 256 MB). The row group index, which the footer locates via indexOffset, records offsets and sizes for each
bucket in each row group, enabling random access to any row group.
Footer (32 bytes, big-endian)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | indexOffset | Absolute offset of Row Group Index |
| 8 | 8 | schemaBlockOffset | Absolute offset of Schema Block |
| 16 | 4 | numBuckets | Total number of buckets |
| 20 | 4 | numRowGroups | Total number of row groups |
| 24 | 1 | compression | 0 = none, 1 = zstd |
| 25 | 1 | version | Format version (currently 1) |
| 26 | 2 | (reserved) | Padding, set to 0 |
| 28 | 4 | magic | MOSA (0x4D4F5341) |
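A hedged sketch of parsing the footer with Python's `struct` module follows. It assumes the offsets and counts are unsigned, which the table does not state explicitly; `Footer` and `parse_footer` are illustrative names.

```python
import struct
from collections import namedtuple

Footer = namedtuple(
    "Footer",
    "index_offset schema_block_offset num_buckets num_row_groups compression version",
)

# Big-endian: two u64 offsets, two u32 counts, two u8 fields,
# 2 reserved bytes, then the 4-byte magic. Total: 32 bytes.
FOOTER_STRUCT = struct.Struct(">QQIIBBH4s")


def parse_footer(tail):
    """Parse the last 32 bytes of a file into a Footer tuple."""
    if len(tail) != FOOTER_STRUCT.size:
        raise ValueError("footer must be exactly 32 bytes")
    (index_offset, schema_offset, num_buckets, num_row_groups,
     compression, version, _reserved, magic) = FOOTER_STRUCT.unpack(tail)
    if magic != b"MOSA":
        raise ValueError("bad magic")
    return Footer(index_offset, schema_offset, num_buckets,
                  num_row_groups, compression, version)


sample = struct.pack(">QQIIBBH4s", 4096, 2048, 100, 3, 1, 1, 0, b"MOSA")
footer = parse_footer(sample)
```

In practice the 32 bytes would come from seeking to the end of the file (for example `f.seek(-32, os.SEEK_END)`) or from a single range read on remote storage.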
Row Group Index
Varint-encoded, only non-empty buckets are stored. For each row group:
varint numRows
varint nonEmptyCount
repeated nonEmptyCount times:
varint bucketId
8 bytes bucketOffset (big-endian, absolute file offset)
varint compressedSize (total bytes: monolithic blob or directory + column slots)
varint bulkDecompressSize (> 0 = monolithic, = 0 = paged)
--- Column Statistics (appended after bucket entries) ---
varint numStats (0 if no stats configured)
repeated numStats times:
varint columnIndex (global column index)
varint nullCount
[if nullCount < numRows]:
value minValue (serialized using standard value encoding)
value maxValue (serialized using standard value encoding)
Empty buckets (no data) are omitted entirely, saving space for sparse schemas.
Column Statistics
Mosaic supports optional per-column min/max statistics at row group granularity, enabling filter pushdown: query engines can skip entire row groups whose value range does not overlap with a filter predicate.
- Opt-in: Statistics are only collected for columns specified in WriterOptions.stats_columns. By default, no stats are built.
- Zero overhead when disabled: When no stats columns are configured, each row group adds only 1 byte (a varint 0) to the row group index.
0) to the row group index. - Supported types: All orderable types — numeric (BOOLEAN through DOUBLE), DATE, TIME, TIMESTAMP, compact DECIMAL, and string types (CHAR, VARCHAR, STRING).
- Storage: Stats are stored inline in the row group index after each row group's bucket entries.
Filter Pushdown
Query engines can use column statistics to skip entire row groups whose min/max range does not
overlap with a filter predicate. For example, a filter age > 50 can skip any
row group where max(age) ≤ 50.
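The skip decision for a `column > threshold` predicate might look like the sketch below. The tuple layout for the statistics and the function name are hypothetical, and the all-null branch assumes SQL-style semantics where a comparison against null never matches.

```python
def can_skip_greater_than(stats, num_rows, threshold):
    """Decide whether a row group can be skipped for `column > threshold`.

    stats: (null_count, min_value, max_value), or None when the column
    has no statistics in this row group.
    """
    if stats is None:
        return False  # no stats: the row group must be read
    null_count, _min_value, max_value = stats
    if null_count >= num_rows:
        # Assumption: an all-null column can never satisfy the predicate.
        # min/max are absent in this case.
        return True
    return max_value <= threshold
```

For the `age > 50` example, a row group with statistics `(0, 18, 50)` can be skipped, while one with `(0, 18, 51)` cannot.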
Schema Block
Prefixed with a 4-byte big-endian int (uncompressed size), followed by the schema data (compressed with the file's compression method).
Columns are serialized in name-sorted order. Column names are compressed using one of two encodings, chosen dynamically by the writer based on which produces smaller output:
- Front coding (mode 0): Each name shares a prefix with the previous name; only the suffix is stored.
- BPE + front coding (mode 1): Byte Pair Encoding is applied first to compress repeated substrings across column names (e.g., _status, _value), then front coding is applied to the BPE-encoded names. BPE uses token bytes 0x80–0xFF (up to 128 merge rules), and is only applicable when all column names are ASCII-only.
Schema Block Layout
varint numColumns
varint numBuckets
1 byte nameEncoding (0 = front coding, 1 = BPE + front coding)
--- if nameEncoding == 1 (BPE) ---
varint numRules
repeated numRules times:
1 byte left (left token of merge rule)
1 byte right (right token of merge rule)
--- per column (repeated numColumns times, name-sorted order) ---
varint sharedPrefixLen (bytes shared with previous column name)
varint suffixLen (bytes of new suffix)
bytes suffix (suffixLen bytes, raw or BPE-encoded)
TypeDescriptor
--- original column order (delta + zigzag encoded) ---
repeated numColumns times:
zigzag_varint delta (sorted position delta from previous; first relative to 0)
The first column has sharedPrefixLen = 0. To reconstruct a column name,
take the first sharedPrefixLen bytes from the previous name and append
the suffix. If BPE is used, decode the reconstructed byte sequence by recursively
expanding tokens ≥ 0x80 using the merge rules.
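Name reconstruction can be sketched in two steps: undo the front coding, then expand BPE tokens. The mapping "token 0x80 + i refers to merge rule i" is an assumption made for illustration, as the text does not state how token bytes are numbered. Both function names are hypothetical.

```python
def decode_front_coded(entries):
    """entries: list of (shared_prefix_len, suffix_bytes) in stored order."""
    names = []
    previous = b""
    for shared_prefix_len, suffix in entries:
        if shared_prefix_len > len(previous):
            raise ValueError("shared prefix longer than previous name")
        current = previous[:shared_prefix_len] + suffix
        names.append(current)
        previous = current
    return names


def expand_bpe(data, rules):
    """Expand tokens >= 0x80 using merge rules given as (left, right) pairs.

    Assumption: token 0x80 + i names rules[i], and a rule only refers to
    plain bytes or to earlier rules.
    """
    out = bytearray()
    stack = list(reversed(data))  # process bytes left to right
    while stack:
        token = stack.pop()
        if token >= 0x80:
            left, right = rules[token - 0x80]
            stack.append(right)
            stack.append(left)
        else:
            out.append(token)
    return bytes(out)


names = decode_front_coded([(0, b"sensor_hum"), (7, b"temp_1"), (12, b"2")])
rules = [(ord("_"), ord("s")), (0x80, ord("t"))]  # 0x80 = "_s", 0x81 = "_st"
```

When mode 1 is in use, each name returned by `decode_front_coded` would be passed through `expand_bpe` afterwards, matching the order described above.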
Columns are stored on disk in name-sorted order for front-coding compression. The original (user-defined) column order is preserved via a delta+zigzag-encoded permutation at the end of the schema block. When reading without an explicit projection, columns are returned in their original input order. The delta encoding produces long runs of +1 for locally-ordered column groups, which compress extremely well under zstd.
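The order permutation can be illustrated with the sketch below. It operates on integer deltas and leaves out the varint byte layer; the helper names are hypothetical. It reads the layout as "for each column in original order, store the change in sorted position relative to the previous column".

```python
def zigzag_encode(n):
    """Map signed to unsigned: 0, -1, 1, -2, ... become 0, 1, 2, 3, ..."""
    return n * 2 if n >= 0 else -n * 2 - 1


def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)


def encode_order(sorted_positions):
    """sorted_positions[i] = sorted position of the i-th original column."""
    deltas, previous = [], 0
    for position in sorted_positions:
        deltas.append(zigzag_encode(position - previous))
        previous = position
    return deltas


def decode_order(zigzag_deltas):
    positions, previous = [], 0
    for z in zigzag_deltas:
        previous += zigzag_decode(z)
        positions.append(previous)
    return positions


# User-defined order: id, zone, alpha, beta.
# Name-sorted order:  alpha(0), beta(1), id(2), zone(3).
sorted_names = ["alpha", "beta", "id", "zone"]
deltas = encode_order([2, 3, 0, 1])
original = [sorted_names[p] for p in decode_order(deltas)]
```

A schema whose columns were already declared in sorted order encodes as a 0 followed by a run of +1 deltas (zigzag value 2), which is the highly compressible case mentioned above.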
TypeDescriptor
1 byte typeId
1 byte nullable (0 = not null, 1 = nullable)
[type-specific params]
| typeId | Type | Params |
|---|---|---|
| 0 | BOOLEAN | (none) |
| 1 | TINYINT | (none) |
| 2 | SMALLINT | (none) |
| 3 | INTEGER | (none) |
| 4 | BIGINT | (none) |
| 5 | FLOAT | (none) |
| 6 | DOUBLE | (none) |
| 7 | DATE | (none) |
| 8 | CHAR | varint length |
| 9 | VARCHAR | varint length |
| 10 | STRING | (none) — VARCHAR with MAX_LENGTH |
| 11 | BINARY | varint length |
| 12 | VARBINARY | varint length |
| 13 | BYTES | (none) — VARBINARY with MAX_LENGTH |
| 14 | DECIMAL | varint precision, varint scale |
| 15 | TIME | varint precision |
| 16 | TIMESTAMP | varint precision |
| 17 | TIMESTAMP_LTZ | varint precision, varint timezoneLength, bytes timezone |
| 18 | ARRAY | varint nameLength, bytes name (element field name), TypeDescriptor (recursive element type) |
| 19 | MAP | varint entriesNameLen + bytes entriesName, varint keyNameLen + bytes keyName + TypeDescriptor (key), varint valNameLen + bytes valName + TypeDescriptor (value). Sorted MAP is not supported; always unsorted. |
ROW, VARIANT, and BLOB are not yet supported.
Value Serialization
Values are serialized in the same format for PLAIN data, CONST metadata, and DICT entries:
| Type | Encoding |
|---|---|
| BOOLEAN | 1 byte (0 or 1) |
| TINYINT | 1 byte |
| SMALLINT | 2 bytes big-endian |
| INTEGER / DATE / TIME | 4 bytes big-endian |
| BIGINT | 8 bytes big-endian |
| FLOAT | 4 bytes IEEE 754 (big-endian) |
| DOUBLE | 8 bytes IEEE 754 (big-endian) |
| DECIMAL (compact, precision ≤ 18) | 8 bytes big-endian (unscaled long) |
| DECIMAL (large, precision > 18) | varint length + unscaled BigInteger bytes |
| TIMESTAMP (precision ≤ 3) | 8 bytes (epoch millis, big-endian) |
| TIMESTAMP (precision 4–6) | 8 bytes (epoch micros, big-endian) |
| TIMESTAMP (precision > 6) | 8 bytes (epoch millis) + 4 bytes (nanos of millis) |
| CHAR / VARCHAR / STRING | varint length + UTF-8 bytes |
| BINARY / VARBINARY / BYTES | varint length + raw bytes |
| ARRAY | Flattened columnar: lengths (INT32) + values column (see ARRAY Type Storage below) |
| MAP | Flattened columnar: lengths (INT32) + keys column + values column (see MAP Type Storage below) |
Readers expose TIMESTAMP precision > 6 as Arrow Timestamp(Nanosecond, timezone). This keeps the 12-byte physical encoding unchanged, but precisions 7 and 8 are normalized to Arrow nanosecond units in the external schema because Arrow timestamp units do not preserve decimal timestamp precision separately. Existing 12-byte timestamp values outside Arrow's signed 64-bit epoch-nanosecond range cannot be represented through this Arrow API; readers return InvalidData for those values, and legacy Struct timestamp writes reject them before writing.
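The range check for the 12-byte encoding can be sketched as follows. It assumes both fields are big-endian and that nanos-of-millis lies in 0..999,999; neither is stated outright in the table. `ValueError` stands in for the reader's InvalidData error, and the function name is illustrative.

```python
import struct

I64_MIN, I64_MAX = -(1 << 63), (1 << 63) - 1


def timestamp12_to_epoch_nanos(buf):
    """Convert 8-byte epoch millis + 4-byte nanos-of-millis to epoch nanos."""
    if len(buf) != 12:
        raise ValueError("expected 12 bytes")
    # Assumption: both fields are big-endian signed integers.
    millis, nanos_of_milli = struct.unpack(">qi", buf)
    if not 0 <= nanos_of_milli < 1_000_000:
        raise ValueError("nanos-of-millis out of range")
    epoch_nanos = millis * 1_000_000 + nanos_of_milli
    if not I64_MIN <= epoch_nanos <= I64_MAX:
        # Not representable as a signed 64-bit nanosecond timestamp.
        raise ValueError("timestamp outside the epoch-nanosecond range")
    return epoch_nanos
```

Python integers do not overflow, so the multiplication is exact and the bounds check is the only place where the 64-bit limit is enforced.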
ARRAY Type Storage
ARRAY columns use a flattened columnar storage layout. Each ARRAY column is decomposed into physical columns within the same bucket — a lengths column and a values column — both first-class columns that benefit from standard column encoding (DICT, CONST, PLAIN).
Decomposition
An ARRAY<T> column with N rows is stored as:
- Lengths column (INT32, N entries): the number of elements in each array. Null arrays are represented by a null in the lengths column (contributing 0 elements). Empty arrays ([]) have length 0 and are non-null.
- Values column (type T, M entries): all elements from all rows flattened into a single contiguous column, where M = sum of all non-null lengths. Element-level nulls are tracked by this column’s own null bitmap.
Both columns independently go through the standard encoding selection (DICT, CONST, PLAIN, ALL_NULL), enabling dictionary compression of element values across all arrays in the column.
Example
Input rows: [1, 2, 3], null, [1, 2], []
Lengths column (INT32): 3, null, 2, 0 ← standard INT32 encoding (DICT/CONST/PLAIN)
Values column (INT32): 1, 2, 3, 1, 2 ← standard INT32 encoding (DICT: {1→0, 2→1, 3→2})
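The decomposition and its inverse can be sketched with plain Python lists, using `None` for null. The function names are illustrative, and the sketch stops before column encoding (DICT/CONST/PLAIN) is applied.

```python
def flatten_arrays(rows):
    """Split ARRAY rows into a lengths column and a flat values column."""
    lengths, values = [], []
    for row in rows:
        if row is None:
            lengths.append(None)  # null array: contributes no elements
        else:
            lengths.append(len(row))
            values.extend(row)
    return lengths, values


def assemble_arrays(lengths, values):
    """Rebuild the rows by slicing the values column by each length."""
    rows, cursor = [], 0
    for length in lengths:
        if length is None:
            rows.append(None)
        else:
            rows.append(values[cursor:cursor + length])
            cursor += length
    return rows


lengths, values = flatten_arrays([[1, 2, 3], None, [1, 2], []])
```

The null row and the empty row stay distinguishable after the round trip because only the former is null in the lengths column.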
Nested Arrays
ARRAY<ARRAY<INT>> is stored recursively. The outer values column
is itself an ARRAY<INT>, which decomposes into its own lengths + values pair.
All leaf INT values across all nesting levels end up in a single INT32 column,
sharing one dictionary across the entire column.
ARRAY<ARRAY<INT>> with rows: [[1,2], [3]], [[1,2]]
Outer lengths (INT32): 2, 1 ← 2 inner arrays, 1 inner array
Inner lengths (INT32): 2, 1, 2 ← element counts of inner arrays
Leaf values (INT32): 1, 2, 3, 1, 2 ← all INTs in one column, shared DICT
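The recursive case can be sketched by peeling one nesting level at a time. `shred` and its `depth` parameter are hypothetical; `depth` is the number of ARRAY levels above the leaf type.

```python
def shred(rows, depth):
    """Return [lengths_level_0, ..., lengths_level_{depth-1}, leaf_values]."""
    if depth == 0:
        return [list(rows)]  # leaf level: rows are the values themselves
    lengths, children = [], []
    for row in rows:
        if row is None:
            lengths.append(None)
        else:
            lengths.append(len(row))
            children.extend(row)
    # The flattened children form the next level's rows.
    return [lengths] + shred(children, depth - 1)


columns = shred([[[1, 2], [3]], [[1, 2]]], depth=2)
```

For the example above this yields the outer lengths, the inner lengths, and a single leaf column, in that order.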
Bucket Internal Format
ARRAY columns are expanded into physical columns within the same bucket.
An ARRAY<T> becomes two physical columns: a lengths column (INT32)
and a values column (type T). Both are first-class columns in the bucket —
they share the same encoding flags, null bitmaps, and column data sections
as all other columns. There is no sub-bucket or nested container.
Monolithic Bucket with ARRAY Columns
varint numPrimary (number of logical/primary columns = N)
varint numChildren (number of child value columns = C; 0 if no ARRAY)
repeated C times:
varint childElementCount (total element count for each child column)
[encoding flags: 2 bits × (N + C) columns]
[has-nulls flags: 1 bit × (N + C) columns]
[CONST metadata for all N + C columns]
[DICT metadata for all N + C columns]
[null bitmaps: primary columns use ceil(numRows/8), child columns use ceil(childElementCount/8)]
[column data for all N + C columns]
Paged Bucket with ARRAY Columns
The page directory includes entries for all N + C physical columns. Each column (including child value columns) has its own independently compressed slot. A fixed-size child header precedes the directory:
u16 numChildren (LE)
repeated numChildren times:
u32 childElementCount (LE)
[directory: (N + C) × u32 LE slot sizes]
[column slots: each independently compressed, including child columns]
Statistics
ARRAY and MAP columns do not support min/max statistics (no meaningful ordering).
MAP Type Storage
MAP columns use the same flattened columnar approach as ARRAY.
A MAP<K, V> column is decomposed into three physical columns
within the same bucket:
- Lengths column (INT32, N entries): the number of key-value pairs in each map.
- Keys column (type K, M entries): all keys flattened across all rows.
- Values column (type V, M entries): all values flattened across all rows.
All three columns independently benefit from standard column encoding (DICT, CONST, PLAIN, ALL_NULL). Null maps are represented by a null in the lengths column. Empty maps have length 0.
Example
MAP<INT, UTF8> with rows: {1:"a", 2:"b"}, null, {3:"c"}
Lengths column (INT32): 2, null, 1
Keys column (INT32): 1, 2, 3 ← shared DICT across all maps
Values column (UTF8): "a", "b", "c" ← shared DICT across all maps
Design Rationale: Comparison with Parquet and ORC
There are two mainstream approaches to storing nested ARRAY types in columnar formats. Mosaic follows the ORC-style approach (lengths + child columns) rather than the Parquet/Dremel-style approach (repetition/definition levels).
Mosaic vs ORC
Both Mosaic and ORC decompose ARRAY<T> into a lengths stream and a
child column. The key differences:
| Aspect | Mosaic | ORC |
|---|---|---|
| Lengths encoding | DICT / CONST / PLAIN — all-same-length arrays use CONST (near zero bytes) | Run-Length Encoding (RLE v1/v2) |
| Child column encoding | DICT / CONST / PLAIN — shared dictionary across all arrays | DICT / DIRECT / RLE — same cross-array sharing |
| Column placement | Lengths and values are physical columns within the same bucket | Lengths and values are independent streams within the same stripe |
| Null representation | Lengths column null bitmap (shared infrastructure) | Separate PRESENT stream (Boolean RLE) |
The designs are structurally equivalent. Mosaic’s CONST encoding for lengths can be more compact than ORC’s RLE when all arrays have the same length (a common case in feature vectors and fixed-size embeddings).
Why Not Parquet’s Repetition/Definition Levels?
Parquet uses the Dremel encoding: each leaf column carries repetition and definition
level arrays that encode the full nesting structure. For ARRAY<primitive>,
this means one physical leaf column; for nested types like ARRAY<MAP<K,V>>,
each leaf (K and V) gets its own column with deeper rep/def levels. The physical column
count equals the number of leaf fields, not the number of ARRAY/MAP columns.
This approach has some advantages but significant disadvantages:
- Implementation complexity: The shredding (decomposition) and assembly (reconstruction) algorithms for rep/def levels are substantially more complex than the lengths + child approach, especially for multi-level nesting. Correct handling of nulls at different nesting levels requires careful tracking of definition level thresholds.
- Overhead for common cases: Most real-world schemas use at most 1–2 levels of ARRAY nesting. The rep/def approach adds per-value overhead (two extra integers per leaf value) that only pays off at deep nesting levels (≥ 3), which are rare in practice.
- No encoding benefit: Rep/def levels themselves need encoding (typically RLE or bit-packing). The lengths + child approach achieves equivalent compression by applying standard column encodings to both the lengths column (which captures the same structural information) and the values column.
The ORC-style approach was chosen for its simplicity, debuggability, and natural fit with Mosaic’s bucket architecture where each column is independently encoded and compressed.
Varint Encoding
Unsigned 32-bit integers are encoded as 1–5 bytes using LEB128. Each byte contributes 7 data bits; the high bit indicates whether more bytes follow (1 = more, 0 = last byte).
0 → 0x00 (1 byte)
127 → 0x7F (1 byte)
128 → 0x80 0x01 (2 bytes)
16383 → 0xFF 0x7F (2 bytes)
16384 → 0x80 0x80 0x01 (3 bytes)
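A minimal LEB128 encoder and decoder for unsigned 32-bit values, matching the examples above, could look like this. The function names and the choice of `ValueError` are illustrative.

```python
def encode_varint(value):
    """Encode an unsigned 32-bit integer as 1 to 5 LEB128 bytes."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in u32")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    for i in range(5):
        byte = buf[pos + i]
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            if result > 0xFFFFFFFF:
                raise ValueError("varint exceeds u32")
            return result, pos + i + 1
    raise ValueError("varint longer than 5 bytes")
```

`decode_varint` returns the position after the value, so consecutive varints (as in the row group index) can be read by feeding the returned position back in.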
Limitations
- Complex types MULTISET and ROW are not yet supported. ARRAY and MAP are supported with flattened columnar storage.
- Mosaic format is designed for wide tables and may not be efficient for narrow tables with few columns.