Backends¶

versionable offers several backends, each striking a different balance between human-readability, interoperability with other tools, and performance with large binary data.

You never have to instantiate a backend directly — versionable picks the right one automatically based on the file extension you pass to save() / load(). If you use .json, you get JSON. If you use .toml, you get TOML. The same object can be saved and loaded with different backends just by changing the filename extension, which makes it easy to migrate between formats or write tests against a lighter-weight backend than you use in production.

| Extension | Backend | Best for |
| --- | --- | --- |
| `.yaml`, `.yml` | `YamlBackend` | Config files, data-science workflows |
| `.json` | `JsonBackend` | Simple data, interoperability |
| `.toml` | `TomlBackend` | Human-editable config files |
| `.h5`, `.hdf5` | `Hdf5Backend` | Large numpy arrays, lazy loading |
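
The extension-to-backend mapping in the table can be pictured as a simple lookup. The sketch below is illustrative only: `BACKEND_BY_EXTENSION` and `backendNameFor` are hypothetical names, not part of versionable's API, and the library's real dispatch logic may differ.

```python
from pathlib import Path

# Hypothetical lookup mirroring the table above; not versionable's real registry.
BACKEND_BY_EXTENSION = {
    ".yaml": "YamlBackend",
    ".yml": "YamlBackend",
    ".json": "JsonBackend",
    ".toml": "TomlBackend",
    ".h5": "Hdf5Backend",
    ".hdf5": "Hdf5Backend",
}


def backendNameFor(path: str) -> str:
    """Return the backend name for a filename, based only on its extension."""
    suffix = Path(path).suffix
    try:
        return BACKEND_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"no backend registered for extension {suffix!r}") from None


print(backendNameFor("config.yaml"))
print(backendNameFor("recording.h5"))
```

Keeping the mapping in one table is what makes "change the extension, change the format" possible: the calling code never names a backend.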

All backends store the same schema metadata (object, version, hash inside the __versionable__ envelope) alongside your data, so load() can validate the schema and apply migrations regardless of which backend wrote the file.

Feature comparison¶

| Feature | YAML | JSON | TOML | HDF5 |
| --- | --- | --- | --- | --- |
| Human-readable | Yes | Yes | Yes | No |
| `None` / `null` | Yes | Yes | No | Yes |
| Comment-out defaults | Yes | No | Yes | No |
| Nested objects | Yes | Yes | Yes | Yes |
| Large numpy arrays | Slow | Slow | Slow | Fast |
| Lazy loading | No | No | No | Yes |
| Hand-editable | Good | Fair | Best | No |
| External tool support | Wide | Wide | Good | Niche |

YAML¶

YAML is a good choice when you want human-readable files with support for comments (added by hand), null values, and a syntax that is already familiar in data-science and DevOps workflows. Unlike TOML, YAML handles None natively — fields with None survive the round-trip without any special treatment.

versionable.save(config, "config.yaml")
loaded = versionable.load(SensorConfig, "config.yaml")

Produces:

name: probe-A
sampleRate_Hz: 120000
channels:
  - 0
  - 1
  - 2
__versionable__:
  object: SensorConfig
  version: 1
  hash: 9d6951

Both .yaml and .yml extensions are supported.

Metadata is stored in a __versionable__ mapping at the end of the file — your data comes first, schema metadata stays out of the way.

Missing fields¶

Any field absent from the file is filled in from the dataclass default on load. This means older files with fewer fields load cleanly as new fields are added to the schema (as long as those fields have defaults).
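
The default-filling behaviour described above can be sketched with the standard `dataclasses` module. This is an illustration under stated assumptions, not versionable's loader: `DemoConfig` and `fromPartialDict` are hypothetical names, and the error raised for a missing required field is chosen only for the example.

```python
from dataclasses import MISSING, dataclass, field, fields


@dataclass
class DemoConfig:  # hypothetical stand-in for a Versionable dataclass
    name: str = "sensor"
    sampleRate_Hz: int = 48000
    channels: list[int] = field(default_factory=lambda: [0, 1])


def fromPartialDict(cls, data: dict):
    """Build cls from data, taking absent fields from the dataclass defaults."""
    kwargs = {}
    for f in fields(cls):
        if f.name in data:
            kwargs[f.name] = data[f.name]
        elif f.default is not MISSING:
            kwargs[f.name] = f.default
        elif f.default_factory is not MISSING:
            kwargs[f.name] = f.default_factory()
        else:
            # Illustrative error type; a field without a default cannot be filled.
            raise KeyError(f"required field {f.name!r} is missing")
    return cls(**kwargs)


# An "older" file that predates the channels field still loads.
loaded = fromPartialDict(DemoConfig, {"name": "probe-A", "sampleRate_Hz": 120000})
print(loaded)
```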

Comment-Out Defaults¶

Pass commentDefaults=True when saving to comment out fields whose values match the class default. This is useful for config files where you want users to see all available options without all of them being “active”:

versionable.save(config, "config.yaml", commentDefaults=True)

name: probe-A
sampleRate_Hz: 120000
# channels:
# - 0
# - 1
# - 2
__versionable__:
  object: SensorConfig
  version: 1
  hash: 9d6951

JSON¶

JSON is the most common choice when the file will be read by tools outside of Python — a web service, a JavaScript front-end, or a data pipeline that expects a standard format. It handles all primitive types, lists, and nested objects, and the output is human-readable even if not particularly easy to hand-edit.

versionable.save(config, "config.json")
loaded = versionable.load(SensorConfig, "config.json")

The output includes schema metadata alongside the data:

{
  "__versionable__": {
    "object": "SensorConfig",
    "version": 1,
    "hash": "9d6951"
  },
  "name": "probe-A",
  "sampleRate_Hz": 120000,
  "channels": [0, 1, 2]
}
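
Because the output is ordinary JSON, a consumer that does not use versionable can still read it. The sketch below uses only the standard library, with the document above inlined as a string; `splitEnvelope` is a hypothetical helper name, not a versionable function.

```python
import json

# The JSON document shown above, inlined so the example is self-contained.
text = """
{
  "__versionable__": {"object": "SensorConfig", "version": 1, "hash": "9d6951"},
  "name": "probe-A",
  "sampleRate_Hz": 120000,
  "channels": [0, 1, 2]
}
"""


def splitEnvelope(payload: dict) -> tuple[dict, dict]:
    """Separate the __versionable__ metadata from the user data."""
    data = dict(payload)  # copy so the caller's dict is left untouched
    envelope = data.pop("__versionable__", {})
    return envelope, data


envelope, data = splitEnvelope(json.loads(text))
print(envelope["object"], envelope["version"])
print(data)
```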

TOML¶

TOML is the best choice for configuration files that users will open and edit by hand. The format is designed to be obvious at a glance, supports comments (via commentDefaults), and maps cleanly to nested sections. If your dataclass represents application settings that ship with the software and users are expected to tweak, prefer TOML over JSON.

versionable.save(config, "config.toml")
loaded = versionable.load(SensorConfig, "config.toml")

Produces human-readable TOML:

name = "probe-A"
sampleRate_Hz = 120000
channels = [0, 1, 2]

[__versionable__]
object = "SensorConfig"
version = 1
hash = "9d6951"

Fields come first deliberately — if a user opens the file to hand-edit a value, the data is right at the top and the schema metadata stays out of the way at the bottom.

Missing fields and None values¶

TOML is flexible about missing keys — any field absent from the file is silently filled in from the dataclass default on load. This means you can hand-edit a config file and freely delete any line whose value you want to reset to default, and it will just work. It also means older files with fewer fields load cleanly as new fields are added to the schema (as long as those fields have defaults).

We recommend that every field in a class saved to TOML defines a default value. Required fields (no default) work fine for new files, but they become a liability the moment a file is hand-edited, partially written, or migrated from an older schema version — any of which can leave the field absent, causing load to fail.

The one case to be careful about is None. TOML has no native null type, so a field holding None is omitted on save, exactly as if the key were missing. On load it is restored from the dataclass default, which is fine if a default exists. But for a required field (no default), None at save time means the field disappears from the file and cannot be recovered on load. JSON and YAML handle this safely because null is a first-class value that survives the round-trip. If your schema has required fields (no default) whose type allows None, prefer YAML or JSON.
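
The failure mode for required fields can be reproduced without any TOML library. The sketch below imitates the omit-on-save rule with a plain dict filter; `Probe` is a hypothetical class, and the exception versionable raises in this situation may differ from the plain `TypeError` shown here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Probe:  # hypothetical example class
    name: str
    label: Optional[str]  # required (no default), but may legitimately be None


original = Probe(name="probe-A", label=None)

# JSON keeps the key: None is written as null and read back as None.
viaJson = Probe(**json.loads(json.dumps(asdict(original))))

# TOML-style save: keys whose value is None are dropped entirely.
tomlStyle = {k: v for k, v in asdict(original).items() if v is not None}

try:
    Probe(**tomlStyle)
    recovered = True
except TypeError:
    # "label" has no default, so nothing can stand in for the missing key.
    recovered = False

print(viaJson == original, sorted(tomlStyle), recovered)
```

Giving `label` a default of `None` would make the last step succeed, which is the reason for the recommendation that every field saved to TOML define a default.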

Nested Versionable objects become native TOML tables. For example, given:

from dataclasses import dataclass, field
from versionable import Versionable

@dataclass
class RetryPolicy(Versionable, version=1, hash="f907a9"):
    retries: int = 3
    backoff_s: float = 1.0

@dataclass
class WorkerConfig(Versionable, version=1, hash="8bdfa7"):
    name: str = "worker"
    retry: RetryPolicy = field(default_factory=RetryPolicy)

The saved TOML looks like:

name = "worker"

[__versionable__]
object = "WorkerConfig"
version = 1
hash = "8bdfa7"

[retry]
retries = 3
backoff_s = 1.0

[retry.__versionable__]
object = "RetryPolicy"
version = 1
hash = "f907a9"

Each nested Versionable carries its own __versionable__ sub-table — the same shape as the root envelope.

Comment-Out Defaults¶

Pass commentDefaults=True to comment out fields whose values match the class default. This is useful for config files where you want users to see all available options without all of them being “active”:

from dataclasses import dataclass, field
import versionable
from versionable import Versionable

@dataclass
class SensorPreset(Versionable, version=1, hash="6f2809"):
    name: str = "sensor"
    sampleRate_Hz: int = 48000
    enabledChannels: list[int] = field(default_factory=lambda: [0, 1])

preset = SensorPreset(name="probe-A")  # only override name
versionable.save(preset, "preset.toml", commentDefaults=True)

name = "probe-A"
# sampleRate_Hz = 48000
# enabledChannels = [0, 1]

[__versionable__]
object = "SensorPreset"
version = 1
hash = "6f2809"

Fields at their default render as #-prefixed lines; users uncomment any line to override that default.
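
One way to picture what commentDefaults does is a renderer that compares each field with its default and prefixes matching lines with `#`. The sketch below is not versionable's implementation: `Preset`, `defaultOf`, and `renderWithCommentedDefaults` are hypothetical, only flat fields are handled, and values are formatted with `json.dumps`, which happens to yield valid TOML for the simple strings, integers, and integer lists used here.

```python
import json
from dataclasses import MISSING, dataclass, field, fields


@dataclass
class Preset:  # hypothetical; mirrors the fields of SensorPreset above
    name: str = "sensor"
    sampleRate_Hz: int = 48000
    enabledChannels: list[int] = field(default_factory=lambda: [0, 1])


def defaultOf(f):
    """Return a field's default value, or MISSING if it has none."""
    if f.default is not MISSING:
        return f.default
    if f.default_factory is not MISSING:
        return f.default_factory()
    return MISSING


def renderWithCommentedDefaults(obj) -> str:
    lines = []
    for f in fields(obj):
        value = getattr(obj, f.name)
        line = f"{f.name} = {json.dumps(value)}"
        # Comment the line out when the value equals the class default.
        lines.append(f"# {line}" if value == defaultOf(f) else line)
    return "\n".join(lines)


print(renderWithCommentedDefaults(Preset(name="probe-A")))
```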

Note: hand-added comments in a TOML file are wiped on the next save() — the file is regenerated from the parsed Python dict, which doesn’t carry comments. Round-trip preservation of user-added comments is planned for a follow-up release.

HDF5¶

HDF5 is the right choice when your dataclasses contain large numpy arrays: recordings, images, simulation outputs, or any dataset where reading the whole file into memory upfront would be slow or wasteful. Unlike the text backends (YAML, JSON, and TOML), HDF5 stores arrays as compressed binary datasets, so a 100 MB array saves and loads in a fraction of the time it would take as text.

The HDF5 backend depends on h5py, which in turn requires the HDF5 C library — a non-trivial native dependency that adds significant installation overhead. It is therefore kept as an optional extra so that projects using only JSON or TOML don’t pay that cost.

Installation:

On most platforms (macOS, Windows, Linux x86_64), a pre-built wheel is available, so pip installs the dependency without compiling anything:

pip install "versionable[hdf5] @ git+https://github.com/hendrickmelo/versionable.git"

On Linux ARM or systems with an older glibc (e.g. RHEL 7), no pre-built wheel is available and pip will fall back to building from source. Install the HDF5 system library first:

sudo apt install libhdf5-dev   # Debian/Ubuntu
sudo yum install hdf5-devel    # RHEL/CentOS

Then run the pip install above. Users on conda-based environments can skip this — conda manages the HDF5 C library as a first-class package.

import numpy as np
import numpy.typing as npt
from dataclasses import dataclass
import versionable
from versionable import Versionable

@dataclass
class Recording(Versionable, version=1, hash="..."):
    name: str
    sampleRate_Hz: int
    data: npt.NDArray[np.float64]

rec = Recording(name="capture-1", sampleRate_Hz=240000, data=np.random.rand(1_000_000))
versionable.save(rec, "recording.h5")

Every field maps to a native HDF5 construct:

| Python type | HDF5 representation |
| --- | --- |
| `int`, `float`, `bool`, `str` | Scalar attribute |
| `np.ndarray` | Dataset (compressed) |
| `list[int]`, `list[float]`, `list[str]`, `list[bool]` | 1-D dataset |
| `list[np.ndarray]` | Group of integer-keyed datasets |
| `dict[str, np.ndarray]` | Group of named datasets |
| Nested `Versionable` | Subgroup with `__versionable__` metadata group |
| `list[Versionable]` | Group of integer-keyed subgroups |
| `None` | `h5py.Empty` attribute |
| `Enum` | Attribute (stores `.value`) |
| Converted types (datetime, Path, etc.) | Attribute (converter output) |

Metadata (object, version, hash) is stored as attributes on a __versionable__ child group at the root and inside each nested Versionable subgroup. This distinguishes Versionable groups from plain collection groups. The format attribute is reserved in this group for future versionable file format versioning.

Files are readable with h5dump, HDFView, MATLAB, or any HDF5-compatible tool. Reconstructing exact Python types (e.g., distinguishing list[float] from np.ndarray) requires the class’s type annotations.

Compression¶

By default, array datasets are compressed with gzip (level 4) for maximum compatibility across tools (MATLAB, HDFView, h5py without plugins). You can change the algorithm and level per-save by passing a compression kwarg:

from versionable.hdf5 import Hdf5Compression, BLOSC_DEFAULT, GZIP_DEFAULT, ZSTD_DEFAULT, UNCOMPRESSED

# Use a preset
versionable.save(rec, "recording.h5", compression=GZIP_DEFAULT)

# Or build a custom configuration
comp = Hdf5Compression(algorithm="zstd", level=9)
versionable.save(rec, "recording.h5", compression=comp)

Compression is a storage concern — it does not affect the schema hash and has no impact on load(). Any compressed file can be read back regardless of what compression was used to write it, as long as the required filter is available.

Compression is set per-dataset at creation time. When resuming a session, appending to an existing dataset uses the original dataset’s compression filter, not the session’s compression parameter. The session’s compression only applies to newly created datasets.

Available presets¶

| Preset | Speed | Size | When to use |
| --- | --- | --- | --- |
| `GZIP_DEFAULT` | 🐢 | 🗜️ | Default: universal compatibility |
| `ZSTD_DEFAULT` | 🚀 | 🗜️ | Good ratio and speed (requires hdf5plugin on reader) |
| `ZSTD_FAST` | ⚡⚡ | 📦 | Write speed matters more than file size |
| `ZSTD_BEST` | 🐢 | 🗜️🗜️ | Archival: smallest files, slower writes |
| `BLOSC_DEFAULT` | ⚡⚡ | 🗜️ | Large arrays: parallel blosc2 with zstd inside |
| `LZF` | ⚡ | 📦 | Fastest round-trip when ratio matters less than compatibility with other tools |
| `UNCOMPRESSED` | 🐰 | 📦📦 | Debugging, or data that doesn’t compress well |

Hdf5Compression fields¶

algorithm — "zstd" | "gzip" | "lzf" | "blosc" | None. Default: "gzip". Set to None for uncompressed.
level — int | None. Default: 4. Algorithm-specific level (zstd: 1–22, gzip: 0–9, blosc: 0–9).
shuffle — bool. Default: True. Byte-shuffle filter (improves compression ratio for numeric data).
bloscCompressor — "zstd" | "blosclz" | "lz4" | "lz4hc" | "zlib". Default: "zstd". Sub-compressor used when algorithm="blosc".

The zstd and blosc algorithms are provided by the hdf5plugin package, which is included in the [hdf5] extra. See the hdf5plugin docs for full details on filter parameters and tuning options. The gzip and lzf algorithms are built into h5py and work without hdf5plugin.

The BLOSC_DEFAULT preset uses blosc2 — a meta-compressor that adds parallel blocking, byte-shuffle, and cache-aligned chunking on top of the chosen sub-compressor. Buffer alignment and block sizes are handled automatically.
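
To see why the shuffle option listed under the Hdf5Compression fields tends to help with numeric data, the sketch below applies a byte shuffle by hand and compresses the result with zlib (the DEFLATE algorithm behind gzip). It uses only the standard library and is a conceptual illustration, not the HDF5 filter itself; `byteShuffle` and `byteUnshuffle` are hypothetical helper names.

```python
import struct
import zlib


def byteShuffle(raw: bytes, itemSize: int) -> bytes:
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    return b"".join(raw[i::itemSize] for i in range(itemSize))


def byteUnshuffle(shuffled: bytes, itemSize: int) -> bytes:
    """Invert byteShuffle, restoring the original item layout."""
    count = len(shuffled) // itemSize
    out = bytearray(len(shuffled))
    for i in range(itemSize):
        out[i::itemSize] = shuffled[i * count:(i + 1) * count]
    return bytes(out)


# 10,000 slowly increasing 32-bit integers, little-endian.
values = list(range(10_000))
raw = struct.pack(f"<{len(values)}i", *values)

plainSize = len(zlib.compress(raw, 4))
shuffledSize = len(zlib.compress(byteShuffle(raw, 4), 4))

print(f"raw bytes:         {len(raw)}")
print(f"zlib level 4:      {plainSize}")
print(f"shuffle + level 4: {shuffledSize}")
```

Shuffling places the rarely changing high-order bytes next to each other, which gives the compressor long repetitive runs to work with. The gain depends on the data, so measuring on a representative array is the safer guide.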

Compatibility note¶

The default GZIP_DEFAULT produces files readable by every HDF5 implementation. The ZSTD_* presets (and BLOSC_DEFAULT) produce files that require hdf5plugin on the reading side as well. Use them when all readers have the plugin installed and you need better speed or ratio:

versionable.save(rec, "recording.h5", compression=ZSTD_DEFAULT)

Lazy Loading¶

By default, array fields are not read from disk until first access. This means load() returns almost instantly even for large files — the array is fetched only when your code actually uses it:

loaded = versionable.load(Recording, "recording.h5")
loaded.name    # Loaded immediately (scalar)
loaded.data    # Read from disk on first access, then cached

Lazy loading also works per-element for collection fields:

list[np.ndarray] — returns a LazyArrayList where each element loads on indexing or iteration
dict[str, np.ndarray] — returns a LazyArrayDict where each value loads on key access

loaded = versionable.load(Experiment, "experiment.h5")
loaded.traces[0]         # Loads only the first trace
loaded.channels["ch0"]   # Loads only channel "ch0"

Lazy loading is particularly useful when you have many recordings on disk and only need to inspect metadata (name, sample rate, channel count) before deciding which ones to process.
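
The load-on-first-access behaviour can be sketched with `functools.cached_property`. This is a conceptual illustration under stated assumptions, not how versionable implements laziness: `LazyRecord` and its `loader` callback are hypothetical, and a plain list stands in for the on-disk array.

```python
from functools import cached_property
from typing import Callable


class LazyRecord:  # hypothetical stand-in for a lazily loaded object
    def __init__(self, name: str, loader: Callable[[], list]):
        self.name = name        # scalar: available immediately
        self._loader = loader   # deferred read of the array payload
        self.reads = 0          # counts how often the "disk" was touched

    @cached_property
    def data(self) -> list:
        # Runs on first access only; the result is then cached on the instance.
        self.reads += 1
        return self._loader()


record = LazyRecord("capture-1", loader=lambda: [0.1, 0.2, 0.3])
readsBeforeAccess = record.reads   # metadata is usable before any read
first = record.data                # triggers the one and only read
second = record.data               # served from the cache
print(record.name, readsBeforeAccess, record.reads, first is second)
```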

Preload¶

If you know you’ll need an array right away, you can opt into eager loading to avoid the latency hit at first access time — useful when you’re about to iterate over the data in a tight loop:

# Preload specific fields
loaded = versionable.load(Recording, "recording.h5", preload=["data"])

# Preload all arrays
loaded = versionable.load(Recording, "recording.h5", preload="*")

Metadata Only¶

Load only scalar fields and skip arrays entirely. Accessing an array field raises ArrayNotLoadedError. This is the fastest possible load — ideal for scanning a directory of files to build an index or filter by metadata before loading the full data:

loaded = versionable.load(Recording, "recording.h5", metadataOnly=True)
loaded.name    # Works
loaded.data    # Raises ArrayNotLoadedError

Save-As-You-Go Sessions¶

For scenarios where data arrives incrementally (DAQ streaming, simulation loops, long experiments), versionable.hdf5.open() provides a file-backed session that persists mutations as they happen:

from dataclasses import dataclass, field
import numpy as np
from numpy.typing import NDArray
import versionable
import versionable.hdf5
from versionable import Versionable

@dataclass
class Experiment(Versionable, version=1, hash="..."):
    name: str = ""
    sampleRate_Hz: float = 0.0
    traces: list[np.ndarray] = field(default_factory=list)
    timestamps: list[float] = field(default_factory=list)
    waveform: NDArray[np.float64] = field(default_factory=lambda: np.empty(0))

# You can pass a class (empty proxy) or an existing instance:
exp = Experiment(
    name="baseline",
    sampleRate_Hz=48000.0,
    traces=[],
    timestamps=[],
    waveform=np.empty((0, 1024)),
)

with versionable.hdf5.open(exp, "run001.h5") as exp:
    # All fields already persisted — just append
    for chunk in daq.stream():
        exp.traces.append(chunk.data)      # new dataset written to disk
        exp.timestamps.append(chunk.time)  # resizable dataset grows
        exp.waveform.append(chunk.raw)     # resizable dataset grows

# Load normally — no special API needed
exp = versionable.load(Experiment, "run001.h5")

All ndarray fields in a session are backed by resizable HDF5 datasets and wrapped with DatasetArray. Every ndarray supports .append(), element writes (write-through to disk), .resize(), and numpy interop — no annotation required.

Session Modes¶

| Mode | Behavior |
| --- | --- |
| `"create"` (default) | New file; error if file exists |
| `"overwrite"` | Delete existing file if present, create new |
| `"resume"` | Open existing file, restore state, continue appending |
| `"read"` | Open existing file read-only, no mutations allowed |

# Resume after a crash or between sessions
with versionable.hdf5.open(Experiment, "run001.h5", mode="resume") as exp:
    print(len(exp.traces))        # existing data is available
    exp.traces.append(new_data)   # appending continues from where it left off

# Read-only access — no mutations allowed
with versionable.hdf5.open(Experiment, "run001.h5", mode="read") as exp:
    print(np.mean(exp.waveform))  # numpy reads work
    # exp.name = "new"            # raises BackendError
    # exp.waveform[0] = 0         # raises BackendError
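
Read as file-system preconditions, the four modes in the table amount to the checks below. This is a sketch under assumptions: `prepareSessionFile` is a hypothetical helper, the exception types are illustrative, and the behaviour of "resume" and "read" on a missing file is assumed rather than documented above.

```python
from pathlib import Path


def prepareSessionFile(path: Path, mode: str = "create") -> bool:
    """Apply a mode's precondition; return True if the session may write."""
    if mode == "create":
        if path.exists():
            raise FileExistsError(f"{path} already exists")
        return True
    if mode == "overwrite":
        path.unlink(missing_ok=True)  # delete an existing file, if present
        return True
    if mode in ("resume", "read"):
        if not path.exists():
            # Assumption: an existing file is required for these modes.
            raise FileNotFoundError(f"{path} does not exist")
        return mode == "resume"  # "read" sessions are not writable
    raise ValueError(f"unknown mode {mode!r}")
```

Making "create" the default means an accidental rerun of an acquisition script fails loudly instead of silently replacing an earlier recording.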

`Hdf5FieldInfo` — Optional Layout Hints¶

All ndarray fields are resizable by default. Use Hdf5FieldInfo only when you need to override the chunk size or append axis:

from typing import Annotated
from versionable import Hdf5FieldInfo

# Explicit axis (default: inferred from zero-size dimension, or 0)
channels: Annotated[np.ndarray, Hdf5FieldInfo(axis=1)]

# Custom chunk size (default: ~256 KB heuristic)
highRes: Annotated[np.ndarray, Hdf5FieldInfo(chunkRows=128)]

Hdf5FieldInfo is pure annotation metadata — it’s ignored by save()/load() and non-HDF5 backends. The field hashes identically to a plain np.ndarray.
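
Such a hint can be ignored by other code paths because `typing.Annotated` keeps the underlying type and the attached metadata separate. The sketch below shows this with the standard library only; `LayoutHint` is a hypothetical stand-in for Hdf5FieldInfo (carrying just the two parameters shown above), and `list` stands in for np.ndarray so the example runs without numpy.

```python
from dataclasses import dataclass
from typing import Annotated, Optional, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class LayoutHint:  # hypothetical stand-in for Hdf5FieldInfo
    axis: int = 0
    chunkRows: Optional[int] = None


class Capture:
    plain: list
    channels: Annotated[list, LayoutHint(axis=1)]


def splitHints(cls) -> dict:
    """Map each field to (underlying type, list of LayoutHint metadata)."""
    result = {}
    for name, hint in get_type_hints(cls, include_extras=True).items():
        if get_origin(hint) is Annotated:
            base, *metadata = get_args(hint)
        else:
            base, metadata = hint, []
        hints = [m for m in metadata if isinstance(m, LayoutHint)]
        result[name] = (base, hints)
    return result


for name, (base, hints) in splitHints(Capture).items():
    print(name, base, hints)
```

Both fields resolve to the same underlying type, so code that looks only at the base type (for example, when computing a schema hash) sees no difference between them.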

Dtype Inference¶

The on-disk dtype is inferred from the field’s type annotation:

data: NDArray[np.float32]  # stored as float32 on disk, even if assigned float64

Bare np.ndarray fields use the assigned array’s dtype.

Tracked Collections¶

list[np.ndarray] — each .append() creates a new dataset in a group
list[float] / list[str] — .append() resizes a 1-D dataset
dict[str, np.ndarray] — __setitem__ creates/replaces datasets in a group
insert, pop, remove, sort, reverse raise NotImplementedError — build in memory and assign the whole list instead
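
The write-through idea behind tracked collections can be sketched with a list subclass that persists every append and rejects reordering operations. This is a conceptual illustration only: `JournaledList` is hypothetical, it writes JSON lines to a text file rather than HDF5 datasets, and it covers only a few of the methods listed above.

```python
import json
import os
import tempfile


class JournaledList(list):  # hypothetical; not versionable's TrackedList
    """A list whose append() is written to a journal file immediately."""

    def __init__(self, path: str):
        super().__init__()
        self._path = path

    def append(self, item) -> None:
        super().append(item)
        with open(self._path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(item) + "\n")  # one line per appended item

    def insert(self, index, item):
        raise NotImplementedError("build the list in memory and assign it whole")

    def pop(self, index=-1):
        raise NotImplementedError("build the list in memory and assign it whole")


journalPath = os.path.join(tempfile.mkdtemp(), "timestamps.jsonl")
timestamps = JournaledList(journalPath)
timestamps.append(0.0)
timestamps.append(0.5)

with open(journalPath, encoding="utf-8") as fh:
    onDisk = [json.loads(line) for line in fh]
print(onDisk == list(timestamps))
```

Append-only storage is cheap to keep in sync with disk; operations that shift or reorder elements would force rewriting everything after the change, which is a plausible reason for rejecting them.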

`flush()` for Durability¶

These operations write through to disk automatically — no flush() needed:

DatasetArray.__setitem__ — obj.data[50] = 42.0
DatasetArray.append() / resize()
TrackedList.append() / extend() / __setitem__
TrackedDict.__setitem__ / __delitem__ / update()
Scalar field assignment — obj.name = "new"

session.flush() flushes HDF5 internal buffers to the OS, ensuring data reaches disk even if the process crashes immediately after. Call it in long-running loops where you need a durability checkpoint:

session = versionable.hdf5.open(MyClass, "out.h5")
with session as obj:
    for batch in data_source:
        obj.data.append(batch)
        session.flush()  # ensure data survives a crash

Limitations¶

Sessions do not support migrations. The file’s version and hash must exactly match the class. If your schema has changed, use versionable.load() (which supports migrations) to load the old file, then re-save with a new session.
