matrixorigin/git4data-tutorial

MatrixOne Git4Data Tutorial

Runnable companion code for the MatrixOne Git4Data Deep Dive article series — Git-style version control for data at scale (commit, branch, diff, merge, cherry-pick, time travel), built into MatrixOne.

What is Git4Data? If you treat a database as a Git repository and a table as a file in it, MatrixOne lets you run everyday Git operations — snapshot, clone, branch, diff, merge, cherry-pick, restore — over terabytes of data, almost instantly. It's the same workflow software engineers use on code, now on data.

The series

Part | Theme | Topic | Code here
1 | Concept | The Git moment for data at scale | -
2 | Concept | Hands on: every Git primitive, from zero | 02-hands-on/
3 | Concept | Under the hood: why snapshot/diff/merge are this fast | -
4 | Data Ops | Incident rescue: snapshot / DIFF investigation / PITR | 04-incident-rescue/
5 | Data Ops | Collaborative development: one branch per engineer | 05-collaborative-dev/
6 | Data Ops | Write-Audit-Publish: a release gate for data | 06-write-audit-publish/
7 | AI Training | ML continuous learning: train only the delta | 07-ml-incremental/
8 | AI Training | SFT curation: clean in place, with receipts | 08-sft-curation/
9 | AI Training | Collaborative labeling: disagreement IS the conflict | 09-labeling-collab/
10 | AI Training | RLHF preference data: consensus, re-judging, reproducibility | 10-rlhf-preference/
11 | AI Training | Multimodal × lakeFS: bytes there, catalog here | 11-multimodal-lakefs/
12 | Agents | Agent memory: versioned, branchable, rewindable | 12-agent-memory/
13 | Agents | Agent traces: queryable, joinable, versioned | 13-agent-trace/
14 | Agents | Agent self-evolution (finale): branch / evaluate / merge / roll back | 14-agent-evolution/

Each later tutorial will add its own folder here.

Quick start (5 minutes)

# 1. Run a local MatrixOne (open source, MySQL-compatible)
docker run -d -p 6001:6001 --name matrixone matrixorigin/matrixone:4.0.0-rc1

# 2. Run the Part 2 walkthrough — every Git primitive on 1,000,000 rows
mysql -h 127.0.0.1 -P 6001 -u root -p111 < 02-hands-on/git4data_primitives.sql

Default credentials: user root, password 111, port 6001.
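
A container started with `docker run -d` may need a moment before it accepts connections, so step 2 can fail if it runs immediately after step 1. The sketch below is one way to wait for readiness. `wait_for` is a hypothetical helper, not part of this repository, and the retry count and delay are arbitrary.

```shell
# wait_for TRIES DELAY CMD...
# Hypothetical helper: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between failed attempts.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
wait_for() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only sleep if another attempt is coming.
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage with the quick-start credentials (commented out so the snippet
# can be sourced without a running server):
# wait_for 30 2 mysql -h 127.0.0.1 -P 6001 -u root -p111 -e 'SELECT 1' \
#   && mysql -h 127.0.0.1 -P 6001 -u root -p111 < 02-hands-on/git4data_primitives.sql
```

The probe is a trivial `SELECT 1` through the same client and credentials as the walkthrough, so a successful probe implies the script can connect too.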

What Part 2 covers

02-hands-on/git4data_primitives.sql is a single, copy-paste-runnable script (English comments) that walks through:

  • commit / tag / reset: CREATE SNAPSHOT, time-travel SELECT … {snapshot=…}, RESTORE
  • clone: zero-copy CREATE TABLE … CLONE
  • branch: lineage-tracked DATA BRANCH CREATE
  • diff: row-level DATA BRANCH DIFF … OUTPUT SUMMARY / COUNT / LIMIT / FILE
  • merge: three-way DATA BRANCH MERGE … WHEN CONFLICT FAIL | SKIP | ACCEPT
  • cherry-pick: DATA BRANCH PICK … KEYS(…)
  • point-in-time recovery: CREATE PITR + RESTORE … FROM PITR "…"
  • granularity: the same semantics at table / database / account / cluster levels
  • scale: measured numbers showing snapshot/clone/branch cost is independent of table size

It loads a million rows with a single generate_series statement (no external files needed) and cleans up after itself.

Measured: cost is independent of data size

Same table and same operations at three sizes, on a single-node Docker MatrixOne (4.0.0-rc1). Each diff and merge touches only 1,000 rows. Timings are steady-state medians of several runs:

Table size (rows) | Load | CREATE SNAPSHOT | CLONE | DATA BRANCH CREATE | DIFF (1,000 rows) | MERGE (1,000 rows)
1,000,000 | 0.5 s | 6 ms | 6 ms | 7 ms | 13 ms | 64 ms
10,000,000 | 5.3 s | 8 ms | 8 ms | 7 ms | 21 ms | 178 ms
100,000,000 | 41 s | 5 ms | 25 ms | 19 ms | 23 ms | 189 ms
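
The figures in this section are reported as medians of several runs. As a rough sketch of that reduction step, the hypothetical `median` helper below takes per-run timings in milliseconds and prints their median. How each timing is collected is left open here, and the sample values are made up for illustration; they are not measurements from this repository.

```shell
# median N1 N2 ...
# Hypothetical helper: print the median of the numbers given as arguments.
# For an even count it prints the mean of the two middle values.
median() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Five made-up timings (ms) for one operation:
median 6 8 5 7 6   # prints 6
```

A median is less sensitive than a mean to a single slow run, such as a cold first execution.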

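Snapshot time is flat across all three sizes (5 to 8 ms) because it only names a metadata directory. Clone and branch copy the metadata directory, not the data: with 100× the data, clone rises only from 6 ms to 25 ms and branch from 7 ms to 19 ms. Diff and merge scale mainly with how many rows changed, not with table size: for the same 1,000 changed rows, 100× the data moves diff from 13 ms to 23 ms and merge from 64 ms to 189 ms. (The first snapshot of a freshly loaded table takes ~10–12 ms, a one-time flush of in-memory data, then drops to the steady-state numbers above.)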
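
To make the comparison concrete, the sketch below computes how much each column grows between the 1,000,000-row and 100,000,000-row measurements. `growth` is a hypothetical helper; the inputs are the figures reported in this section.

```shell
# growth FROM TO
# Hypothetical helper: print TO / FROM rounded to one decimal place.
growth() {
  awk -v from="$1" -v to="$2" 'BEGIN { printf "%.1f\n", to / from }'
}

# Ratios between the 1,000,000-row and 100,000,000-row measurements:
echo "rows:     $(growth 1000000 100000000)x"   # 100.0x
echo "load:     $(growth 0.5 41)x"              # 82.0x
echo "snapshot: $(growth 6 5)x"                 # 0.8x
echo "clone:    $(growth 6 25)x"                # 4.2x
echo "branch:   $(growth 7 19)x"                # 2.7x
echo "diff:     $(growth 13 23)x"               # 1.8x
echo "merge:    $(growth 64 189)x"              # 3.0x
```

Load time grows roughly in step with row count (82× for 100× the rows), while every versioning operation grows by about 4× or less.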

License

Apache 2.0
