Replace spreadsheets and wikis with a single source of truth for source-to-target data mappings. Readable by product owners and business stakeholders, parseable by tooling, and AI-native by design.
// Legacy Customer Migration
schema legacy_customers {
  CUST_ID INT (pk)
  CUST_TYPE CHAR(1) (enum {R, B, G})
  EMAIL_ADDR VARCHAR(255) (pii) //! not validated
  PHONE_NBR VARCHAR(50) //! mixed formats
  COUNTRY_CD CHAR(2)
}

schema customers {
  customer_id UUID (pk, required)
  display_name VARCHAR(200) (required)
  email VARCHAR(255) (format email, pii)
  phone VARCHAR(20) (format E.164)
}

mapping `customer migration` {
  source { `legacy_customers` }
  target { `customers` }

  CUST_ID -> customer_id { uuid_v5("6ba7b...", CUST_ID) }
  EMAIL_ADDR -> email { trim | lowercase | validate_email }
  -> display_name {
    "If @CUST_TYPE is R or null, concat @FIRST_NM + @LAST_NM.
     Otherwise use @COMPANY_NM. Trim and title-case."
  }
  PHONE_NBR -> phone {
    "Extract digits. If 10 digits assume US +1.
     For other patterns, determine country from @COUNTRY_CD."
    | warn_if_invalid
  }
}
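The uuid_v5 transform above gives every legacy integer key a stable, reproducible UUID. A minimal Python sketch of the same idea (the namespace UUID below is illustrative; the spec truncates the real value to "6ba7b..."):

```python
import uuid

# Illustrative namespace only; the spec above truncates the real value.
NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")

def legacy_key_to_uuid(cust_id: int) -> str:
    # uuid5 is deterministic: the same (namespace, name) pair always
    # produces the same UUID, so repeated migration runs are idempotent.
    return str(uuid.uuid5(NAMESPACE, str(cust_id)))
```

Determinism matters here: re-running a failed batch regenerates identical keys instead of creating duplicates.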
The VS Code extension gives you a live, interactive overview of your entire mapping workspace — schemas, mappings, and data flow — right alongside your code.
In most enterprises, source-to-target mapping logic lives in spreadsheets, wiki pages, Confluence docs, YAML files, and tribal knowledge. Nobody trusts the docs, nobody can trace lineage, and AI can't help because there's no parseable format.
Satsuma is a concise, beautiful, parseable format that becomes the single source of truth. Like DBML for database schemas, but for data mappings. Version it in Git, validate it with tooling, and let AI agents help write it.
Satsuma's parenthesized ( ) metadata accepts any token you invent, so you can extend the language to describe anything without waiting for a language update:
Document what your tokens mean once in an LLM-Guidelines.md. From that point on, every AI tool in your org reads the spec and generates the right code — correct merge logic, correct DDL, correct tests — without re-explaining your standards in every prompt.
schema payments (
  owner "payments-team",
  data_domain "finance",
  cost_center "CC-4200",
  audit_level high,
  compliance {PCI-DSS, SOX}
) {
  card_number STRING (pii, encrypt AES-256, mask last_four)
  amount DECIMAL (required)
}
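For example, tooling (or an AI agent following your LLM-Guidelines.md) could translate the mask last_four token into masking code. A minimal Python sketch; the function name is ours:

```python
def mask_card_number(card_number: str) -> str:
    """Apply the (mask last_four) policy: show only the final four digits."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

print(mask_card_number("4111111111111111"))  # ************1111
```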
Designed for real-world enterprise data mapping, not toy examples.
Product owners, analysts, and non-technical stakeholders can read and review mappings without learning a programming language. Natural language is first-class.
Tree-sitter grammar with 532 corpus tests. Every CLI command derives its results deterministically from the parse tree.
Structured syntax that LLMs understand natively. CLI tooling gives AI agents deterministic workspace context to work with.
More concise than equivalent YAML or JSON specs, and diffs cleanly in Git.
From legacy database migrations to enterprise platform modelling. Every construct serves a purpose.
// Legacy Customer Migration
import { `address fields` } from "../lib/common.stm"

note {
  """
  # Legacy Customer Migration

  Part of **Project Phoenix** — decommissioning the legacy SQL Server
  2008 instance by Q2 2026.

  ## Constraints
  - Runs in **batches of 10,000** to prevent memory issues
  - Target enforces referential integrity — addresses before customers
  - Estimated total: **2.4M records**, ~180 batches
  """
}

schema legacy_sqlserver (note "CUSTOMER table — SQL Server 2008") {
  CUST_ID INT (pk)
  CUST_TYPE CHAR(1) (enum {R, B, G}) //! some records have NULL
  EMAIL_ADDR VARCHAR(255) (pii) //! not validated
  PHONE_NBR VARCHAR(50) //! mixed formats
  COUNTRY_CD CHAR(2) (default US)
  CREDIT_LIMIT DECIMAL(12,2)
  TAX_ID VARCHAR(20) (pii, encrypt) //! plaintext in legacy
}

schema postgres_db (note "Normalized — PostgreSQL 16") {
  customer_id UUID (pk, required)
  customer_type VARCHAR(20) (enum {retail, business, government})
  display_name VARCHAR(200) (required)
  email VARCHAR(255) (format email, pii)
  phone VARCHAR(20) (format E.164)
  address_id UUID (ref @addresses.id)
  credit_limit_cents BIGINT (default 0)
  tax_encrypted TEXT (pii, encrypt AES-256-GCM)
  ...`address fields`
}

mapping `customer migration` {
  source { `legacy_sqlserver` }
  target { `postgres_db` }

  CUST_ID -> customer_id { uuid_v5("6ba7b...", CUST_ID) }

  CUST_TYPE -> customer_type {
    map { R: "retail", B: "business", G: "government", null: "retail" }
  }

  -> display_name {
    "If @CUST_TYPE is R or null, concat @FIRST_NM + ' ' + @LAST_NM.
     Otherwise use @COMPANY_NM. Trim and title-case."
  }

  EMAIL_ADDR -> email { trim | lowercase | validate_email | null_if_invalid }

  PHONE_NBR -> phone {
    "Extract all digits. If 10 digits, assume US country code +1.
     Format as E.164. For other patterns, determine country from @COUNTRY_CD."
    | warn_if_invalid
  }

  -> address_id {
    "Create record in @addresses from @ADDR_LINE_1, @CITY, @STATE_PROV,
     @ZIP_POSTAL, @COUNTRY_CD. Normalize state to 2-char code.
     Deduplicate by normalized street + zip. Return UUID."
  }

  CREDIT_LIMIT -> credit_limit_cents { coalesce(0) | * 100 | round }

  TAX_ID -> tax_encrypted {
    error_if_null | encrypt(AES-256-GCM, secrets.tax_encryption_key)
  }
}
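Given a spec like this, the deterministic transforms translate mechanically to any target stack. A pandas sketch of three of the mappings above (sample data and variable names are ours; the natural-language rules would be left as clearly marked TODOs for an agent or a human):

```python
import pandas as pd

legacy = pd.DataFrame({
    "CUST_TYPE": ["R", "B", None],
    "EMAIL_ADDR": ["  Ann@Example.COM ", "bob@corp.com", None],
    "CREDIT_LIMIT": [1500.50, None, 99.99],
})

target = pd.DataFrame()

# CUST_TYPE -> customer_type: value map, null falls back to "retail"
type_map = {"R": "retail", "B": "business", "G": "government"}
target["customer_type"] = legacy["CUST_TYPE"].map(type_map).fillna("retail")

# EMAIL_ADDR -> email: trim | lowercase
# (validate_email | null_if_invalid needs a real validator -- TODO)
target["email"] = legacy["EMAIL_ADDR"].str.strip().str.lower()

# CREDIT_LIMIT -> credit_limit_cents: coalesce(0) | * 100 | round
target["credit_limit_cents"] = (
    (legacy["CREDIT_LIMIT"].fillna(0) * 100).round().astype("int64")
)
```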
Satsuma serves the entire data team, from architects to product owners.
Product managers, business analysts, and non-technical stakeholders can read and review mappings without learning code. Natural language descriptions are built right into the syntax.
Start the tutorial →

Define mapping contracts that you can use with AI tools to create pipelines for your platform and tech stack. Write data quality tests, create sample test data, populate governance metadata. AI tools achieve much better implementation adherence to specs than loose spreadsheets allow. Use simple metadata extensions for analytics modelling styles like Kimball or Data Vault.

Explore the CLI →

Portable, version-controlled mapping specs. REST APIs, EDI, XML, COBOL, ESBs, database tables — all supported.

See examples →

Trace every field from source to target. PII tags, field descriptions, encryption markers, and lineage are structural — not comments.

Learn about lineage →

A structured format LLMs can reliably generate and validate. Build safe automation with the agent reference guide.

Agent reference →

One mapping language across the entire platform. Namespaces, imports, and workspace graphs for multi-team modelling.

Platform modelling →

Parser-backed tools that give you deterministic, correct results. Not regex heuristics.
21 commands · 824 tests
Full LSP · 179 tests
The best experience comes with the full toolchain, but Satsuma delivers real value even if you can't install anything. Locked-down laptop? Working through a web LLM? You still get the core benefits.
.stm files diff cleanly and merge naturally in any Git workflow.
Draft specs from requirements
Upload a spreadsheet or requirements doc to any LLM and ask it to produce a .stm file.
Review and explain specs
Paste a .stm file and ask for a walkthrough, risk assessment, or onboarding summary.
Generate implementation scaffolds
Ask the LLM to produce dbt, PySpark, SQL, or pandas code from your mapping spec.
Use Agent Skills
Drop-in Agent Skills for Excel ↔ Satsuma, dbt ↔ Satsuma, OpenLineage export, synthetic test data, and plain-English spec explainers.
Everything you need to know about adopting Satsuma.
No. Satsuma is not a pipeline engine or a schema modelling tool. It is a specification language that sits upstream of your implementation stack. You write your mapping intent in Satsuma, then use that spec to generate or validate implementations in dbt, PySpark, pandas, SQL, DuckDB, or whatever your team runs. Think of it as the blueprint, not the building.
Satsuma is designed to be read and written by AI agents. Give your coding agent — GitHub Copilot, OpenAI Codex, Claude Code, Cursor, Windsurf, or any agentic coding tool — a .stm file and it can generate high-fidelity scaffolds for frameworks like dbt on Snowflake, Databricks Lakehouse Declarative Pipelines, pandas, dltHub, DuckDB, PySpark, and more. The structured syntax means the agent doesn't have to guess where a field name ends and a business rule begins — ambiguity is made explicit, so the generated code is better.
Satsuma also ships a suite of Agent Skills (following the agentskills.io standard) that work out of the box with Claude Code and Claude Desktop: convert Excel ↔ Satsuma, reverse-engineer specs from dbt, scaffold dbt projects, generate synthetic test data, export OpenLineage events, and produce plain-English spec explanations for stakeholders.
You can still get real value. Satsuma files are plain text — write them in any editor and version them in Git. For AI assistance, paste the compact grammar reference into any web LLM like ChatGPT, Gemini, or Claude.ai. That's enough for the model to generate, review, and explain Satsuma specs. No CLI, no VS Code, no agent required. Read the full guide for workflows and tips.
Yes, that's one of the core motivations, always in combination with human judgement and review. Because Satsuma separates deterministic structure (field mappings, types, value maps, pipe chains) from intentionally scoped natural language (complex business rules), an AI agent can generate correct code for the deterministic parts and either leave clearly marked TODOs or ask you about the parts that need human judgement. This is a dramatic improvement over generating pipelines from spreadsheets, where the agent has to guess what's a field name, what's a comment, and what's a transformation rule.
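For instance, the display_name rule in the example above is scoped tightly enough that an agent can render it directly as code, quoting the rule back as a comment for human review (the helper name and dict-based row are our illustration, not generated output):

```python
def build_display_name(row: dict) -> str:
    # Spec rule: "If @CUST_TYPE is R or null, concat @FIRST_NM + ' ' + @LAST_NM.
    #             Otherwise use @COMPANY_NM. Trim and title-case."
    if row.get("CUST_TYPE") in ("R", None):
        name = f"{row.get('FIRST_NM', '')} {row.get('LAST_NM', '')}"
    else:
        name = row.get("COMPANY_NM", "")
    return name.strip().title()

print(build_display_name({"CUST_TYPE": "B", "COMPANY_NM": "  acme corp "}))  # Acme Corp
```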
Satsuma is stack-agnostic. The same .stm file can be used to generate implementations for dbt (Snowflake, BigQuery, Redshift), Databricks Lakehouse Declarative Pipelines, PySpark, pandas, DuckDB, dltHub, SQL stored procedures, Apache Beam, and more. Integration teams can use Satsuma specs with AI assistance to build integrations in webMethods, MuleSoft, Azure Logic Apps, cloud microservices, and other middleware — the structured mapping format gives the agent exactly the context it needs to generate integration flows, not just data pipelines. Your mapping spec is the contract; the implementation target is a generation-time choice.
YAML and JSON are verbose — a simple field mapping that takes one line in Satsuma becomes 5–7 lines of YAML. Spreadsheets are worse: inconsistent layouts, no version control, and AI tools struggle to parse free-form cell formats reliably. Satsuma is purpose-built for data mappings, so it's 40–60% smaller than YAML, diffs cleanly in Git, and gives both humans and AI a format that's unambiguous by design.
Satsuma is in an early experimental stage. The language spec, parser, CLI, and VS Code extension are functional and well-tested, but the project is young and evolving. We're actively looking for feedback from data engineers, integration teams, and anyone who works with mapping specs. If you try it, we'd love to hear what works and what doesn't — open an issue or start a discussion on GitHub.
That's the primary design goal. The syntax reads like a structured document, not a programming language. Product owners, business analysts, and other non-technical stakeholders can follow the arrows (source_field -> target_field), understand metadata like (pii, required), and read natural-language transform descriptions directly. No programming experience required — there's a dedicated tutorial for non-technical readers to get started in minutes.
Satsuma is open source and free. Start mapping in minutes.