Satsuma

The mapping language humans can read, machines can parse, and AIs can reason about.

Replace spreadsheets and wikis with a single source of truth for source-to-target data mappings. Readable by product owners and business stakeholders, parseable by tooling, and AI-native by design.

customer_migration.stm
// Legacy Customer Migration

schema legacy_customers {
  CUST_ID     INT           (pk)
  CUST_TYPE   CHAR(1)       (enum {R, B, G})
  EMAIL_ADDR  VARCHAR(255)  (pii)       //! not validated
  PHONE_NBR   VARCHAR(50)               //! mixed formats
  COUNTRY_CD  CHAR(2)
}

schema customers {
  customer_id  UUID          (pk, required)
  display_name VARCHAR(200)  (required)
  email        VARCHAR(255)  (format email, pii)
  phone        VARCHAR(20)   (format E.164)
}

mapping `customer migration` {
  source { `legacy_customers` }
  target { `customers` }

  CUST_ID    -> customer_id { uuid_v5("6ba7b...", CUST_ID) }
  EMAIL_ADDR -> email       { trim | lowercase | validate_email }

  -> display_name {
    "If @CUST_TYPE is R or null, concat @FIRST_NM + @LAST_NM.
     Otherwise use @COMPANY_NM. Trim and title-case."
  }

  PHONE_NBR -> phone {
    "Extract digits. If 10 digits assume US +1.
     For other patterns, determine country from @COUNTRY_CD."
    | warn_if_invalid
  }
}
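
The `|` chains above compose simple transforms left to right. As a rough sketch of the semantics in Python (the helper names mirror the spec's tokens; the email check is a deliberately simple stand-in, not Satsuma's actual validator):

```python
import re
from functools import reduce

def pipe(*steps):
    """Compose transforms left to right, mirroring Satsuma's `|` chains."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

def trim(s: str) -> str:
    return s.strip()

def lowercase(s: str) -> str:
    return s.lower()

def validate_email(s: str) -> str:
    # Deliberately simplistic placeholder check, for illustration only.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", s):
        raise ValueError(f"invalid email: {s!r}")
    return s

# EMAIL_ADDR -> email { trim | lowercase | validate_email }
normalize_email = pipe(trim, lowercase, validate_email)
```

With this sketch, `normalize_email("  Jane.Doe@Example.COM ")` yields `"jane.doe@example.com"`, and an invalid address raises rather than passing through silently.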

See it in action

The VS Code extension gives you a live, interactive overview of your entire mapping workspace — schemas, mappings, and data flow — right alongside your code.

[Screenshot: Satsuma VS Code extension showing the workspace overview with schema nodes, mapping blocks, and directional arrows]

The Problem

Data mappings are scattered everywhere

In most enterprises, source-to-target mapping logic lives in spreadsheets, wiki pages, Confluence docs, YAML files, and tribal knowledge. Nobody trusts the docs, nobody can trace lineage, and AI can't help because there's no parseable format.

  • Excel specs that are outdated by the time they're reviewed
  • No way to diff, lint, or validate mapping changes
  • AI can't reason well about human-optimised spreadsheets with free-form layouts, or derive high-fidelity pipeline code from them

The Solution

One language for all your mappings

Satsuma is a concise, beautiful, parseable format that becomes the single source of truth. Like DBML for database schemas, but for data mappings. Version it in Git, validate it with tooling, and let AI agents help write it.

  • Text-based, diffable, version-controlled
  • Parser-backed tooling: lint, validate, lineage, graph
  • LLMs generate valid Satsuma using the parsing tools, and can generate high-fidelity pipeline implementations, DQ tests, and sample data in your tech stack of choice, with areas of ambiguity made explicit

The superpower

Your conventions. Your tokens.
No language changes. Ever.

Satsuma's ( ) metadata accepts any token you invent. That means you can extend the language to describe anything — without waiting for an update:

  • Exotic file formats — COBOL copybooks, EDI qualifiers, HL7 segments, ISO 8583 bitmaps
  • Governance metadata — PII flags, classification, retention, masking, compliance frameworks
  • Pipeline construction — merge strategies, SCD types, match keys, refresh schedules
  • Analytics modelling — Kimball dimensions, Data Vault hubs, grain declarations, conformed keys
  • Your org’s standards — cost centres, data domains, audit levels, SLA tiers

Document what your tokens mean once in an LLM-Guidelines.md file. From that point on, every AI tool in your org reads the spec and generates the right code — correct merge logic, correct DDL, correct tests — without re-explaining your standards in every prompt.

your_conventions.stm
schema payments (
  owner "payments-team",
  data_domain "finance",
  cost_center "CC-4200",
  audit_level high,
  compliance {PCI-DSS, SOX}
) {
  card_number  STRING   (pii, encrypt AES-256,
    mask last_four)
  amount       DECIMAL  (required)
}
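
Because the ( ) tokens are parsed rather than pattern-matched, tooling and agents can consume them as structured metadata. A minimal sketch of the idea, assuming a dict-shaped parse result and an invented `requires_masking` convention (neither is Satsuma's actual CLI output):

```python
# Hypothetical parsed view of the schema and field tokens above;
# the real parser's output shape may differ.
schema_meta = {
    "owner": "payments-team",
    "data_domain": "finance",
    "cost_center": "CC-4200",
    "audit_level": "high",
    "compliance": ["PCI-DSS", "SOX"],
}

card_number_meta = {"type": "STRING", "pii": True,
                    "encrypt": "AES-256", "mask": "last_four"}

def requires_masking(field_meta: dict) -> bool:
    # Invented org convention (the kind you would document once in
    # LLM-Guidelines.md): any pii field carrying a `mask` token must
    # be masked outside production.
    return field_meta.get("pii", False) and "mask" in field_meta
```

The point is that `mask last_four` never needed a language change: it is just a token your tools and guidelines agree on.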

Why teams choose Satsuma

Designed for real-world enterprise data mapping, not toy examples.

Human-Readable

Product owners, analysts, and non-technical stakeholders can read and review mappings without learning a programming language. Natural language is first-class.

Machine-Parseable

Tree-sitter grammar with 532 corpus tests. Every CLI command derives its results deterministically from the parse tree.

AI-Native

Structured syntax that LLMs understand natively. CLI tooling gives AI agents deterministic workspace context to work with.

40–60% Smaller

More concise than equivalent YAML or JSON specs, using 3–8x fewer tokens than mapping spreadsheets. Less noise, more signal.

21 CLI Commands · 532 Parser Tests · 824 CLI Tests · 179 LSP Tests

Real-world mappings, beautifully expressed

From legacy database migrations to enterprise platform modelling. Every construct serves a purpose.

db-to-db/pipeline.stm
// Legacy Customer Migration
import { `address fields` } from "../lib/common.stm"

note {
  """
  # Legacy Customer Migration

  Part of **Project Phoenix** — decommissioning the legacy SQL Server
  2008 instance by Q2 2026.

  ## Constraints
  - Runs in **batches of 10,000** to prevent memory issues
  - Target enforces referential integrity — addresses before customers
  - Estimated total: **2.4M records**, ~180 batches
  """
}

schema legacy_sqlserver (note "CUSTOMER table — SQL Server 2008") {
  CUST_ID       INT           (pk)
  CUST_TYPE     CHAR(1)       (enum {R, B, G})    //! some records have NULL
  EMAIL_ADDR    VARCHAR(255)  (pii)               //! not validated
  PHONE_NBR     VARCHAR(50)                       //! mixed formats
  COUNTRY_CD    CHAR(2)       (default US)
  CREDIT_LIMIT  DECIMAL(12,2)
  TAX_ID        VARCHAR(20)   (pii, encrypt)      //! plaintext in legacy
}

schema postgres_db (note "Normalized — PostgreSQL 16") {
  customer_id       UUID          (pk, required)
  customer_type     VARCHAR(20)   (enum {retail, business, government})
  display_name      VARCHAR(200)  (required)
  email             VARCHAR(255)  (format email, pii)
  phone             VARCHAR(20)   (format E.164)
  address_id        UUID          (ref @addresses.id)
  credit_limit_cents BIGINT       (default 0)
  tax_encrypted     TEXT          (pii, encrypt AES-256-GCM)
  ...`address fields`
}

mapping `customer migration` {
  source { `legacy_sqlserver` }
  target { `postgres_db` }

  CUST_ID -> customer_id { uuid_v5("6ba7b...", CUST_ID) }

  CUST_TYPE -> customer_type {
    map { R: "retail", B: "business", G: "government", null: "retail" }
  }

  -> display_name {
    "If @CUST_TYPE is R or null, concat @FIRST_NM + ' ' + @LAST_NM.
     Otherwise use @COMPANY_NM. Trim and title-case."
  }

  EMAIL_ADDR -> email { trim | lowercase | validate_email | null_if_invalid }

  PHONE_NBR -> phone {
    "Extract all digits. If 10 digits, assume US country code +1.
     Format as E.164. For other patterns, determine country from @COUNTRY_CD."
    | warn_if_invalid
  }

  -> address_id {
    "Create record in @addresses from @ADDR_LINE_1, @CITY, @STATE_PROV,
     @ZIP_POSTAL, @COUNTRY_CD. Normalize state to 2-char code.
     Deduplicate by normalized street + zip. Return UUID."
  }

  CREDIT_LIMIT -> credit_limit_cents { coalesce(0) | * 100 | round }

  TAX_ID -> tax_encrypted {
    error_if_null | encrypt(AES-256-GCM, secrets.tax_encryption_key)
  }
}
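
An agent pointed at this spec can scaffold the deterministic rules directly and flag the natural-language ones for review. A minimal Python sketch of that split (the namespace choice, helper names, and TODO markers are illustrative, not actual tool output):

```python
import uuid

# Stand-in namespace for illustration; the spec's actual namespace
# UUID is elided above ("6ba7b...").
CUSTOMER_NS = uuid.NAMESPACE_DNS

# Deterministic value map, taken straight from the spec's map { ... } block.
TYPE_MAP = {"R": "retail", "B": "business", "G": "government", None: "retail"}

def transform_customer(row: dict) -> dict:
    return {
        # CUST_ID -> customer_id { uuid_v5(..., CUST_ID) }
        "customer_id": str(uuid.uuid5(CUSTOMER_NS, str(row["CUST_ID"]))),
        # CUST_TYPE -> customer_type { map { ... } }
        "customer_type": TYPE_MAP[row.get("CUST_TYPE")],
        # CREDIT_LIMIT -> credit_limit_cents { coalesce(0) | * 100 | round }
        "credit_limit_cents": round((row.get("CREDIT_LIMIT") or 0) * 100),
        # TODO (human judgement): display_name, phone, address_id; the spec
        # states these rules in natural language, so they are flagged for
        # review rather than guessed at.
    }
```

Note how the natural-language rules surface as explicit TODOs instead of silently invented logic; that separation is what the pipe chains and quoted descriptions buy you.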

Built for your role

Satsuma serves the entire data team, from architects to product owners.

Professional tooling, built-in

Parser-backed tools that give you deterministic, correct results. Not regex heuristics.

No installation required

Works without the CLI or VS Code extension

The best experience comes with the full toolchain, but Satsuma delivers real value even if you can't install anything. Locked-down laptop? Working through a web LLM? You still get the core benefits.

  • Token-efficient specs — 3–8x more compact than spreadsheets. Fits in any LLM context window.
  • Plain text, version-controlled — .stm files diff cleanly and merge naturally in any Git workflow.
  • Works with any LLM — paste the compact grammar reference into ChatGPT, Gemini, or Claude.ai and start writing specs.
  • Graduate when ready — when your environment allows installation, the CLI and extension slot in without changing your files.
Read the full guide →

Example workflows without tooling

1. Draft specs from requirements: Upload a spreadsheet or requirements doc to any LLM and ask it to produce a .stm file.

2. Review and explain specs: Paste a .stm file and ask for a walkthrough, risk assessment, or onboarding summary.

3. Generate implementation scaffolds: Ask the LLM to produce dbt, PySpark, SQL, or pandas code from your mapping spec.

4. Use Agent Skills: Drop-in Agent Skills for Excel ↔ Satsuma, dbt ↔ Satsuma, OpenLineage export, synthetic test data, and plain-English spec explainers.

Frequently asked questions

Everything you need to know about adopting Satsuma.

Does Satsuma replace dbt, DBML, PySpark, or my existing data tools?

No. Satsuma is not a pipeline engine or a schema modelling tool. It is a specification language that sits upstream of your implementation stack. You write your mapping intent in Satsuma, then use that spec to generate or validate implementations in dbt, PySpark, pandas, SQL, DuckDB, or whatever your team runs. Think of it as the blueprint, not the building.

How does Satsuma work with AI coding agents?

Satsuma is designed to be read and written by AI agents. Give your coding agent — GitHub Copilot, OpenAI Codex, Claude Code, Cursor, Windsurf, or any agentic coding tool — a .stm file and it can generate high-fidelity scaffolds for frameworks like dbt on Snowflake, Databricks Lakehouse Declarative Pipelines, pandas, dltHub, DuckDB, PySpark, and more. The structured syntax means the agent doesn't have to guess where a field name ends and a business rule begins — ambiguity is made explicit, so the generated code is better.

Satsuma also ships a suite of Agent Skills (following the agentskills.io standard) that work out of the box with Claude Code and Claude Desktop: convert Excel ↔ Satsuma, reverse-engineer specs from dbt, scaffold dbt projects, generate synthetic test data, export OpenLineage events, and produce plain-English spec explanations for stakeholders.

What if I'm not allowed to install the CLI or don't have any agentic coding tools?

You can still get real value. Satsuma files are plain text — write them in any editor and version them in Git. For AI assistance, paste the compact grammar reference into any web LLM like ChatGPT, Gemini, or Claude.ai. That's enough for the model to generate, review, and explain Satsuma specs. No CLI, no VS Code, no agent required. Read the full guide for workflows and tips.

Can AI agents generate entire pipeline implementations from a Satsuma spec?

Yes — that's one of the core motivations, always in combination with human judgement and review. Because Satsuma separates deterministic structure (field mappings, types, value maps, pipe chains) from intentionally scoped natural language (complex business rules), an AI agent can generate correct code for the deterministic parts, and either leave clearly marked TODOs or ask you about the parts that need human judgement. This is a dramatic improvement over trying to generate pipelines from spreadsheets, where the agent has to guess what's a field name, what's a comment, and what's a transformation rule.

What tech stacks can I target from a Satsuma spec?

Satsuma is stack-agnostic. The same .stm file can be used to generate implementations for dbt (Snowflake, BigQuery, Redshift), Databricks Lakehouse Declarative Pipelines, PySpark, pandas, DuckDB, dltHub, SQL stored procedures, Apache Beam, and more. Integration teams can use Satsuma specs with AI assistance to build integrations in webMethods, MuleSoft, Azure Logic Apps, cloud microservices, and other middleware — the structured mapping format gives the agent exactly the context it needs to generate integration flows, not just data pipelines. Your mapping spec is the contract; the implementation target is a generation-time choice.

Why not just use YAML, JSON, or a spreadsheet?

YAML and JSON are verbose — a simple field mapping that takes one line in Satsuma becomes 5–7 lines of YAML. Spreadsheets are worse: inconsistent layouts, no version control, and AI tools struggle to parse free-form cell formats reliably. Satsuma is purpose-built for data mappings, so it's 40–60% smaller than YAML, diffs cleanly in Git, and gives both humans and AI a format that's unambiguous by design.

How mature is Satsuma? Is it production-ready?

Satsuma is in an early experimental stage. The language spec, parser, CLI, and VS Code extension are functional and well-tested, but the project is young and evolving. We're actively looking for feedback from data engineers, integration teams, and anyone who works with mapping specs. If you try it, we'd love to hear what works and what doesn't — open an issue or start a discussion on GitHub.

Can non-technical people read Satsuma?

That's the primary design goal. The syntax reads like a structured document, not a programming language. Product owners, business analysts, and other non-technical stakeholders can follow the arrows (source_field -> target_field), understand metadata like (pii, required), and read natural-language transform descriptions directly. No programming experience required — there's a dedicated tutorial for non-technical readers to get started in minutes.

Ready to simplify your data mappings?

Satsuma is open source and free. Start mapping in minutes.