felderize

Felderize — SQL to Feldera SQL Translator

felderize translates SQL from various dialects into valid Feldera SQL using LLM-based translation with optional compiler validation.

Dialects: Spark SQL is currently the only supported dialect. Support for additional dialects is planned.

Setup

cd python/felderize
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Note: pip install -e . is required before running felderize. It registers the package and CLI command.

Download the Feldera SQL compiler JAR (requires Java 19–21 installed):

felderize download-compiler

This fetches the latest sql2dbsp-jar-with-dependencies-*.jar from GitHub Releases and saves it to ~/.felderize/. The command prints the exact path — copy it for the next step. Re-run it any time to pick up a newer release; it reports whether you are already on the latest one.

Requirement: felderize needs compiler v0.304.0 or newer — earlier releases lack SQL features felderize relies on (e.g. div_null, MAKE_DATE). download-compiler always fetches the latest release, and felderize warns at validation time if the configured compiler is older than v0.304.0.

Note: Java 19, 20, or 21 must be installed and on your PATH before running validation. Check with java -version. Later versions (22+) are not supported.

Create a .env file in the python/felderize/ directory:

ANTHROPIC_API_KEY=your-key-here
FELDERA_COMPILER=~/.felderize/sql2dbsp-jar-with-dependencies-vX.Y.Z.jar
FELDERIZE_MODEL=claude-sonnet-4-6

All three variables are required. FELDERA_COMPILER is used only for validation — translation still works without it, but output SQL is not verified. You can also pass --compiler PATH and --model MODEL per command.

Note: felderize currently requires an Anthropic API key — only Claude models are supported.

Tip — use a large-context-window model. Each request bundles the full rule set (felderize/skills/spark_skills.md), validated examples, and (on the retry pass) Feldera reference documentation on top of your schema and query — easily tens of thousands of tokens, and the --validate repair and docs passes add more. Prefer a model with a large context window (e.g. a recent Claude Sonnet/Opus) so nothing is truncated and the model has the full rules in view; set it via FELDERIZE_MODEL or --model. If a program still doesn't fit, felderize stops with an error asking you to shorten the input (translate fewer views at a time, drop unused tables, or split it into smaller files).

Usage

Run a built-in example

# List available examples
felderize spark example

# Translate an example (validates by default)
felderize spark example simple

# Without compiler validation
felderize spark example simple --no-validate

# Log SQL submitted to the validator at each attempt
felderize spark example json --verbose

# Use a specific compiler binary
felderize spark example simple --compiler /path/to/sql-to-dbsp

# Output as JSON
felderize spark example simple --json-output

Available examples:

Name	Description
`simple`	Date truncation, GROUP BY
`strings`	INITCAP, LPAD, NVL, CONCAT_WS
`arrays`	array_contains, size, element_at
`joins`	Null-safe equality (`<=>`)
`windows`	LAG, running SUM OVER
`aggregations`	COUNT DISTINCT, AVG, SUM, HAVING
`json`	get_json_object → PARSE_JSON + VARIANT access (combined file)
`topk`	ROW_NUMBER TopK, QUALIFY, datediff (combined file)
`dates`	to_date → PARSE_DATE, date_format → FORMAT_DATE/EXTRACT (combined file)
`arithmetic`	pmod, NULLIF division, subtraction (combined file)

The JSON output contains:

{
  "feldera_schema": "...",   // translated DDL (CREATE TABLE statements)
  "feldera_query": "...",    // translated query (CREATE VIEW statements)
  "unsupported": [...],      // Spark features with no Feldera equivalent
  "warnings": [...],         // non-fatal notes (compiler repairs, validation result)
  "explanations": [...],     // explanations for translation decisions
  "status": "success|unsupported|error"
}

status:

success — translated, and (with --validate) compiled cleanly.
unsupported — translated, but some constructs have no Feldera equivalent and were emitted as CAST(NULL AS <type>) placeholders (listed in unsupported).
error — the LLM response couldn't be parsed, or (with --validate) the output still failed to compile after the repair attempts. felderize always returns the best-effort SQL in feldera_query (salvaging it from the raw response when the reply wasn't valid JSON), with the compiler errors in warnings — it may not compile, but it always attempts a translation.

Translate your own SQL

Each form below writes the translated, deployable Feldera SQL to a file. The file leads with a comment header recording the translation status and any unsupported constructs / warnings, so it is self-documenting; the status is also printed to stderr.

Separate schema and query files:

felderize spark translate schema.sql query.sql --validate -o out.sql
# → out.sql   (the translated CREATE TABLE + CREATE VIEW)

Single combined file (CREATE TABLE and CREATE VIEW statements in one file):

felderize spark translate-file combined.sql --validate -o out.sql
# → out.sql

Multiple query files against a shared schema (batch — faster than a shell loop):

felderize spark translate-batch schema.sql queries/*.sql --validate --output-dir out/
# → out/<query>_feldera.sql, one per query

Structured JSON instead of a .sql file (for automation — parse with jq):

felderize spark translate schema.sql query.sql --validate --json-output > result.json
# → result.json with feldera_schema, feldera_query, status, unsupported, warnings

translate-batch processes all queries in a single process so doc and example caches stay warm across queries. Omitting -o / --output-dir / --json-output prints the result as readable sections to the terminal instead.

Note: Running without --validate prints a warning — the output SQL has not been verified against the Feldera compiler.

All commands accept:

--validate to validate output against the Feldera compiler (opt-in; example validates by default, use --no-validate to skip)
--compiler PATH to specify the path to the Feldera compiler binary (overrides FELDERA_COMPILER env var)
--model MODEL to specify the LLM model (overrides FELDERIZE_MODEL env var)
--no-docs to disable Feldera SQL reference docs in the prompt
--verbose to log the SQL submitted to the validator at each repair attempt
--json-output to output results as JSON (the structured machine interface)
-o, --output PATH (translate / translate-file) to write the translated schema + views to a deployable .sql file; the status prints to stderr so stdout/the file stay clean. (translate-batch uses --output-dir instead.)

Programmatic API (Python)

To call felderize from your own code instead of shelling out to the CLI, use the single entry point translate_spark_to_feldera:

from felderize import translate_spark_to_feldera, Config, Status

cfg = Config.from_env()          # reads ANTHROPIC_API_KEY, FELDERA_COMPILER, FELDERIZE_MODEL
result = translate_spark_to_feldera(
    schema_sql,                  # Spark CREATE TABLE ... DDL (str)
    query_sql,                   # Spark CREATE VIEW / SELECT ... (str)
    cfg,
    validate=True,               # compile against the Feldera compiler and repair (default: False)
)

if result.status is Status.SUCCESS:
    deploy(result.feldera_schema, result.feldera_query)
else:
    # UNSUPPORTED -> NULL-placeholder views (see result.unsupported);
    # ERROR       -> best-effort SQL that did not compile.
    review(result.unsupported, result.warnings)

TranslationResult exposes feldera_schema, feldera_query, status, unsupported, warnings, explanations, and to_dict(). validate=False skips the compiler (faster, but the output is not verified).

Runnable examples:

.venv/bin/python examples/api_usage.py            # translate one schema + query
.venv/bin/python examples/translate_all_examples.py   # translate all built-in examples + summary

Configuration

Environment variables (set in .env):

Variable	Description	Default
`ANTHROPIC_API_KEY`	Anthropic API key	(required)
`FELDERIZE_MODEL`	LLM model to use (can also be set with `--model`)	(required, set in `.env`)
`FELDERA_COMPILER`	Path to sql-to-dbsp compiler (can also be set with `--compiler`)	(required for validation)
`ANTHROPIC_BASE_URL`	Override Anthropic API base URL (for proxies or alternate endpoints)	(optional)

Customizing translation

You can teach felderize your project-specific patterns by adding rules and examples.

Rules

Rules tell the LLM how to rewrite specific Spark constructs. Each .md file should start with a YAML frontmatter block with a name: and description: field, followed by plain-markdown bullet points:

---
name: my-project-rules
description: Project-specific Spark-to-Feldera rewrites.
---

- **[PROJ-HASH] Internal UDF `my_hash(col)`:** Rewrite as `MD5(CAST(col AS VARCHAR))`.

- **[PROJ-ID] `CUSTOM_ID` columns:** Always map to `BIGINT NOT NULL` in Feldera.

Note: Frontmatter is recommended. Files without it are still loaded but produce a warning.

Place .md files in one of these locations — all are loaded automatically, no flag needed:

Location	Scope
`~/.felderize/rules/`	All your projects (survives pip upgrades)
`.felderize/rules/` in your project dir	This project only (commit to git)

Or pass one or more files explicitly (repeatable):

felderize spark translate schema.sql query.sql --rules rules1.md --rules rules2.md

Examples

Examples are validated Spark → Feldera pairs shown to the LLM alongside the built-in ones. The more precise your examples, the better the translation quality for your specific SQL patterns.

Each .md file must start with a YAML frontmatter block. Use categories: to load the example only when those SQL constructs are detected in the query being translated. Omit categories: (but keep the frontmatter) to always include the example:

---
categories: [datetime]
---

### Example: Monthly revenue

**Spark SQL:**
```sql
SELECT date_trunc('MONTH', ts) AS month, SUM(amount) AS revenue
FROM sales GROUP BY date_trunc('MONTH', ts);
\```

**Feldera SQL:**
```sql
SELECT FLOOR(ts TO MONTH) AS month, SUM(amount) AS revenue
FROM sales GROUP BY FLOOR(ts TO MONTH);
\```

Note: The frontmatter block (---) is required. Files without it are skipped.

Valid categories: aggregates, string, datetime, array, json, map, types.

Place .md files in one of these locations — loaded automatically, no flag needed:

Location	Scope
`~/.felderize/examples/`	All your projects (survives pip upgrades)
`.felderize/examples/` in your project dir	This project only (commit to git)

Or pass individual files or directories explicitly (repeatable, accepts both):

felderize spark translate schema.sql query.sql --examples ex1.md --examples my_examples/

How it works

felderize translates the whole program (schema + all views) in a single LLM call:

Loads translation rules from the skill file (felderize/skills/spark_skills.md).
Trims the schema to the tables the query actually references, then sends the Spark schema + query to the LLM with the rules and validated examples.
Parses the translated Feldera SQL from the response. Constructs with no Feldera equivalent are emitted as CAST(NULL AS <type>) placeholders and listed in unsupported.
With --validate, compiles the output against the Feldera compiler and repairs it using the compiler's error feedback for up to a few attempts. If that first pass still doesn't compile, it retries once more with relevant Feldera documentation added to the prompt (from docs.feldera.com/docs/sql/); use --no-docs to skip the documentation pass.

Support

Contact us at support@feldera.com for assistance with unsupported Spark SQL features.

Name		Name	Last commit message	Last commit date
parent directory ..
examples		examples
felderize		felderize
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Felderize — SQL to Feldera SQL Translator

Setup

Usage

Run a built-in example

Translate your own SQL

Programmatic API (Python)

Configuration

Customizing translation

Rules

Examples

How it works

Support

FilesExpand file tree

felderize

Directory actions

More options

Directory actions

More options

Latest commit

History

felderize

Folders and files

parent directory

README.md

Felderize — SQL to Feldera SQL Translator

Setup

Usage

Run a built-in example

Translate your own SQL

Programmatic API (Python)

Configuration

Customizing translation

Rules

Examples

How it works

Support