felderize translates SQL from various dialects into valid Feldera SQL using LLM-based translation with optional compiler validation.
Dialects: Spark SQL is currently the only supported dialect. Support for additional dialects is planned.
cd python/felderize
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
Note: `pip install -e .` is required before running `felderize`. It registers the package and CLI command.
Download the Feldera SQL compiler JAR (requires Java 19–21 installed):
felderize download-compiler
This fetches the latest sql2dbsp-jar-with-dependencies-*.jar from GitHub Releases and saves it to ~/.felderize/. The command prints the exact path, which you copy into the next step. Re-run it at any time to pick up a newer release; it reports whether you are already on the latest one.
Requirement: felderize needs compiler v0.304.0 or newer, because earlier releases lack SQL features felderize relies on (e.g. `div_null`, `MAKE_DATE`). `download-compiler` always fetches the latest release, and felderize warns at validation time if the configured compiler is older than v0.304.0.
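If you pin a specific JAR rather than always re-downloading, a quick pre-check of the version embedded in the file name can catch an outdated compiler early. The sketch below is illustrative only: it assumes the JAR name ends in `-vX.Y.Z.jar` (as in the `.env` example in this README), and the helper names `jar_version` and `is_new_enough` are hypothetical, not part of felderize. felderize's own validation-time check may work differently.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Minimum compiler version stated in this README.
MIN_COMPILER_VERSION = (0, 304, 0)


def jar_version(jar_path: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) parsed from the JAR file name, or None.

    Assumption: the file name ends in -vX.Y.Z.jar.
    """
    match = re.search(r"-v(\d+)\.(\d+)\.(\d+)\.jar$", Path(jar_path).name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_new_enough(jar_path: str) -> bool:
    """True when the file name carries a version >= MIN_COMPILER_VERSION."""
    version = jar_version(jar_path)
    return version is not None and version >= MIN_COMPILER_VERSION
```

Comparing tuples of integers avoids the string-comparison trap where `"0.99.0"` would sort after `"0.304.0"`.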
Note: Java 19, 20, or 21 must be installed and on your `PATH` before running validation. Check with `java -version`. Later versions (22+) are not supported.
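To check the Java requirement from a script instead of reading `java -version` by eye, something like the following may help. It is a minimal sketch, not part of felderize: the function names are hypothetical, and the parser assumes the usual `version "..."` banner that `java -version` prints (to stderr), including the legacy `1.x` numbering.

```python
import re
import subprocess
from typing import Optional

# Java majors this README lists as supported.
SUPPORTED_JAVA_MAJORS = range(19, 22)  # 19, 20, 21


def java_major(version_output: str) -> Optional[int]:
    """Extract the major version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_281" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Run `java -version` and parse it; None if java is not on PATH."""
    try:
        proc = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    # The banner normally goes to stderr; include stdout to be safe.
    return java_major(proc.stderr + proc.stdout)


def is_supported(major: Optional[int]) -> bool:
    return major is not None and major in SUPPORTED_JAVA_MAJORS
```

Calling `is_supported(installed_java_major())` before a validated run gives a clear yes/no without waiting for the compiler to fail.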
Create a .env file in the python/felderize/ directory:
ANTHROPIC_API_KEY=your-key-here
FELDERA_COMPILER=~/.felderize/sql2dbsp-jar-with-dependencies-vX.Y.Z.jar
FELDERIZE_MODEL=claude-sonnet-4-6
All three variables are required for a validated translation. `FELDERA_COMPILER` is used only for validation: translation still works without it, but the output SQL is not verified. You can also pass `--compiler PATH` and `--model MODEL` per command.
Note: felderize currently requires an Anthropic API key — only Claude models are supported.
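A small pre-flight check on the `.env` contents can surface a missing variable before any LLM call is made. The sketch below is an assumption-laden illustration: `parse_env` is a deliberately minimal parser (no quoting or `export` handling), both helper names are hypothetical, and treating `FELDERA_COMPILER` as needed only when validating reflects this README's description rather than felderize's actual loading code.

```python
from typing import Dict, List

# Grouping based on this README: the compiler path matters only for validation.
ALWAYS_REQUIRED = ("ANTHROPIC_API_KEY", "FELDERIZE_MODEL")
VALIDATION_ONLY = ("FELDERA_COMPILER",)


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_settings(values: Dict[str, str], validate: bool) -> List[str]:
    """List the variables that are absent or empty for the chosen mode."""
    needed = ALWAYS_REQUIRED + (VALIDATION_ONLY if validate else ())
    return [name for name in needed if not values.get(name)]
```

For example, `missing_settings(parse_env(open(".env").read()), validate=True)` returns an empty list when the file is complete.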
Tip: use a large-context-window model. Each request bundles the full rule set (`felderize/skills/spark_skills.md`), validated examples, and (on the retry pass) Feldera reference documentation on top of your schema and query. That is easily tens of thousands of tokens, and the `--validate` repair and docs passes add more. Prefer a model with a large context window (e.g. a recent Claude Sonnet/Opus) so nothing is truncated and the model has the full rules in view; set it via `FELDERIZE_MODEL` or `--model`. If a program still doesn't fit, felderize stops with an `error` asking you to shorten the input (translate fewer views at a time, drop unused tables, or split it into smaller files).
# List available examples
felderize spark example
# Translate an example (validates by default)
felderize spark example simple
# Without compiler validation
felderize spark example simple --no-validate
# Log SQL submitted to the validator at each attempt
felderize spark example json --verbose
# Use a specific compiler binary
felderize spark example simple --compiler /path/to/sql-to-dbsp
# Output as JSON
felderize spark example simple --json-output
Available examples:
| Name | Description |
|---|---|
| `simple` | Date truncation, GROUP BY |
| `strings` | INITCAP, LPAD, NVL, CONCAT_WS |
| `arrays` | array_contains, size, element_at |
| `joins` | Null-safe equality (<=>) |
| `windows` | LAG, running SUM OVER |
| `aggregations` | COUNT DISTINCT, AVG, SUM, HAVING |
| `json` | get_json_object → PARSE_JSON + VARIANT access (combined file) |
| `topk` | ROW_NUMBER TopK, QUALIFY, datediff (combined file) |
| `dates` | to_date → PARSE_DATE, date_format → FORMAT_DATE/EXTRACT (combined file) |
| `arithmetic` | pmod, NULLIF division, subtraction (combined file) |
The JSON output contains:
{
  "feldera_schema": "...", // translated DDL (CREATE TABLE statements)
  "feldera_query": "...", // translated query (CREATE VIEW statements)
  "unsupported": [...], // Spark features with no Feldera equivalent
  "warnings": [...], // non-fatal notes (compiler repairs, validation result)
  "explanations": [...], // explanations for translation decisions
  "status": "success|unsupported|error"
}
`status`:
- `success`: translated, and (with `--validate`) compiled cleanly.
- `unsupported`: translated, but some constructs have no Feldera equivalent and were emitted as `CAST(NULL AS <type>)` placeholders (listed in `unsupported`).
- `error`: the LLM response couldn't be parsed, or (with `--validate`) the output still failed to compile after the repair attempts. felderize always returns the best-effort SQL in `feldera_query` (salvaging it from the raw response when the reply wasn't valid JSON), with the compiler errors in `warnings`. The SQL may not compile, but felderize always attempts a translation.
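For automation, the three `status` values map naturally onto a small dispatch. The following is a sketch under stated assumptions: it reads the `--json-output` result as plain JSON (assuming the real output omits the explanatory `//` comments shown above), it assumes the entries of `unsupported` and `warnings` can be rendered with `str()`, and the `summarize` helper and sample payloads are made up for illustration.

```python
import json


def summarize(result_json: str) -> str:
    """Turn a felderize JSON result into a one-line next step."""
    result = json.loads(result_json)
    status = result["status"]
    if status == "success":
        return "ready to deploy"
    if status == "unsupported":
        # Placeholders were emitted; list what needs manual attention.
        items = ", ".join(str(item) for item in result.get("unsupported", []))
        return "review placeholders: " + items
    if status == "error":
        # Best-effort SQL is still in feldera_query; warnings carry the errors.
        notes = "; ".join(str(note) for note in result.get("warnings", []))
        return "did not compile: " + notes
    raise ValueError("unexpected status: " + repr(status))


if __name__ == "__main__":
    # Illustrative payload only, not real felderize output.
    sample = json.dumps({"status": "unsupported", "unsupported": ["some_udf"]})
    print(summarize(sample))
```

Raising on an unknown status keeps the script from silently treating a future status value as success.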
Each form below writes the translated, deployable Feldera SQL to a file. The file leads with a comment header recording the translation status and any unsupported constructs / warnings, so it is self-documenting; the status is also printed to stderr.
Separate schema and query files:
felderize spark translate schema.sql query.sql --validate -o out.sql
# → out.sql (the translated CREATE TABLE + CREATE VIEW)
Single combined file (CREATE TABLE and CREATE VIEW statements in one file):
felderize spark translate-file combined.sql --validate -o out.sql
# → out.sql
Multiple query files against a shared schema (batch, faster than a shell loop):
felderize spark translate-batch schema.sql queries/*.sql --validate --output-dir out/
# → out/<query>_feldera.sql, one per query
Structured JSON instead of a .sql file (for automation; parse with jq):
felderize spark translate schema.sql query.sql --validate --json-output > result.json
# → result.json with feldera_schema, feldera_query, status, unsupported, warnings
`translate-batch` processes all queries in a single process so doc and example
caches stay warm across queries. Omitting -o / --output-dir / --json-output
prints the result as readable sections to the terminal instead.
Note: Running without `--validate` prints a warning: the output SQL has not been verified against the Feldera compiler.
All commands accept:
- `--validate` to validate output against the Feldera compiler (opt-in; `example` validates by default, use `--no-validate` to skip)
- `--compiler PATH` to specify the path to the Feldera compiler binary (overrides the `FELDERA_COMPILER` env var)
- `--model MODEL` to specify the LLM model (overrides the `FELDERIZE_MODEL` env var)
- `--no-docs` to disable Feldera SQL reference docs in the prompt
- `--verbose` to log the SQL submitted to the validator at each repair attempt
- `--json-output` to output results as JSON (the structured machine interface)
- `-o, --output PATH` (`translate` / `translate-file`) to write the translated schema + views to a deployable `.sql` file; the status prints to stderr so stdout and the file stay clean. (`translate-batch` uses `--output-dir` instead.)
To call felderize from your own code instead of shelling out to the CLI, use the
single entry point translate_spark_to_feldera:
from felderize import translate_spark_to_feldera, Config, Status
cfg = Config.from_env()  # reads ANTHROPIC_API_KEY, FELDERA_COMPILER, FELDERIZE_MODEL
result = translate_spark_to_feldera(
    schema_sql,     # Spark CREATE TABLE ... DDL (str)
    query_sql,      # Spark CREATE VIEW / SELECT ... (str)
    cfg,
    validate=True,  # compile against the Feldera compiler and repair (default: False)
)
if result.status is Status.SUCCESS:
    deploy(result.feldera_schema, result.feldera_query)
else:
    # UNSUPPORTED -> NULL-placeholder views (see result.unsupported);
    # ERROR -> best-effort SQL that did not compile.
    review(result.unsupported, result.warnings)
`TranslationResult` exposes `feldera_schema`, `feldera_query`, `status`,
`unsupported`, `warnings`, `explanations`, and `to_dict()`. `validate=False`
skips the compiler (faster, but the output is not verified).
Runnable examples:
.venv/bin/python examples/api_usage.py # translate one schema + query
.venv/bin/python examples/translate_all_examples.py # translate all built-in examples + summary
Environment variables (set in .env):
| Variable | Description | Default |
|---|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key | (required) |
| `FELDERIZE_MODEL` | LLM model to use (can also be set with `--model`) | (required, set in .env) |
| `FELDERA_COMPILER` | Path to sql-to-dbsp compiler (can also be set with `--compiler`) | (required for validation) |
| `ANTHROPIC_BASE_URL` | Override Anthropic API base URL (for proxies or alternate endpoints) | (optional) |
You can teach felderize your project-specific patterns by adding rules and examples.
Rules tell the LLM how to rewrite specific Spark constructs. Each .md file should start with a YAML frontmatter block with a name: and description: field, followed by plain-markdown bullet points:
---
name: my-project-rules
description: Project-specific Spark-to-Feldera rewrites.
---
- **[PROJ-HASH] Internal UDF `my_hash(col)`:** Rewrite as `MD5(CAST(col AS VARCHAR))`.
- **[PROJ-ID] `CUSTOM_ID` columns:** Always map to `BIGINT NOT NULL` in Feldera.
Note: Frontmatter is recommended. Files without it are still loaded but produce a warning.
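Before dropping a rules file into one of the rule directories, you may want to confirm that its frontmatter carries both recommended fields. The sketch below is a hypothetical stand-alone check, not felderize's loader: it handles only flat `key: value` lines between the two `---` delimiters and does not use a YAML library.

```python
from typing import Dict, List, Optional

RECOMMENDED_FIELDS = ("name", "description")


def read_frontmatter(markdown_text: str) -> Optional[Dict[str, str]]:
    """Return flat key/value pairs from a leading --- block, or None if absent."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return None  # opening delimiter without a closing one


def missing_rule_fields(markdown_text: str) -> List[str]:
    """List recommended frontmatter fields that are absent or empty."""
    fields = read_frontmatter(markdown_text)
    if fields is None:
        return list(RECOMMENDED_FIELDS)
    return [name for name in RECOMMENDED_FIELDS if not fields.get(name)]
```

An empty list from `missing_rule_fields` means the file should load without the missing-frontmatter warning described above.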
Place .md files in one of these locations — all are loaded automatically, no flag needed:
| Location | Scope |
|---|---|
| `~/.felderize/rules/` | All your projects (survives pip upgrades) |
| `.felderize/rules/` in your project dir | This project only (commit to git) |
Or pass one or more files explicitly (repeatable):
felderize spark translate schema.sql query.sql --rules rules1.md --rules rules2.md
Examples are validated Spark → Feldera pairs shown to the LLM alongside the built-in ones. The more precise your examples, the better the translation quality for your specific SQL patterns.
Each .md file must start with a YAML frontmatter block. Use categories: to load the example only when those SQL constructs are detected in the query being translated. Omit categories: (but keep the frontmatter) to always include the example:
---
categories: [datetime]
---
### Example: Monthly revenue
**Spark SQL:**
```sql
SELECT date_trunc('MONTH', ts) AS month, SUM(amount) AS revenue
FROM sales GROUP BY date_trunc('MONTH', ts);
\```
**Feldera SQL:**
```sql
SELECT FLOOR(ts TO MONTH) AS month, SUM(amount) AS revenue
FROM sales GROUP BY FLOOR(ts TO MONTH);
\```
Note: The frontmatter block (`---`) is required. Files without it are skipped.
Valid categories: aggregates, string, datetime, array, json, map, types.
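A typo in `categories:` is easy to miss, so a quick check against the valid list can be useful. This is a hedged sketch rather than felderize's own parsing: it assumes the inline `categories: [a, b]` form shown above, the helper names are hypothetical, and how felderize itself treats an unknown category is not stated in this README.

```python
import re
from typing import List

# The valid categories listed in this README.
VALID_CATEGORIES = {"aggregates", "string", "datetime", "array", "json", "map", "types"}


def example_categories(markdown_text: str) -> List[str]:
    """Return the categories declared in the frontmatter ([] if omitted).

    Raises ValueError when the file has no frontmatter block at all.
    """
    parts = markdown_text.split("---", 2)
    if not markdown_text.startswith("---") or len(parts) < 3:
        raise ValueError("missing frontmatter block")
    match = re.search(r"^categories:\s*\[(.*?)\]\s*$", parts[1], re.MULTILINE)
    if match is None:
        return []
    return [item.strip() for item in match.group(1).split(",") if item.strip()]


def unknown_categories(markdown_text: str) -> List[str]:
    """Return declared categories that are not in VALID_CATEGORIES."""
    return [c for c in example_categories(markdown_text) if c not in VALID_CATEGORIES]
```

An example file with empty frontmatter yields `[]` from `example_categories`, matching the "always include" case.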
Place .md files in one of these locations — loaded automatically, no flag needed:
| Location | Scope |
|---|---|
| `~/.felderize/examples/` | All your projects (survives pip upgrades) |
| `.felderize/examples/` in your project dir | This project only (commit to git) |
Or pass individual files or directories explicitly (repeatable, accepts both):
felderize spark translate schema.sql query.sql --examples ex1.md --examples my_examples/
felderize translates the whole program (schema + all views) in a single LLM call:
- Loads translation rules from the skill file (`felderize/skills/spark_skills.md`).
- Trims the schema to the tables the query actually references, then sends the Spark schema + query to the LLM with the rules and validated examples.
- Parses the translated Feldera SQL from the response. Constructs with no Feldera equivalent are emitted as `CAST(NULL AS <type>)` placeholders and listed in `unsupported`.
- With `--validate`, compiles the output against the Feldera compiler and repairs it using the compiler's error feedback for up to a few attempts. If that first pass still doesn't compile, it retries once more with relevant Feldera documentation added to the prompt (from `docs.feldera.com/docs/sql/`); use `--no-docs` to skip the documentation pass.
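The validate-and-repair step above can be pictured as a bounded loop that feeds compiler errors back into the next translation attempt. The following is a conceptual sketch only, not felderize's implementation: `translate` and `compile_sql` are hypothetical callables standing in for the LLM call and the compiler, `max_attempts=3` is an assumed value (the README says only "a few attempts"), and the documentation pass is modeled as a single extra attempt.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures for the two pluggable steps:
#   translate(compiler_feedback, with_docs) -> candidate Feldera SQL
#   compile_sql(sql) -> list of compiler errors (empty list means it compiled)
Translate = Callable[[Optional[str], bool], str]
Compile = Callable[[str], List[str]]


def translate_with_repair(
    translate: Translate,
    compile_sql: Compile,
    max_attempts: int = 3,  # assumed value, not taken from felderize
    use_docs: bool = True,  # False mirrors --no-docs
) -> Tuple[str, List[str]]:
    """Return (sql, errors); errors is empty when the SQL compiled."""
    feedback: Optional[str] = None
    sql, errors = "", []
    # First pass: repair using compiler feedback only.
    for _ in range(max_attempts):
        sql = translate(feedback, False)
        errors = compile_sql(sql)
        if not errors:
            return sql, errors
        feedback = "\n".join(errors)
    # Docs pass: one more try with reference documentation in the prompt.
    if use_docs:
        sql = translate(feedback, True)
        errors = compile_sql(sql)
    return sql, errors
```

Returning the last SQL together with its errors matches the behavior described for the `error` status, where best-effort SQL is kept even when it does not compile.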
Contact us at support@feldera.com for assistance with unsupported Spark SQL features.