Custom SQL
Custom SQL executes a SQL query against your data stream using PostgreSQL-compatible syntax powered by Apache DataFusion. The upstream dataset is available as a table called `input`.
SQL dialect: PostgreSQL-compatible — CTEs, window functions, JOINs, aggregations, subqueries, CASE expressions, FILTER clauses, UNION ALL, INTERSECT, and EXCEPT are all supported.
Configuration:
- Execution engine: Choose where the query runs:
  - Local (DataFusion) — default, free, runs in Planasonix infrastructure. Best for most queries.
  - Warehouse — pushes SQL to your data warehouse (Snowflake, BigQuery, etc.). Best when joining pipeline data with warehouse tables.
  - Spark Connect — distributed compute for very large datasets.
- SQL query: Write your query referencing data as `FROM input`.
- AI assistance: Use the AI prompt box to generate SQL from natural language, or “Check with AI” to validate syntax.
| Category | Functions |
|---|---|
| Aggregate | COUNT, SUM, AVG, MIN, MAX, STRING_AGG, PERCENTILE_CONT, STDDEV |
| String | UPPER, LOWER, TRIM, SUBSTRING, REPLACE, CONCAT, SPLIT_PART, LENGTH |
| Date/Time | DATE_TRUNC, DATE_PART, EXTRACT, TO_CHAR, NOW(), CURRENT_DATE, INTERVAL |
| Window | ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, FIRST_VALUE, LAST_VALUE |
| Type | CAST, TRY_CAST, COALESCE, NULLIF |
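As an illustration, a query in this dialect combining a CTE with a window function might look like the following (the columns `region`, `amount`, and `order_date` are hypothetical, not part of any real schema):

```sql
-- Keep each region's three largest orders, bucketed by month
WITH ranked AS (
  SELECT
    region,
    amount,
    DATE_TRUNC('month', order_date) AS order_month,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
  FROM input
)
SELECT region, order_month, amount
FROM ranked
WHERE rn <= 3;
```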
Python / R / Scala scripts
Script nodes execute Python, R, or Scala code with controlled dependencies and resource limits. Configuration:
- Runtime image / packages: Approved libraries only in regulated orgs.
- Inputs / outputs: DataFrames or paths passed by the platform.
- Memory and timeouts: Set conservative defaults; iterate in notebooks first.
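As a sketch of what a script node's code can look like (the entry-point name `transform` and the row-dict input/output shape are assumptions for illustration; the real signature depends on how the platform passes inputs):

```python
def transform(rows):
    """Hypothetical script-node entry point: receives upstream rows
    as a list of dicts and returns the transformed rows."""
    out = []
    for row in rows:
        cleaned = dict(row)
        # Example logic: normalize a text field and derive a flag
        cleaned["name"] = str(row.get("name", "")).strip().lower()
        cleaned["is_large"] = row.get("amount", 0) > 1000
        out.append(cleaned)
    return out
```

Keeping the logic in a pure function like this makes it easy to iterate in a notebook before wiring it into the pipeline.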
Notebook
Notebook integrates Jupyter-style notebooks for exploratory work and, where supported, scheduled execution of parameterized runs. Configuration:
- Notebook artifact: Checked-in `.ipynb` or workspace-managed file.
- Parameters: Map pipeline variables to notebook parameters.
- Output capture: Logs, metrics, and written tables for audit.
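Parameter mapping typically follows the papermill-style convention of a defaults cell overridden at run time (whether this platform uses exactly that convention is an assumption):

```python
# Defaults declared in the notebook's "parameters" cell;
# the scheduler merges run-time overrides on top.
defaults = {"run_date": "2024-01-01", "sample_fraction": 1.0}
overrides = {"run_date": "2024-06-30"}  # injected per scheduled run
params = {**defaults, **overrides}
```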
Custom UDF
Custom UDF registers a user-defined function callable from SQL or transform expressions. Configuration:
- Language / serialization: JVM, Python, or engine-specific UDF hooks.
- Determinism: Flag non-deterministic UDFs honestly; optimizers may cache, reorder, or re-execute calls based on this flag.
- Security: Sandboxed execution per admin policy.
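For illustration, a Python UDF is just a scalar function plus metadata; the `register_udf` call shown in the comment is hypothetical, since the real registration hook is engine-specific:

```python
def mask_email(email: str) -> str:
    """Deterministic scalar UDF: keep the first character and the
    domain, mask the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        return email  # not an email; pass through unchanged
    return local[:1] + "***@" + domain

# Hypothetical registration call; the actual API is engine-specific:
# register_udf("mask_email", mask_email, deterministic=True)
```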
LLM Transform (enterprise)
LLM Transform calls a managed large language model through your organization’s approved provider connection. Configuration:
- Prompt template: Bind columns into prompts; version prompts like code.
- Model and parameters: Temperature, max tokens, safety filters.
- Cost controls: Caps per run; sampling for development.
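To make "binding columns into prompts" concrete, here is a minimal sketch; the template text and the column names `product_name` and `review_text` are hypothetical:

```python
# Hypothetical prompt template; placeholders are filled from
# columns of the upstream dataset, one prompt per row.
TEMPLATE = (
    "Classify the sentiment of this review of {product_name} "
    "as positive, negative, or neutral:\n\n{review_text}"
)

def render_prompt(row: dict) -> str:
    return TEMPLATE.format(**row)
```

Keeping the template in version control alongside the pipeline makes prompt changes reviewable like any other code change.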
Embedding Generator (professional+)
Embedding Generator produces vector embeddings from text columns using a connected AI/LLM provider. Configuration:
- Source column: Text column to generate embeddings from.
- Output column: Name for the resulting embedding vector column.
- Model: Embedding model (e.g. OpenAI `text-embedding-3-small`).
- Dimensions: Vector dimensionality (model-dependent).
- API endpoint: Provider endpoint (auto-filled from connection).
| Provider | Model | Dimensions |
|---|---|---|
| OpenAI | text-embedding-3-small | 1,536 |
| OpenAI | text-embedding-3-large | 3,072 |
| Google | text-embedding-004 | 768 |
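Downstream, embedding columns are most often compared by cosine similarity; a minimal pure-Python illustration (the vectors below are toy values, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Real embeddings have 768+ dimensions; 3 is used here for clarity.
```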
Warehouse SQL (professional+)
Warehouse SQL pushes work natively into the warehouse optimizer—minimal data movement through Planasonix compute. Configuration:
- Warehouse connection: Role, warehouse size, and warehouse-specific settings.
- Query materialization: Temp tables vs direct insert/select based on product behavior.
Spark (premium)
Spark nodes execute Apache Spark jobs (SQL, DataFrames, or JARs) on your attached cluster. Configuration:
- Cluster / job mode: YARN, Kubernetes, or serverless—per deployment.
- Resource profile: Executors, cores, shuffle tuning for skewed keys.
Streaming (premium)
Streaming nodes define continuous processing (windows, watermarks, stateful operators) on stream-backed sources. Configuration:
- Checkpointing: Exactly-once or at-least-once semantics, as offered.
- Output sinks: Streams, tables, or micro-batch handoff to batch pipelines.
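To make "windows" concrete, here is a pure-Python sketch of a tumbling-window count; real streaming engines compute this incrementally with watermarks, while this toy version batches for clarity:

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per tumbling window.

    `events` is an iterable of (timestamp_seconds, payload) pairs;
    each event falls into exactly one fixed-size window, keyed by
    the window's start time.
    """
    counts = Counter()
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```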
Schema Evolution (premium)
Schema Evolution manages compatible schema changes in lake or warehouse tables—add columns, widen types, or evolve nested fields according to rules you set. Configuration:
- Compatibility mode: Backward vs forward vs full.
- Default values for new columns.
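A typical backward-compatible change adds a defaulted column, for example (the table and column names here are hypothetical):

```sql
-- Adding a column with a default is backward compatible:
-- old readers ignore it, and existing rows take the default.
ALTER TABLE orders ADD COLUMN discount_pct DOUBLE DEFAULT 0.0;
```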
ML Integration (premium)
ML Integration connects training or inference steps—feature stores, model registries, batch scoring—to the graph. Configuration:
- Model version: Pin versions for reproducibility.
- Batch vs online: Match latency expectations.
Geospatial Operations
Geospatial nodes compute distances, buffers, intersections, and point-in-polygon joins using spatial indexes when available. Configuration:
- CRS: Source and target coordinate reference systems.
- Index hints: Use spatial partitions for large point datasets.
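As an example of the distance computation such a node performs, here is a pure-Python haversine on latitude/longitude degrees; the 6371 km mean Earth radius is an approximation, and production nodes would account for the configured CRS:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon)
    points given in degrees, using the haversine formula."""
    r = 6371.0  # mean Earth radius in km (approximation)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```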
Premium and edition labels reflect typical packaging; your tenant may differ. Confirm entitlements with your administrator before relying on a node in production design.
Operational practices
Test in isolation
Wrap new code nodes with Sample upstream and Validation downstream until outputs stabilize.
Document side effects
If a script writes auxiliary files or calls external APIs, note it in the pipeline description and use control flow error handling.
Prefer warehouse pushdown
When Warehouse SQL can express the logic, use it before pulling large datasets into a script node.
Related topics
Notebooks
Author and share analytical work.
Streaming and CDC
End-to-end streaming patterns.
Embedding Generator reference
Full guide to embedding models, providers, and patterns.