Custom SQL
Custom SQL executes a SQL query against your data stream using PostgreSQL-compatible syntax powered by Apache DataFusion. The upstream dataset is available as a table called `input`.
SQL dialect: PostgreSQL-compatible — CTEs, window functions, JOINs, aggregations, subqueries, CASE expressions, FILTER clauses, UNION ALL, INTERSECT, and EXCEPT are all supported.
Configuration:
- Execution engine: Choose where the query runs:
  - Local (DataFusion) — default, free, runs in Planasonix infrastructure. Best for most queries.
  - Warehouse — pushes SQL to your data warehouse (Snowflake, BigQuery, etc.). Best when joining pipeline data with warehouse tables.
  - Spark Connect — distributed compute for very large datasets.
- SQL query: Write your query referencing data as `FROM input`.
- AI assistance: Use the AI prompt box to generate SQL from natural language, or “Check with AI” to validate syntax.
| Category | Functions |
|---|---|
| Aggregate | COUNT, SUM, AVG, MIN, MAX, STRING_AGG, PERCENTILE_CONT, STDDEV |
| String | UPPER, LOWER, TRIM, SUBSTRING, REPLACE, CONCAT, SPLIT_PART, LENGTH |
| Date/Time | DATE_TRUNC, DATE_PART, EXTRACT, TO_CHAR, NOW(), CURRENT_DATE, INTERVAL |
| Window | ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, FIRST_VALUE, LAST_VALUE |
| Type | CAST, TRY_CAST, COALESCE, NULLIF |
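As an illustration, a query in this dialect combining a CTE with a window function might look like the following (the columns `region`, `amount`, and `order_date` are hypothetical, not part of any real schema):

```sql
-- Keep each region's three largest orders, bucketed by month
WITH ranked AS (
  SELECT
    region,
    amount,
    DATE_TRUNC('month', order_date) AS order_month,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
  FROM input
)
SELECT region, order_month, amount
FROM ranked
WHERE rn <= 3;
```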
Python / R / Scala scripts
Script nodes execute Python, R, or Scala code with controlled dependencies and resource limits. Configuration:
- Runtime image / packages: Approved libraries only in regulated orgs.
- Inputs / outputs: DataFrames or paths passed by the platform.
- Memory and timeouts: Set conservative defaults; iterate in notebooks first.
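As a sketch of what a script node's code can look like (the entry-point name `transform` and the row-dict input/output shape are assumptions for illustration; the real signature depends on how the platform passes inputs):

```python
def transform(rows):
    """Hypothetical script-node entry point: receives upstream rows
    as a list of dicts and returns the transformed rows."""
    out = []
    for row in rows:
        cleaned = dict(row)
        # Example logic: normalize a text field and derive a flag
        cleaned["name"] = str(row.get("name", "")).strip().lower()
        cleaned["is_large"] = row.get("amount", 0) > 1000
        out.append(cleaned)
    return out
```

Keeping the logic in a pure function like this makes it easy to iterate in a notebook before wiring it into the pipeline.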
Notebook
Notebook integrates Jupyter-style notebooks for exploratory work and, where supported, scheduled execution of parameterized runs. Configuration:
- Notebook artifact: Checked-in `.ipynb` or workspace-managed file.
- Parameters: Map pipeline variables to notebook parameters.
- Output capture: Logs, metrics, and written tables for audit.
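Parameter mapping typically follows the papermill-style convention of a defaults cell overridden at run time (whether this platform uses exactly that convention is an assumption):

```python
# Defaults declared in the notebook's "parameters" cell;
# the scheduler merges run-time overrides on top.
defaults = {"run_date": "2024-01-01", "sample_fraction": 1.0}
overrides = {"run_date": "2024-06-30"}  # injected per scheduled run
params = {**defaults, **overrides}
```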
Custom UDF
Custom UDF registers a user-defined function callable from SQL or transform expressions. Configuration:
- Language / serialization: JVM, Python, or engine-specific UDF hooks.
- Determinism: Flag non-deterministic UDFs honestly; optimizers may cache, reorder, or re-execute calls based on this flag.
- Security: Sandboxed execution per admin policy.
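For illustration, a Python UDF is just a scalar function plus metadata; the `register_udf` call shown in the comment is hypothetical, since the real registration hook is engine-specific:

```python
def mask_email(email: str) -> str:
    """Deterministic scalar UDF: keep the first character and the
    domain, mask the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        return email  # not an email; pass through unchanged
    return local[:1] + "***@" + domain

# Hypothetical registration call; the actual API is engine-specific:
# register_udf("mask_email", mask_email, deterministic=True)
```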
LLM Transform (enterprise)
LLM Transform calls a managed large language model through your organization’s approved provider connection. Configuration:
- Prompt template: Bind columns into prompts; version prompts like code.
- Model and parameters: Temperature, max tokens, safety filters.
- Cost controls: Caps per run; sampling for development.
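To make "binding columns into prompts" concrete, here is a minimal sketch; the template text and the column names `product_name` and `review_text` are hypothetical:

```python
# Hypothetical prompt template; placeholders are filled from
# columns of the upstream dataset, one prompt per row.
TEMPLATE = (
    "Classify the sentiment of this review of {product_name} "
    "as positive, negative, or neutral:\n\n{review_text}"
)

def render_prompt(row: dict) -> str:
    return TEMPLATE.format(**row)
```

Keeping the template in version control alongside the pipeline makes prompt changes reviewable like any other code change.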
Embedding Generator (professional+)
Embedding Generator produces vector embeddings from text columns using a connected AI/LLM provider. Configuration:
- Source column: Text column to generate embeddings from.
- Output column: Name for the resulting embedding vector column.
- Model: Embedding model (e.g. OpenAI `text-embedding-3-small`).
- Dimensions: Vector dimensionality (model-dependent).
- API endpoint: Provider endpoint (auto-filled from connection).
| Provider | Model | Dimensions |
|---|---|---|
| OpenAI | text-embedding-3-small | 1,536 |
| OpenAI | text-embedding-3-large | 3,072 |
| Google | text-embedding-004 | 768 |
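Downstream, embedding columns are most often compared by cosine similarity; a minimal pure-Python illustration (the vectors below are toy values, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Real embeddings have 768+ dimensions; 3 is used here for clarity.
```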
Warehouse SQL (professional+)
Warehouse SQL pushes work natively into the warehouse optimizer—minimal data movement through Planasonix compute. Configuration:
- Warehouse connection: Role, warehouse size, and warehouse-specific settings.
- Query materialization: Temp tables vs direct insert/select based on product behavior.
Spark (premium)
Spark nodes execute Apache Spark jobs (SQL, DataFrames, or JARs) on your attached cluster. Configuration:
- Cluster / job mode: YARN, Kubernetes, or serverless—per deployment.
- Resource profile: Executors, cores, shuffle tuning for skewed keys.
Streaming (premium)
Streaming nodes define continuous processing (windows, watermarks, stateful operators) on stream-backed sources. Configuration:
- Checkpointing: Exactly-once or at-least-once semantics, as offered.
- Output sinks: Streams, tables, or micro-batch handoff to batch pipelines.
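To make "windows" concrete, here is a pure-Python sketch of a tumbling-window count; real streaming engines compute this incrementally with watermarks, while this toy version batches for clarity:

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per tumbling window.

    `events` is an iterable of (timestamp_seconds, payload) pairs;
    each event falls into exactly one fixed-size window, keyed by
    the window's start time.
    """
    counts = Counter()
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```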
Schema Evolution (premium)
Schema Evolution manages compatible schema changes in lake or warehouse tables—add columns, widen types, or evolve nested fields according to rules you set. Configuration:
- Compatibility mode: Backward vs forward vs full.
- Default values for new columns.
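A typical backward-compatible change adds a defaulted column, for example (the table and column names here are hypothetical):

```sql
-- Adding a column with a default is backward compatible:
-- old readers ignore it, and existing rows take the default.
ALTER TABLE orders ADD COLUMN discount_pct DOUBLE DEFAULT 0.0;
```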
ML Integration (premium)
ML Integration connects training or inference steps—feature stores, model registries, batch scoring—to the graph. Configuration:
- Model version: Pin versions for reproducibility.
- Batch vs online: Match latency expectations.
Geospatial Operations
Geospatial nodes compute distances, buffers, intersections, and point-in-polygon joins using spatial indexes when available. Configuration:
- CRS: Source and target coordinate reference systems.
- Index hints: Use spatial partitions for large point datasets.
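As an example of the distance computation such a node performs, here is a pure-Python haversine on latitude/longitude degrees; the 6371 km mean Earth radius is an approximation, and production nodes would account for the configured CRS:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon)
    points given in degrees, using the haversine formula."""
    r = 6371.0  # mean Earth radius in km (approximation)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```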
Premium and edition labels reflect typical packaging; your tenant may differ. Confirm entitlements with your administrator before relying on a node in production design.
Operational practices
Test in isolation
Wrap new code nodes with Sample upstream and Validation downstream until outputs stabilize.
Document side effects
If a script writes auxiliary files or calls external APIs, note it in the pipeline description and use control flow error handling.
Prefer warehouse pushdown
When Warehouse SQL can express the logic, use it before pulling large datasets into a script node.
Related topics
Notebooks
Author and share analytical work.
Streaming and CDC
End-to-end streaming patterns.
Embedding Generator reference
Full guide to embedding models, providers, and patterns.