Filter
Filter keeps rows that match a boolean expression. Configuration:
- Expression: A predicate evaluated per row (for example, `order_status = 'paid' and order_total > 0`).
- Null handling: Decide how `NULL` comparisons behave; explicit `IS NULL` checks avoid surprises.
Example: keep rows where `event_name` is `purchase_completed` and `currency` is `USD`.
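The filter above can be sketched in pandas. This is an illustrative sketch, not the tool's own expression engine; the column names follow the example, and the null-handling note is shown explicitly, since pandas comparisons against missing values evaluate to `False` and silently drop those rows.

```python
import pandas as pd

df = pd.DataFrame({
    "event_name": ["purchase_completed", "page_view", "purchase_completed", None],
    "currency": ["USD", "USD", "EUR", "USD"],
})

# Boolean predicate evaluated per row; comparisons against None/NaN are False,
# so rows with a missing event_name are dropped without warning.
mask = (df["event_name"] == "purchase_completed") & (df["currency"] == "USD")
filtered = df[mask]

# An explicit null check (the IS NULL equivalent) keeps NULL rows on purpose:
with_nulls = df[mask | df["event_name"].isna()]
```

The explicit `isna()` branch is the pandas analogue of an `IS NULL` check: it makes the treatment of missing values a visible decision rather than a side effect.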
Sample
Sample takes a random or stratified subset of rows for exploration, testing, or cost control. Configuration:
- Sample rate or row limit: Fixed fraction (`10%`) or max rows (`50_000`).
- Seed (when available): Reproducible samples for tests.
- Stratification keys (when available): Preserve rare segment representation.
Example: sample 5% of transactions stratified by country so evaluation metrics are not dominated by one region.
When to use: Development pipelines, QA harnesses, or staged rollouts where full scans are unnecessary.
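A seeded and a stratified sample can be sketched in pandas. This is an illustrative sketch under assumed column names (`country`, `amount`); stratifying via `groupby(...).sample(...)` draws the fraction within each group so rare segments keep proportional representation.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US"] * 900 + ["NZ"] * 100,
    "amount": range(1000),
})

# Plain 5% random sample; the fixed seed makes it reproducible in tests.
plain = df.sample(frac=0.05, random_state=42)

# Stratified by country: 5% drawn within each group, so the rare NZ
# segment contributes rows instead of being swamped by the US majority.
stratified = df.groupby("country", group_keys=False).sample(frac=0.05, random_state=42)
```

Without stratification, a small sample of heavily skewed data can contain zero rows from a minority segment; the per-group draw rules that out.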
Sort
Sort orders rows by one or more keys. Configuration:
- Sort keys: Columns with ascending or descending order.
- Nulls first/last: Explicit placement prevents flaky joins or window partitions.
- Stability: Pair sort with a tie-breaker column (such as `event_id`) when duplicates on sort keys exist.
Example: when a downstream merge is keyed on `(customer_id, effective_date)`, sort by those columns so merge operators observe deterministic runs and logs are easier to diff.
When to use: Before nodes that assume order (some window setups, certain file writers), or when breaking ties for deduplication.
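The three configuration points above (keys, null placement, and a tie-breaker for stability) can be sketched in pandas; the column names are the ones from the example, and the stable `mergesort` kind plus the `event_id` tie-breaker together guarantee one deterministic order even when sort keys collide.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [2, 1, 1, 2],
    "effective_date": ["2024-02-01", "2024-01-15", None, "2024-01-03"],
    "event_id": ["e4", "e2", "e1", "e3"],
})

# Multi-key sort with an explicit tie-breaker (event_id) for determinism;
# na_position pins NULL dates last instead of leaving placement to the engine.
ordered = df.sort_values(
    ["customer_id", "effective_date", "event_id"],
    na_position="last",
    kind="mergesort",  # stable sort, preserves input order on full ties
)
```

Pinning nulls and breaking ties explicitly is what makes downstream merge or window nodes reproducible run to run.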
Unique (deduplicate)
Unique removes duplicate rows—either full-row duplicates or duplicates by a key subset. Configuration:
- Key columns: Deduplicate on `user_id` while keeping the first or last row per key according to sort order.
- Keep policy: First vs. last requires an upstream Sort when order matters.
- Hash vs. key: Full-row unique is conceptually simple; key-based unique matches business keys.
Example: deduplicate events on `event_id`: sort by `ingest_timestamp` descending, then Unique on `event_id` keeping the first row to retain the latest version.
When to use: After merges of streams, before cardinality-sensitive aggregations, or prior to loads that enforce primary keys.
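The sort-then-dedupe example can be sketched in pandas. This is an illustrative sketch using the columns from the example plus an assumed `payload` column to show which version survives; sorting newest-first and keeping the first row per key retains the latest version of each event.

```python
import pandas as pd

df = pd.DataFrame({
    "event_id": ["a", "b", "a", "b"],
    "ingest_timestamp": [1, 4, 3, 2],
    "payload": ["v1", "v4", "v3", "v2"],
})

# Newest-first sort, then keep the first row per key: latest version wins.
latest = (
    df.sort_values("ingest_timestamp", ascending=False)
      .drop_duplicates(subset="event_id", keep="first")
      .sort_values("event_id")
)
```

Note that `keep="first"` is only meaningful because of the upstream sort; without it, "first" is whatever order the rows happened to arrive in.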
Explode
Explode converts array elements into separate rows, duplicating non-array columns for each element. Configuration:
- Explode column: The array column to expand (e.g. `tags`, `line_items`).
- Keep original: Optionally retain the original array column alongside the exploded rows.
Example: `tags: ["a", "b", "c"]` produces three output rows, one per tag, with all other columns duplicated.
When to use: After parsing JSON arrays from REST APIs or event payloads, before aggregating per-element.
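Both configuration options can be sketched with pandas' `explode`. This is an illustrative sketch with an assumed `order_id` column; the "keep original" variant explodes a copy of the array column so the original list survives next to each element. Note that an empty array yields a single row with a missing element rather than disappearing.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2],
    "tags": [["a", "b", "c"], []],
})

# One output row per array element; non-array columns are duplicated.
# The empty list becomes a single row with a NaN element, not zero rows.
exploded = df.explode("tags")

# Keep-original variant: explode a copy so the full array stays alongside.
kept = df.assign(tag=df["tags"]).explode("tag")
```

The empty-array behavior is worth checking against your pipeline's semantics: if empty arrays should produce no rows, filter out the null elements after exploding.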
Z-Order Sort (professional+)
Z-Order Sort reorders rows using a space-filling curve that preserves multi-dimensional data locality. Query engines that read the output files can skip large ranges when filtering on any combination of the sorted columns. Configuration:
- Columns: Two or more columns to include in the z-order sort.
Example: for tables frequently filtered on `region` and `order_date`, z-ordering on both columns keeps related data co-located so queries skip 70–85% of files.
When to use: Before writing to Iceberg or Delta Lake destinations, especially when downstream queries filter on multiple columns simultaneously. See the full Z-Order Sort reference for column selection tips and performance impact.
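The node's actual implementation is internal, but the underlying idea can be illustrated with a classic Morton (z-order) key: interleave the bits of the (integer-encoded) sort columns, then sort by that key. Rows close in both dimensions get nearby keys, so they land in the same output files. This is a conceptual sketch with hypothetical integer encodings for the two columns, not the product's algorithm.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (z-order) key: interleave the bits of two non-negative ints."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y at odd positions
    return z

# Hypothetical (region_code, days_since_epoch) pairs: sorting by the
# interleaved key co-locates rows that are close in both dimensions.
rows = [(3, 10), (3, 11), (7, 200), (3, 12)]
rows.sort(key=lambda r: interleave_bits(r[0], r[1]))
```

A plain lexicographic sort clusters only on the leading column; the interleaved key clusters on both, which is why engines can prune files when a query filters on either column alone.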
Patterns on the canvas
Shrink early
Place Filter and column projection (where available) close to the source to save compute on joins.
Sort before deterministic dedupe
When duplicate resolution depends on time or version columns, Sort then Unique.
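Both canvas patterns can be combined in one short pandas sketch (illustrative, with assumed columns `user_id`, `version`, `status`, and a wide `blob` column standing in for expensive payload data): filter and project first, then sort before deduplicating so the keep policy is deterministic.

```python
import pandas as pd

raw = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "version": [2, 1, 1, 1],
    "status": ["active", "active", "inactive", "active"],
    "blob": ["x"] * 4,  # stand-in for wide payload columns
})

# Shrink early: filter rows and project columns before anything expensive.
slim = raw.loc[raw["status"] == "active", ["user_id", "version"]]

# Sort before deterministic dedupe: highest version wins per user.
deduped = slim.sort_values("version", ascending=False).drop_duplicates("user_id")
```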
Related nodes
Column transforms
Change types, derive fields, and apply window logic.
Aggregation
Roll up after you have narrowed and cleaned rows.
Z-Order Sort reference
Full guide to multi-dimensional sort configuration and performance.
Explode reference
Detailed explode configuration and patterns.