Row transforms operate on records without changing the fundamental column layout (except where a node explicitly adds helper columns). You use them to cut volume, impose ordering for deterministic processing, probe datasets, and remove duplicates.

Filter

Filter keeps rows that match a boolean expression. Configuration:
  • Expression: A predicate evaluated per row (for example, order_status = 'paid' and order_total > 0).
  • Null handling: Decide how NULL comparisons behave; explicit IS NULL checks avoid surprises.
Example: You ingest clickstream events but only want rows where event_name is purchase_completed and currency is USD:
event_name = 'purchase_completed' AND currency = 'USD'
(Exact expression syntax follows your workspace dialect; use the in-app helper and preview.) When to use: Early in the graph to reduce bytes moved through expensive joins.
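The filter semantics can be sketched in plain Python (illustrative only; the node itself is configured with your workspace's expression dialect, and the column names here are hypothetical). Note the explicit handling of missing values, mirroring IS NULL guards:

```python
rows = [
    {"event_name": "purchase_completed", "currency": "USD"},
    {"event_name": "page_view", "currency": "USD"},
    {"event_name": "purchase_completed", "currency": None},  # NULL currency
]

def keep(row):
    # Explicit None checks avoid surprises: the NULL-currency row is
    # dropped deliberately, not by accident of comparison semantics.
    return (row.get("event_name") == "purchase_completed"
            and row.get("currency") == "USD")

filtered = [r for r in rows if keep(r)]
print(filtered)  # only the first row survives
```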

Sample

Sample takes a random or stratified subset of rows for exploration, testing, or cost control. Configuration:
  • Sample rate or row limit: Fixed fraction (for example, 10%) or a maximum row count (for example, 50,000).
  • Seed (when available): Reproducible samples for tests.
  • Stratification keys (when available): Preserve rare segment representation.
Example: Before training a fraud model downstream, sample 5% of transactions stratified by country so evaluation metrics are not dominated by one region. When to use: Development pipelines, QA harnesses, or staged rollouts where full scans are unnecessary.
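A minimal Python sketch of seeded and stratified sampling (the record layout and country field are hypothetical; the node handles this for you):

```python
import random
from collections import defaultdict

transactions = [{"id": i, "country": "US" if i % 4 else "DE"} for i in range(1000)]

# A fixed seed makes the sample reproducible across runs, which matters
# for tests and QA sign-off.
rng = random.Random(42)
sample = rng.sample(transactions, k=len(transactions) // 20)  # ~5%

# Stratified variant: sample within each country separately so a rare
# segment keeps at least some representation.
by_country = defaultdict(list)
for t in transactions:
    by_country[t["country"]].append(t)
stratified = [t for group in by_country.values()
              for t in random.Random(42).sample(group, k=max(1, len(group) // 20))]
```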

Sort

Sort orders rows by one or more keys. Configuration:
  • Sort keys: Columns with ascending or descending order.
  • Nulls first/last: Explicit NULL placement prevents flaky behavior in downstream joins or window partitions.
  • Stability: Pair sort with a tie-breaker column (such as event_id) when duplicates on sort keys exist.
Example: You batch-update a warehouse table keyed by (customer_id, effective_date); sort by those columns so merge operations run deterministically and log diffs are easier to read. When to use: Before nodes that assume order (some window setups, certain file writers), or when breaking ties for deduplication.
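The three configuration points above can be sketched together in Python (column names are illustrative). The composite key handles NULLS LAST placement and uses event_id as the stability tie-breaker:

```python
rows = [
    {"customer_id": 2, "effective_date": "2024-01-01", "event_id": 7},
    {"customer_id": 1, "effective_date": None, "event_id": 3},
    {"customer_id": 1, "effective_date": "2024-02-01", "event_id": 5},
]

def sort_key(r):
    d = r["effective_date"]
    # (is_null, value) pushes NULL dates last within each customer;
    # event_id breaks ties so repeated runs produce identical output.
    return (r["customer_id"], d is None, d or "", r["event_id"])

ordered = sorted(rows, key=sort_key)
print([r["event_id"] for r in ordered])  # [5, 3, 7]
```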

Unique (deduplicate)

Unique removes duplicate rows—either full-row duplicates or duplicates by a key subset. Configuration:
  • Key columns: Deduplicate on user_id while keeping the first or last row per key according to sort order.
  • Keep policy: First vs last requires an upstream Sort when order matters.
  • Hash vs key: Full-row uniqueness is simple to reason about; key-based uniqueness matches business keys.
Example: CDC delivers occasional at-least-once duplicates on event_id. Sort by ingest_timestamp descending, then Unique on event_id keeping the first row to retain the latest version. When to use: After merges of streams, before cardinality-sensitive aggregations, or prior to loads that enforce primary keys.
If you need “best record wins” logic beyond first/last (for example, prefer non-null email), use a Window or Summarize pattern upstream of Unique, or move the rule to Custom SQL for clarity.
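The Sort-then-Unique pattern from the CDC example, sketched in Python (field names are illustrative): sort newest-first on ingest_timestamp, then keep the first occurrence of each event_id.

```python
events = [
    {"event_id": "a", "ingest_timestamp": 1, "status": "pending"},
    {"event_id": "a", "ingest_timestamp": 3, "status": "paid"},
    {"event_id": "b", "ingest_timestamp": 2, "status": "shipped"},
]

# Newest row first, so "keep first" retains the latest version per key.
events.sort(key=lambda e: e["ingest_timestamp"], reverse=True)

seen, deduped = set(), []
for e in events:
    if e["event_id"] not in seen:
        seen.add(e["event_id"])
        deduped.append(e)

print([e["status"] for e in deduped])  # ['paid', 'shipped']
```

Without the upstream sort, "keep first" would depend on arrival order, which is exactly the nondeterminism the pattern avoids.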

Explode

Explode converts array elements into separate rows, duplicating non-array columns for each element. Configuration:
  • Explode column: The array column to expand (e.g. tags, line_items).
  • Keep original: Optionally retain the original array column alongside the exploded rows.
Example: A row with tags: ["a", "b", "c"] produces three output rows, one per tag, with all other columns duplicated. When to use: After parsing JSON arrays from REST APIs or event payloads, before aggregating per element.
Explode multiplies row count: one input row with an N-element array produces N output rows. Place Filter before Explode when possible to limit the expansion.
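Explode's row-multiplication can be sketched in a few lines of Python (the order_id and tags columns are hypothetical):

```python
row = {"order_id": 17, "tags": ["a", "b", "c"]}

# Each array element becomes its own output row; every non-array
# column is duplicated onto each of them.
exploded = [
    {**{k: v for k, v in row.items() if k != "tags"}, "tag": tag}
    for tag in row["tags"]
]
print(exploded)
# [{'order_id': 17, 'tag': 'a'}, {'order_id': 17, 'tag': 'b'}, {'order_id': 17, 'tag': 'c'}]
```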

Z-Order Sort (professional+)

Z-Order Sort reorders rows using a space-filling curve that preserves multi-dimensional data locality. Query engines that read the output files can skip large ranges when filtering on any combination of the sorted columns. Configuration:
  • Columns: Two or more columns to include in the z-order sort.
Example: For an orders table filtered by region and order_date, z-ordering on both columns keeps related data co-located so queries skip 70–85% of files. When to use: Before writing to Iceberg or Delta Lake destinations, especially when downstream queries filter on multiple columns simultaneously. See the full Z-Order Sort reference for column selection tips and performance impact.
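The idea behind the space-filling curve can be sketched with a Morton (z-order) key: interleave the bits of the integer-encoded sort columns, then sort by the combined key so rows close in either dimension tend to land near each other. This is a conceptual illustration, not the node's actual implementation, and the integer encodings for region and order_date are assumed:

```python
def morton_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions: y
    return key

# Hypothetical orders: region as a small int code, order_date as a day number.
orders = [{"region": r, "day": d} for r in range(4) for d in range(4)]
orders.sort(key=lambda o: morton_key(o["region"], o["day"]))
```

Because neither column dominates the key, a filter on region alone or order_date alone can still skip large contiguous ranges of the output, unlike a plain lexicographic sort where only the leading column benefits.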

Patterns on the canvas

1. Shrink early: Place Filter and column projection (where available) close to the source to save compute on joins.
2. Sort before deterministic dedupe: When duplicate resolution depends on time or version columns, Sort then Unique.
3. Validate with preview: After Sample, confirm distributions still reflect production skew if you use samples for QA sign-off.

Related pages

  • Column transforms: Change types, derive fields, and apply window logic.
  • Aggregation: Roll up after you have narrowed and cleaned rows.
  • Z-Order Sort reference: Full guide to multi-dimensional sort configuration and performance.
  • Explode reference: Detailed explode configuration and patterns.