
Overview

Streaming lake ingestion enables continuous, near-real-time data landing into managed lakehouse tables. Instead of waiting for a pipeline run to complete, the destination commits data at configurable intervals — creating a new Iceberg snapshot and Delta version with each micro-batch.

How It Works

In streaming mode, two thresholds govern when a commit occurs:
| Threshold | Default | Description |
| --- | --- | --- |
| Commit interval | 60 seconds | Maximum wall-clock time between commits |
| Row limit | 100,000 rows | Maximum rows buffered before a forced commit |
Whichever threshold is reached first triggers the commit. Each commit:
  1. Flushes the Parquet buffer to cloud storage
  2. Creates an Iceberg snapshot with manifest entries
  3. Creates a Delta log version with add actions
  4. Resets the buffer and timers
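The dual-threshold trigger and commit cycle above can be sketched as follows. This is a minimal illustration, not the product's implementation; the `MicroBatchBuffer` class and its method names are hypothetical, and the storage, Iceberg, and Delta steps are elided.

```python
import time

# Hypothetical sketch of the dual-threshold commit logic described above.
class MicroBatchBuffer:
    def __init__(self, commit_interval_s=60, row_limit=100_000):
        self.commit_interval_s = commit_interval_s
        self.row_limit = row_limit
        self.rows = 0
        self.last_commit = time.monotonic()

    def add(self, n_rows):
        self.rows += n_rows

    def should_commit(self, now=None):
        """True when either threshold is reached, whichever comes first."""
        now = time.monotonic() if now is None else now
        return (self.rows >= self.row_limit
                or now - self.last_commit >= self.commit_interval_s)

    def commit(self):
        # Steps 1-3 (flush Parquet, Iceberg snapshot, Delta version) elided.
        committed = self.rows
        self.rows = 0                      # step 4: reset the buffer...
        self.last_commit = time.monotonic()  # ...and the timer
        return committed
```

With the defaults, a burst of 110,000 rows commits immediately on the row limit, while a trickle of rows commits once the 60-second interval elapses.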

Arrow-Native Path

For maximum throughput, streaming mode operates on Arrow RecordBatches directly — avoiding row-by-row serialization. Sources that emit Arrow natively (Kafka with Arrow encoding, Parquet files, JDBC Arrow Flight) skip the intermediate record conversion entirely.
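The difference between the two paths can be illustrated with a toy sketch. Here a "RecordBatch" is stood in for by a plain columnar dict, and the function names are hypothetical; the point is that the row path pivots every batch into per-row objects, while the Arrow-native path forwards each batch untouched.

```python
# Row path: pivot columns into per-row dicts (per-row allocation and copying).
def rows_from_batch(batch):
    cols = list(batch)
    return [dict(zip(cols, vals)) for vals in zip(*batch.values())]

def write_row_path(batches, sink):
    for batch in batches:
        for row in rows_from_batch(batch):  # per-row conversion cost
            sink.append(row)

# Arrow-native path: the batch is forwarded as-is, no pivot, no per-row work.
def write_arrow_path(batches, sink):
    for batch in batches:
        sink.append(batch)
```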

Configuration

Enable streaming in the Managed Lakehouse destination node settings:
  1. Expand Advanced Settings
  2. Toggle Streaming micro-batch mode on
  3. Set the Commit interval (seconds)
  4. Set the Row limit
Or via pipeline JSON:
```json
{
  "managedLakehouseSettings": {
    "streamingMode": true,
    "streamingCommitInterval": 60,
    "streamingCommitRowLimit": 100000
  }
}
```

Backpressure

When storage upload latency increases — due to network congestion, throttling, or large batches — the engine automatically reduces batch sizes to prevent memory pressure:
| Avg Commit Latency | Batch Size | Action |
| --- | --- | --- |
| < 2 seconds | 10,000 (default) | Normal operation |
| 2–5 seconds | 5,000 | Moderate backpressure: batch halved |
| > 5 seconds | 2,000 | Heavy backpressure: warning logged |
The GetBackpressureProfile() method on the destination provides these signals to the pipeline engine, which adjusts ArrowTuning accordingly.
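The latency-to-batch-size mapping in the table above amounts to a simple step function. The following sketch uses the thresholds and sizes from the table; the function name is illustrative and is not the product's `GetBackpressureProfile()` API.

```python
# Illustrative step function matching the backpressure table above.
def backpressure_batch_size(avg_commit_latency_s: float) -> int:
    if avg_commit_latency_s < 2.0:
        return 10_000  # normal operation (default batch size)
    if avg_commit_latency_s <= 5.0:
        return 5_000   # moderate backpressure: batch halved
    return 2_000       # heavy backpressure: warning logged upstream
```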

Monitoring

Pipeline Run Summary

Streaming metrics appear in the run summary:
```json
{
  "streamingMode": true,
  "totalCommits": 42,
  "totalRows": 4200000,
  "pendingRows": 15000,
  "avgCommitLatency": "850.3ms"
}
```

Key Metrics

| Metric | What It Tells You |
| --- | --- |
| totalCommits | Total micro-batch commits during the run |
| totalRows | Total rows written across all commits |
| pendingRows | Rows currently buffered awaiting the next commit |
| avgCommitLatency | Average time per commit (write + catalog update) |
If avgCommitLatency consistently exceeds 5 seconds, consider increasing the commit interval or scaling your storage throughput.
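A run-summary check for this condition might look like the following. The field names follow the JSON example above and the 5-second threshold follows the guidance in this section; the function itself is a hypothetical monitoring helper, not part of the product.

```python
# Hypothetical check over the run-summary metrics shown above.
def needs_tuning(summary: dict, latency_threshold_ms: float = 5000.0) -> bool:
    """True if a streaming run's average commit latency exceeds the threshold."""
    # avgCommitLatency is reported as a string like "850.3ms"
    latency_ms = float(summary["avgCommitLatency"].rstrip("ms"))
    return bool(summary["streamingMode"]) and latency_ms > latency_threshold_ms
```

If this returns true consistently, increase the commit interval or scale storage throughput, as noted above.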

Interaction with Other Features

Compaction

Streaming creates many small Parquet files. The maintenance scheduler automatically compacts these into larger files during the weekly compaction cycle. For high-throughput tables, consider triggering manual compaction more frequently.

Z-Order Sort

When z-order sort is configured alongside streaming mode, each micro-batch is z-ordered independently before writing. Global z-ordering across batches is achieved through compaction.

Data Contracts

Contract validation runs on every micro-batch. In block mode, invalid records from each batch are routed to the DLQ, and only valid records are committed.
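In block mode, each micro-batch is effectively partitioned into valid and invalid records, something like the sketch below. The `validate` predicate and the record shape are hypothetical stand-ins for the configured contract.

```python
# Sketch of per-batch contract validation in block mode: invalid records
# are routed to the DLQ, only valid records are committed.
def apply_contract(batch, validate):
    valid, dlq = [], []
    for record in batch:
        (valid if validate(record) else dlq).append(record)
    return valid, dlq
```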

Tier Limits

| | Professional | Premium | Enterprise |
| --- | --- | --- | --- |
| Streaming tables | 2 | 10 | Unlimited |
| Min commit interval | 5 min | 1 min | 10 sec |

Best Practices

Start conservative

Begin with 60-second intervals and adjust based on observed latency and downstream freshness requirements.

Watch memory

If the row limit is very high, each buffered batch consumes significant memory. Balance commit frequency against memory usage.

Enable compaction

Streaming generates many small files. Ensure table maintenance is enabled for automatic compaction.

Arrow-native sources

Pair streaming with Arrow-native sources (Kafka, Parquet) for the highest throughput.

Related

Table Maintenance: compaction optimizes small files from streaming

Streaming overview: general streaming and CDC capabilities