Overview
Streaming lake ingestion enables continuous, near-real-time data landing into managed lakehouse tables. Instead of waiting for a pipeline run to complete, the destination commits data at configurable intervals, creating a new Iceberg snapshot and Delta version with each micro-batch.
How It Works
In streaming mode, two thresholds govern when a commit occurs:

| Threshold | Default | Description |
|---|---|---|
| Commit interval | 60 seconds | Maximum wall-clock time between commits |
| Row limit | 100,000 rows | Maximum rows buffered before forced commit |
When either threshold is reached, the destination:

- Flushes the Parquet buffer to cloud storage
- Creates an Iceberg snapshot with manifest entries
- Creates a Delta log version with `add` actions
- Resets the buffer and timers
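The dual-threshold trigger and commit steps above can be sketched in Python. This is an illustrative model only; the class, method names, and the `storage`/`catalog` collaborators are hypothetical, not the product's actual API:

```python
import time

class MicroBatchBuffer:
    """Illustrative sketch of the dual-threshold commit trigger.

    A commit fires when EITHER the wall-clock interval elapses
    OR the buffered row count reaches the limit.
    """

    def __init__(self, commit_interval_s=60, row_limit=100_000):
        self.commit_interval_s = commit_interval_s
        self.row_limit = row_limit
        self.rows = []
        self.last_commit = time.monotonic()

    def should_commit(self):
        elapsed = time.monotonic() - self.last_commit
        return elapsed >= self.commit_interval_s or len(self.rows) >= self.row_limit

    def commit(self, storage, catalog):
        # 1. Flush the Parquet buffer to cloud storage
        storage.write_parquet(self.rows)
        # 2/3. Create an Iceberg snapshot and a Delta log version
        catalog.commit_snapshot(row_count=len(self.rows))
        # 4. Reset the buffer and timers
        self.rows = []
        self.last_commit = time.monotonic()
```

Note that the row limit is checked on every append while the interval is a fallback, so a quiet source still commits at least once per interval.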
Arrow-Native Path
For maximum throughput, streaming mode operates on Arrow RecordBatches directly, avoiding row-by-row serialization. Sources that emit Arrow natively (Kafka with Arrow encoding, Parquet files, JDBC Arrow Flight) skip the intermediate record conversion entirely.
Configuration
Enable streaming in the Managed Lakehouse destination node settings:

- Expand Advanced Settings
- Toggle Streaming micro-batch mode on
- Set the Commit interval (seconds)
- Set the Row limit
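Expressed as a configuration fragment, the same settings might look like this (field names are illustrative; the actual configuration keys may differ):

```json
{
  "streaming": {
    "enabled": true,
    "commitIntervalSeconds": 60,
    "rowLimit": 100000
  }
}
```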
Backpressure
When storage upload latency increases, whether due to network congestion, throttling, or large batches, the engine automatically reduces batch sizes to prevent memory pressure:

| Avg Commit Latency | Batch Size | Action |
|---|---|---|
| < 2 seconds | 10,000 (default) | Normal operation |
| 2–5 seconds | 5,000 | Moderate backpressure — halved batch |
| > 5 seconds | 2,000 | Heavy backpressure — warning logged |
The GetBackpressureProfile() method on the destination provides these signals to the pipeline engine, which adjusts ArrowTuning accordingly.
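The tiering in the table above can be sketched as a simple threshold function. This is illustrative only; the engine's real heuristics and the GetBackpressureProfile() contract may differ:

```python
def target_batch_size(avg_commit_latency_s: float) -> int:
    """Map average commit latency to a target Arrow batch size.

    Mirrors the backpressure table: normal operation below 2 s,
    halved batches at 2-5 s, and heavy backpressure above 5 s.
    """
    if avg_commit_latency_s < 2.0:
        return 10_000   # normal operation (default batch size)
    if avg_commit_latency_s <= 5.0:
        return 5_000    # moderate backpressure: halved batch
    return 2_000        # heavy backpressure: warning logged
```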
Monitoring
Pipeline Run Summary
Streaming metrics appear in the run summary.
Key Metrics
| Metric | What It Tells You |
|---|---|
| totalCommits | Total micro-batch commits during the run |
| totalRows | Total rows written across all commits |
| pendingRows | Rows currently buffered awaiting the next commit |
| avgCommitLatency | Average time per commit (write + catalog update) |
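Note that avgCommitLatency is total commit time divided by totalCommits. A hypothetical aggregation over per-commit records, using the same metric names, might look like:

```python
def summarize(commits, pending_rows=0):
    """Aggregate per-commit records into run-summary metrics.

    Each commit record is a (rows_written, commit_latency_seconds)
    pair; keys mirror the metric names in the table above.
    """
    total_commits = len(commits)
    total_rows = sum(rows for rows, _ in commits)
    total_latency = sum(latency for _, latency in commits)
    return {
        "totalCommits": total_commits,
        "totalRows": total_rows,
        "pendingRows": pending_rows,
        "avgCommitLatency": total_latency / total_commits if total_commits else 0.0,
    }
```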
Interaction with Other Features
Compaction
Streaming creates many small Parquet files. The maintenance scheduler automatically compacts these into larger files during the weekly compaction cycle. For high-throughput tables, consider triggering manual compaction more frequently.
Z-Order Sort
When z-order sort is configured alongside streaming mode, each micro-batch is z-ordered independently before writing. Global z-ordering across batches is achieved through compaction.
Data Contracts
Contract validation runs on every micro-batch. In block mode, invalid records from each batch are routed to the DLQ, and only valid records are committed.
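A per-batch validation pass in block mode can be sketched as follows; the `is_valid` predicate and the `dlq` sink are stand-ins for the configured contract and dead-letter queue:

```python
def validate_batch(records, is_valid, dlq):
    """Split one micro-batch under a data contract in block mode.

    Valid records proceed to the commit; invalid records are
    routed to the dead-letter queue (DLQ) instead of failing the batch.
    """
    valid = []
    for record in records:
        if is_valid(record):
            valid.append(record)
        else:
            dlq.append(record)
    return valid  # only these rows reach the micro-batch commit
```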
Tier Limits
| | Professional | Premium | Enterprise |
|---|---|---|---|
| Streaming tables | 2 | 10 | Unlimited |
| Min commit interval | 5 min | 1 min | 10 sec |
Best Practices
Start conservative
Begin with 60-second intervals and adjust based on observed latency and downstream freshness requirements.
Watch memory
If the row limit is very high, each buffered batch consumes correspondingly more memory. Balance commit frequency against memory usage.
Enable compaction
Streaming generates many small files. Ensure table maintenance is enabled for automatic compaction.
Arrow-native sources
Pair streaming with Arrow-native sources (Kafka, Parquet) for the highest throughput.
Related
Table Maintenance
Compaction optimizes small files from streaming
Streaming overview
General streaming and CDC capabilities