Skip to main content

Overview

Column statistics are automatically computed during every Parquet write and stored in both Iceberg snapshot properties and Delta Lake add action metadata. Query engines use these statistics for predicate pushdown — skipping entire data files that cannot contain matching rows.

What’s Computed

StatisticDescriptionStorage
Min valueSmallest value in the columnIceberg manifest + Delta stats JSON
Max valueLargest value in the columnIceberg manifest + Delta stats JSON
Null countNumber of null valuesIceberg manifest + Delta stats JSON
Distinct estimateApproximate unique valuesIceberg snapshot properties

How It Improves Query Performance

When a query engine reads an Iceberg table or Delta table, it checks the per-file column statistics before downloading data files. If a file’s max value for date is 2025-12-31, any query filtering date > 2026-01-01 skips the file entirely. This can reduce I/O by 90%+ for selective queries.

Viewing Column Stats

Navigate to Managed Lakehouse → select a table → Column Stats tab. The Column Stats panel displays:
ColumnTypeMinMaxNullsDistinct
user_idint641982,4510245,112
event_typestringclickview128
created_attimestamp2025-01-012026-04-090456,321
amountfloat640.019,999.991,2038,742

API

GET /api/managed-lakehouse/tables/{tableId}/column-stats

Response

{
  "columns": [
    {
      "name": "user_id",
      "type": "int64",
      "min": "1",
      "max": "982451",
      "nullCount": 0,
      "distinctEstimate": 245112
    }
  ],
  "snapshotId": 7834921,
  "computedAt": "2026-04-09T14:30:00Z"
}

Thresholds

Tables with more than 200 columns only compute statistics for primary keys, partition columns, sort columns, and system columns to keep commit latency low.
Table WidthColumns Profiled
≤ 200 columnsAll columns
> 200 columnsKeys, partition, sort, and system columns only

Storage Details

Iceberg

Statistics are stored as snapshot-level properties with keys:
planasonix.stats.<column>.min
planasonix.stats.<column>.max
planasonix.stats.<column>.null_count
planasonix.stats.<column>.distinct_count

Delta Lake

Statistics are embedded in the stats JSON field of each add action in the transaction log:
{
  "numRecords": 50000,
  "minValues": { "user_id": 1, "amount": 0.01 },
  "maxValues": { "user_id": 982451, "amount": 9999.99 },
  "nullCount": { "user_id": 0, "amount": 1203 }
}

Best Practices

Use with Z-Order

Z-ordered data has tighter min/max ranges per file, making statistics-based pruning even more effective.

Compact regularly

Compaction recalculates statistics across merged files for up-to-date min/max bounds.

Z-Order Sort

Improve data locality for multi-column queries

Table Maintenance

Compaction refreshes column statistics automatically