[SPARK-57839][SQL] Support the nanosecond-precision timestamp types in CBO statistics estimation by yadavay-amzn · Pull Request #56941 · apache/spark

yadavay-amzn · 2026-07-01T20:27:03Z

What changes were proposed in this pull request?

Adds cost-based optimizer (CBO) statistics estimation support for the nanosecond-precision timestamp types (TimestampNTZNanosType, TimestampLTZNanosType), mirroring the existing microsecond TimestampType handling at every CBO site: EstimationUtils.toDouble/fromDouble, FilterEstimation.evaluateBinary/evaluateInSet, UnionEstimation.isTypeSupported, and CatalogColumnStat min/max (de)serialization. Values are converted on the nanosecond scale (epochMicros * 1000 + nanosWithinMicro). Column-stat min/max are serialized with a fraction-preserving 9-digit nanosecond formatter so they round-trip losslessly.

It also excludes the nanosecond timestamp types from histogram collection (CommandUtils.supportsHistogram) - the histogram aggregates do not yet accept them - so ANALYZE TABLE ... FOR COLUMNS on a nanosecond-precision column collects basic statistics (min/max/ndv) without error.

Why are the changes needed?

Filters/joins/unions over nanosecond-precision timestamp columns previously fell back to default selectivity because these types were not among the types CBO estimates. They have a natural numeric ordering, so they can be estimated the same way as microsecond timestamps.

Does this PR introduce any user-facing change?

No. Affects CBO estimation / plan costing only; query results are unchanged.

How was this patch tested?

New FilterEstimationSuite tests (range / = / IN selectivity) for the nanosecond types, nanosecond columns added to the Union/Join estimation tests, and a dedicated lossless min/max round-trip test in StatisticsCollectionSuite (a sub-microsecond value survives toExternalString -> fromExternalString). All CBO estimation suites pass. An ANALYZE TABLE ... FOR COLUMNS test on nanosecond columns confirms basic stats are collected and histograms are skipped.

Was this patch authored or co-authored using generative AI tooling?

Authored with assistance by Claude Opus 4.8.

…n CBO statistics estimation

…ction; add LTZ/sub-micro estimation tests

MaxGekk

0 blocking, 2 non-blocking, 0 nits.
Correct, complete extension of CBO estimation to the nanosecond timestamp types, mirroring micro TimestampType at every site and correctly excluding nanos from histogram collection (the AnyTimestampNanoType => false case is rightly ordered before the DatetimeType catch-all). Persisted catalog stats are exact (lossless 9-digit formatter, verified by the round-trip test). Part of the SPARK-56822 nanos umbrella. Both items below are non-blocking.

Correctness (1)

EstimationUtils.scala:156: the join-key stat intersect reconstructs nanos min/max via the lossy fromDouble (~256ns imprecision at 2024 magnitudes), unlike micro TimestampType which is exact there — perturbs estimated join-output stats only (never a query result, never persisted to the catalog). See inline.

Suggestions (1)

FilterEstimationSuite.scala:1035: the nanos tests use small epochMicros (1000–1009) where toDouble is exact, so the documented lossy large-magnitude regime is untested — add a realistic 2024-era case plus an assert(max > min) collapse guard. See inline.

Verification

Traced the precision story per path. Filter binary and IN persist the original min/max (Some(literal.value) / the original set elements), so they are exact. The join-intersect path (JoinEstimation → ValueInterval.intersect → fromDouble) does reconstruct nanos bounds lossily and propagate them into estimated join-output stats — but estimation-only, never catalog or result. Catalog min/max serialize through a separate lossless 9-digit formatter (round-trip test with nanosWithinMicro=789). Also proved assert(max > min) (FilterEstimation) is unreachable when two nanos bounds collapse to one double: for every comparison op the no-overlap/complete-overlap guards are exhaustive at min == max, and toDouble is monotonic so max < min is impossible.

PR description suggestions

Scope the "serialized … so they round-trip losslessly" statement to catalog serialization — it is precise there, but the join-intersect estimation path reconstructs nanos bounds through the lossy fromDouble and is not lossless.

MaxGekk · 2026-07-02T10:17:09Z

      case TimestampType => double.toLong
+      case _: AnyTimestampNanoType =>
+        val nanos = double.toLong
+        TimestampNanosVal.fromParts(Math.floorDiv(nanos, 1000L), Math.floorMod(nanos, 1000).toShort)


Non-blocking. This fromDouble nanos arm is lossy at the nanos scale (epochMicros*1000+nanosWithinMicro exceeds 2^53 for real-world dates; ULP ~256ns at 2024), which the toDouble comment above notes. On the Filter paths that's harmless — the updated min/max are the original values, not fromDouble reconstructions. But the join path does round-trip through here: JoinEstimation → ValueInterval.intersect (ValueInterval.scala:90-91) calls fromDouble on the intersected nanos min/max, and those flow into the join output's estimated ColumnStat (JoinEstimation.scala:253), so a joined nanos key's estimated bounds can be perturbed by up to ~256ns. Micro TimestampType (double.toLong) is exact there (1.7e15 < 2^53), so this is a nanos-only divergence.

This is estimation-only (never a query result, never persisted to the catalog — catalog min/max come from ANALYZE via the lossless formatter), well below CBO's noise floor, and matches your "Acceptable for estimation only" note — so not a blocker. Consider a one-line comment on the join/intersect path (or here, noting the reconstructed value is used for join-stat propagation) so the approximation is documented where the value is actually persisted into stats, not only at the conversion site.

Done in the latest commit: added a matching lossy-precision comment on the fromDouble nanos arm (referencing the toDouble comment above), and added two high-magnitude filter estimation tests (NTZ + LTZ) using a 2024 timestamp (epochMicros ~1.7e15, so epoch-nanos ~1.7e18, well past 2^53) that exercise the lossy toDouble/fromDouble path and assert range selectivity is still computed correctly.

MaxGekk · 2026-07-02T10:17:09Z

    }
  }

+  // Tests for nanosecond-precision timestamp types (SPARK-57839)


Non-blocking test-coverage suggestion. These nanos cases use epochMicros 1000–1009, where toDouble is exact — so the lossy large-magnitude regime the toDouble comment warns about has no coverage. Two high-value additions:

A range/IN case with realistic 2024-era nanos (epochMicros ≈ 1.7e15, distinct nanosWithinMicro) asserting selectivity stays sane — this is what would exercise the actual lossy path.

An assert(max > min) regression guard: a partial-overlap predicate where two stored nanos bounds are within one ULP (collapse to the same double), proving no AssertionError. The branch is safe today (the overlap guards make it unreachable at min == max), but a test locks that against future edits.

Two optional hardening cases: a JoinEstimationSuite equi-join with 2024-era bounds within ~256ns asserting the estimated output ColumnStat.min/max (documents the imprecision from the other comment), and a negative / pre-1970 nanos catalog round-trip exercising fromDouble's floorDiv/floorMod (the current round-trip test uses a positive 2022 value).

…nd fromDouble lossy comment

yadavay-amzn added 2 commits July 1, 2026 20:19

[SPARK-57839][SQL] Support the nanosecond-precision timestamp types i…

660f2c8

…n CBO statistics estimation

[SPARK-57839][SQL] Exclude nanosecond timestamps from histogram colle…

9b8f2c9

…ction; add LTZ/sub-micro estimation tests

MaxGekk reviewed Jul 2, 2026

View reviewed changes

[SPARK-57839][SQL] Add high-magnitude nanos filter estimation tests a…

b82965d

…nd fromDouble lossy comment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57839][SQL] Support the nanosecond-precision timestamp types in CBO statistics estimation#56941

[SPARK-57839][SQL] Support the nanosecond-precision timestamp types in CBO statistics estimation#56941
yadavay-amzn wants to merge 3 commits into
apache:masterfrom
yadavay-amzn:SPARK-57839

yadavay-amzn commented Jul 1, 2026 •

edited

Loading

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk Jul 2, 2026

Uh oh!

yadavay-amzn Jul 2, 2026

Uh oh!

MaxGekk Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yadavay-amzn commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Correctness (1)

Suggestions (1)

Verification

PR description suggestions

Uh oh!

MaxGekk Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

yadavay-amzn Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yadavay-amzn commented Jul 1, 2026 •

edited

Loading