[SPARK-57829][SQL] Support window, session_window and window_time over nanosecond-precision timestamps #56951

Open
yadavay-amzn wants to merge 2 commits into apache:master from yadavay-amzn:SPARK-57829
@yadavay-amzn

Contributor

What changes were proposed in this pull request?

Extends window, session_window, and window_time to accept nanosecond-precision timestamp columns (TimestampNTZNanosType, TimestampLTZNanosType). Window boundaries are computed on the nanosecond scale: durations are converted from microseconds to nanoseconds (multiplied by 1000) with Math.multiplyExact, so an overflow raises an error instead of wrapping silently; conversions between timestamps and epoch nanoseconds go through PreciseTimestampNanosConversion; and window_time subtracts 1 nanosecond from the window end (versus 1 microsecond for microsecond input). The window start and end preserve the input's nanosecond precision.
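
The boundary arithmetic described above can be illustrated with a minimal, standalone sketch. It is not the code in this PR: the object and method names (NanosWindowSketch, microsToNanos, tumblingWindow, windowTime) are hypothetical, timestamps are modeled as plain epoch-nanosecond Long values rather than Spark's internal types, and the window start is found with Math.floorMod, which is an assumption of this sketch.

```scala
object NanosWindowSketch {
  // 1 microsecond = 1000 nanoseconds. Defined locally so the sketch has no Spark dependency.
  val NanosPerMicros: Long = 1000L

  /** Scales a microsecond duration to nanoseconds; throws ArithmeticException on Long overflow. */
  def microsToNanos(micros: Long): Long =
    Math.multiplyExact(micros, NanosPerMicros)

  /** Half-open tumbling window [start, end) containing tsNanos, in epoch nanoseconds. */
  def tumblingWindow(tsNanos: Long, durationMicros: Long): (Long, Long) = {
    val durationNanos = microsToNanos(durationMicros)
    // floorMod keeps the remainder non-negative, so pre-epoch timestamps round down too.
    val start = Math.subtractExact(tsNanos, Math.floorMod(tsNanos, durationNanos))
    (start, Math.addExact(start, durationNanos))
  }

  /** Last instant inside the window: the exclusive end minus one nanosecond. */
  def windowTime(windowEndNanos: Long): Long =
    Math.subtractExact(windowEndNanos, 1L)
}
```

For a 1-second window (1000000 microseconds), a timestamp at 1500000123 epoch nanoseconds falls in the window from 1000000000 to 2000000000, and the corresponding window time is 1999999999. Using the exact-arithmetic methods throughout means an out-of-range duration fails loudly instead of producing a wrapped boundary.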

This also updates the batch session_window aggregation iterators (MergingSessionsIterator and UpdatingSessionsIterator) to read and compare session boundaries through the nanosecond-aware TimestampNanosVal accessors, via a shared SessionNanosHelper. These sort-merge iterators sit on the shared (non-streaming) batch aggregation path, where getLong would misread the variable-length TimestampNanosVal representation, so session_window cannot work correctly on nanosecond input without this change.
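
To show why the boundary comparison has to happen at nanosecond granularity, here is a small conceptual sketch of session merging over sorted timestamps. It does not reproduce MergingSessionsIterator, UpdatingSessionsIterator, SessionNanosHelper, or the TimestampNanosVal layout: the names below are hypothetical, timestamps are plain epoch-nanosecond Long values, and the merge rule (an event joins the current session when its timestamp is at or before the session end) is an assumption of this sketch.

```scala
object SessionMergeSketch {
  /** A session covering [startNanos, endNanos), in epoch nanoseconds. */
  final case class Session(startNanos: Long, endNanos: Long)

  /**
   * Merges event timestamps (already sorted ascending) into sessions.
   * Each event opens a candidate session [ts, ts + gapNanos); it is folded
   * into the current session when it starts at or before that session's end.
   */
  def mergeSessions(sortedTsNanos: Seq[Long], gapNanos: Long): List[Session] =
    sortedTsNanos.foldLeft(List.empty[Session]) { (acc, ts) =>
      val end = Math.addExact(ts, gapNanos)
      acc match {
        case cur :: rest if ts <= cur.endNanos =>
          // Extend the current session; it never shrinks.
          Session(cur.startNanos, math.max(cur.endNanos, end)) :: rest
        case _ =>
          // Gap exceeded (or first event): start a new session.
          Session(ts, end) :: acc
      }
    }.reverse
}
```

With a 500-nanosecond gap, events at 0, 400, and 2000 nanoseconds produce two sessions, one from 0 to 900 and one from 2000 to 2500. All of these values are below one microsecond apart in places, so a comparison that read the boundaries at the wrong width or precision could not separate them.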

Why are the changes needed?

Part of nanosecond-precision timestamp support (SPARK-56822), building on the event-time watermark support in SPARK-57830. Previously window / session_window / window_time rejected nanosecond-precision timestamp columns.

Does this PR introduce any user-facing change?

Yes - window, session_window, and window_time now accept nanosecond-precision timestamp columns, and the resulting window bounds preserve nanosecond precision.

How was this patch tested?

New DataFrameTimeWindowingSuite and DataFrameSessionWindowingSuite tests for NTZ and LTZ nanosecond inputs, including value-level checkAnswer assertions for both (window bounds, session merging, and window_time at nanosecond granularity). All existing window/session tests pass.

Was this patch authored or co-authored using generative AI tooling?

Authored with assistance by Claude Opus 4.8.

import org.apache.spark.sql.catalyst.util.DateTimeConstants.{NANOS_PER_DAY, NANOS_PER_MICROS}

override def nullIntolerant: Boolean = true
override def inputTypes: Seq[AbstractDataType] = Seq(CalendarIntervalType)

Member

We shouldn't support the legacy type CalendarIntervalType.

Contributor Author

Done - replaced the custom CalendarIntervalToNanos with CalendarIntervalToMicros typed as DayTimeIntervalType, so the nanosecond session-end now goes through the supported, timezone-aware TimestampAddInterval path. No legacy CalendarIntervalType in the nanos computation; window/session tests pass.
