Skip to content

Fix instant-DDL deadlock on GhostTableMigrated changelog signal#1736

Open
peterbollen wants to merge 1 commit into
github:masterfrom
peterbollen:fix/instant-ddl-changelog-deadlock
Open

Fix instant-DDL deadlock on GhostTableMigrated changelog signal#1736
peterbollen wants to merge 1 commit into
github:masterfrom
peterbollen:fix/instant-ddl-changelog-deadlock

Conversation

@peterbollen

Copy link
Copy Markdown

Related issue: #1735

Description

Fixes a deadlock that hangs gh-ost during cleanup when --attempt-instant-ddl is used and the instant ALTER succeeds.

initiateApplier writes a GhostTableMigrated changelog row. The streamer's changelog listener callback (onChangelogStateEvent) publishes that signal synchronously via base.SendWithContext(ctx, ghostTableMigrated, true) while holding EventsStreamer.listenersMutex (the send runs inside notifyListeners, which holds the mutex for the whole callback).

On the normal path there is a matching <-mgtr.ghostTableMigrated receiver. The instant-DDL success path, however, returns early right after finalCleanup() and never receives, so the send blocks forever holding listenersMutex. finalCleanup() then closes the binlog reader, whose rows-event decode callback (shouldDecodeRowsEvent) needs the same mutex, and BinlogSyncer.Close() waits for that goroutine to exit → permanent deadlock.

Fix: drain the GhostTableMigrated signal on the instant-DDL success path before finalCleanup(), mirroring the receive on the normal path. The drain is extracted into Migrator.drainGhostTableMigrated() and guarded by !Resume, since resume migrations never emit the signal (initiateApplier only writes it when !Revert && !Resume).

Tests

  • TestMigratorDrainGhostTableMigrated — the drain consumes the signal and unblocks the publisher; it is skipped for resume migrations.
  • TestEventsStreamerInstantDDLDeadlockIsResolvedByDraining — reproduces the exact deadlock (listener callback blocked on the send while holding listenersMutex; shouldDecodeRowsEvent blocked on the same mutex) and proves that draining the signal resolves it.

Both pass with -race.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

When --attempt-instant-ddl succeeds, Migrate() returns early after
finalCleanup() without ever receiving from the ghostTableMigrated channel.

initiateApplier writes a GhostTableMigrated changelog row, and the streamer's
listener callback (onChangelogStateEvent) publishes that signal synchronously
via SendWithContext while holding EventsStreamer.listenersMutex. On the normal
path a receiver drains it, but the instant-DDL path skips that receive, so the
send blocks forever holding the mutex. finalCleanup() then closes the binlog
reader, whose rows-event decode callback (shouldDecodeRowsEvent) needs the same
mutex, and Close() waits for that goroutine to exit -> permanent deadlock.

Fix: drain the GhostTableMigrated signal on the instant-DDL success path before
finalCleanup, mirroring the existing receive on the normal path. Resume
migrations never emit the signal, so draining is guarded by !Resume.

Adds regression tests: a migrator-level test for drainGhostTableMigrated (drains
the signal and unblocks the publisher; skips for resume migrations) and a
streamer-level test that reproduces the exact deadlock and proves draining
resolves it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant