Fix instant-DDL deadlock on GhostTableMigrated changelog signal #1736
Open
peterbollen wants to merge 1 commit into
When --attempt-instant-ddl succeeds, Migrate() returns early after finalCleanup() without ever receiving from the ghostTableMigrated channel.

initiateApplier writes a GhostTableMigrated changelog row, and the streamer's listener callback (onChangelogStateEvent) publishes that signal synchronously via SendWithContext while holding EventsStreamer.listenersMutex. On the normal path a receiver drains it, but the instant-DDL path skips that receive, so the send blocks forever holding the mutex. finalCleanup() then closes the binlog reader, whose rows-event decode callback (shouldDecodeRowsEvent) needs the same mutex, and Close() waits for that goroutine to exit -> permanent deadlock.

Fix: drain the GhostTableMigrated signal on the instant-DDL success path before finalCleanup, mirroring the existing receive on the normal path. Resume migrations never emit the signal, so draining is guarded by !Resume.

Adds regression tests: a migrator-level test for drainGhostTableMigrated (drains the signal and unblocks the publisher; skips for resume migrations) and a streamer-level test that reproduces the exact deadlock and proves draining resolves it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Related issue: #1735
Description
Fixes a deadlock that hangs gh-ost during cleanup when `--attempt-instant-ddl` is used and the instant `ALTER` succeeds.

`initiateApplier` writes a `GhostTableMigrated` changelog row. The streamer's changelog listener callback (`onChangelogStateEvent`) publishes that signal synchronously via `base.SendWithContext(ctx, ghostTableMigrated, true)` while holding `EventsStreamer.listenersMutex` (the send runs inside `notifyListeners`, which holds the mutex for the whole callback).

On the normal path there is a matching `<-mgtr.ghostTableMigrated` receiver. The instant-DDL success path, however, returns early right after `finalCleanup()` and never receives, so the send blocks forever holding `listenersMutex`. `finalCleanup()` then closes the binlog reader, whose rows-event decode callback (`shouldDecodeRowsEvent`) needs the same mutex, and `BinlogSyncer.Close()` waits for that goroutine to exit → permanent deadlock.

Fix: drain the `GhostTableMigrated` signal on the instant-DDL success path before `finalCleanup()`, mirroring the receive on the normal path. The drain is extracted into `Migrator.drainGhostTableMigrated()` and guarded by `!Resume`, since resume migrations never emit the signal (`initiateApplier` only writes it when `!Revert && !Resume`).

Tests

- `TestMigratorDrainGhostTableMigrated`: the drain consumes the signal and unblocks the publisher; it is skipped for resume migrations.
- `TestEventsStreamerInstantDDLDeadlockIsResolvedByDraining`: reproduces the exact deadlock (listener callback blocked on the send while holding `listenersMutex`; `shouldDecodeRowsEvent` blocked on the same mutex) and proves that draining the signal resolves it.

Both pass with `-race`. `script/cibuild` returns with no formatting errors, build errors or unit test errors.
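For readers who want the shape of the guarded drain without opening the diff, here is a rough sketch. It is not the code from this PR: the `migrator` struct, its fields, the boolean return value, and the `publishAndDrain` helper are all hypothetical and exist only to make the behavior observable. It shows the two cases the migrator-level test covers: a normal run receives the signal and unblocks the publisher, while a resume run skips the receive because nothing is ever sent.

```go
package main

import (
	"fmt"
	"time"
)

// migrator is a hypothetical, trimmed-down stand-in for gh-ost's Migrator.
type migrator struct {
	resume             bool
	ghostTableMigrated chan bool
}

// drainGhostTableMigrated receives the pending signal unless this is a resume
// migration. The bool result is illustrative only; it reports whether a
// receive actually happened.
func (m *migrator) drainGhostTableMigrated() bool {
	if m.resume {
		// Resume migrations never emit the signal, so receiving would block forever.
		return false
	}
	<-m.ghostTableMigrated
	return true
}

// publishAndDrain starts a publisher (except for resume runs), drains, and
// reports whether the drain received and whether the publisher got unblocked.
func publishAndDrain(resume bool) (drained, published bool) {
	m := &migrator{resume: resume, ghostTableMigrated: make(chan bool)}
	done := make(chan struct{})
	if !resume {
		go func() {
			m.ghostTableMigrated <- true // blocks until the drain receives
			close(done)
		}()
	}
	drained = m.drainGhostTableMigrated()
	if resume {
		return drained, false
	}
	select {
	case <-done:
		published = true
	case <-time.After(time.Second):
	}
	return drained, published
}

func main() {
	d, p := publishAndDrain(false)
	fmt.Println("normal run: drained =", d, "publisher unblocked =", p)
	d, p = publishAndDrain(true)
	fmt.Println("resume run: drained =", d, "publisher unblocked =", p)
}
```

The guard matters as much as the receive: an unconditional drain would trade one hang for another on resume migrations, where no publisher exists.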