Python: unpin legacy CFG/ESSA from the AST cached stage (DCA measurement)#22116

Draft
yoff wants to merge 1 commit into yoff/python-shared-cfg-dataflow-flip from yoff/python-unpin-legacy-cfg-from-ast-stage

yoff commented Jul 2, 2026

Contributor

Note

Draft for measurement. This PR exists to run DCA and quantify how much of the shared-CFG dataflow flip's overhead is recovered by not computing the legacy CFG. It is based on the flip branch (yoff/python-shared-cfg-dataflow-flip, #21925), so the diff is exactly the one-commit change and the DCA a/b isolates the effect of the unpin.

What

The legacy CFG (Flow.qll) and legacy ESSA (Essa/SsaCompute/SsaDefinitions) are pinned into the always-on Stages::AST cached stage via Stages::AST::ref() (11 sites) and the matching Stages::AST::backref() disjuncts.

backref() is referenced nowhere and optimizes to 1 = 1, so the ref/backref pattern only controls stage assignment. Because a cached stage is materialised as a unit once any of its predicates is demanded — and every query demands e.g. Expr.toString() — this forces the legacy CFG/ESSA to be computed for every query.

After the shared-CFG dataflow flip, the security/dataflow queries no longer depend on the legacy CFG at all, so on the flip branch this computation is pure dead weight.

This PR removes those pins. Since Stages::AST::ref() is 1 = 1, the change is result-preserving — it only changes stage scheduling.
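
To make the scheduling argument concrete, here is a minimal toy model in Python. It is not the CodeQL evaluator: the `materialised` helper, the stage names `Legacy` and `SharedCfg`, and the predicate names are all hypothetical, and it ignores dependencies between stages. It only illustrates the rule stated above, that demanding one member of a cached stage computes every member of that stage.

```python
# Toy model of cached-stage scheduling; NOT the CodeQL evaluator.
# Every stage and predicate name here is illustrative only.

def materialised(stages, demanded):
    """Return the set of predicates computed when `demanded` is requested.

    `stages` maps a stage name to the set of predicates assigned to it.
    A stage is computed as a unit: demanding one member computes them all.
    """
    computed = set()
    for members in stages.values():
        if members & demanded:
            computed |= members
    return computed


# Before: legacy CFG/ESSA predicates are pinned into the always-on AST stage.
pinned = {
    "AST": {"Expr.toString", "legacy_cfg", "legacy_essa"},
    "SharedCfg": {"shared_cfg", "shared_ssa"},
}

# After: the pins are gone. Assumption: the legacy predicates end up together
# in a separate stage (real placement is left to the optimizer).
unpinned = {
    "AST": {"Expr.toString"},
    "Legacy": {"legacy_cfg", "legacy_essa"},
    "SharedCfg": {"shared_cfg", "shared_ssa"},
}

security_query = {"Expr.toString", "shared_cfg"}   # does not use legacy CFG
quality_query = {"Expr.toString", "legacy_essa"}   # genuinely uses legacy ESSA

before = materialised(pinned, security_query)
after = materialised(unpinned, security_query)
quality_after = materialised(unpinned, quality_query)

print("security, pinned:  ", sorted(before))
print("security, unpinned:", sorted(after))
print("quality, unpinned: ", sorted(quality_after))
```

In this model the security query stops computing the legacy predicates once they leave the AST stage, while the quality query still computes them, which mirrors the caveat that points-to/quality queries keep paying for legacy ESSA.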

Verification (result-preserving)

  • Full python-security-extended suite (52 queries) and django: the number of legacy CFG/ESSA predicate families materialised drops from ~165 to 0, with byte-identical results.
  • Cold, fresh-DB django: 59.9s → 57.6s wall (-j4); ~9.5s of legacy predicate self-time removed (≈1.3% of total analysis CPU).

Question DCA should answer

The legacy CFG that the security queries drag in (~9.5s CPU, 1.3%) is a fraction of the new CFG+SSA that the flip adds (~37.6s CPU, 5.3%), so a priori this recovers only about a quarter of the new-CFG cost, not the full flip overhead. DCA across the target set will give the authoritative per-project number.
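
The "about a quarter" estimate can be rechecked from the figures quoted in this description. The sketch below uses only those numbers; the variable names are made up for illustration, and the implied-total check simply divides each CPU figure by its quoted percentage to see whether the two pairs are roughly consistent.

```python
# Figures quoted in the PR description (django, cold fresh DB, -j4).
legacy_cpu_s = 9.5     # legacy CFG/ESSA predicate self-time removed
new_cfg_cpu_s = 37.6   # new CFG+SSA cost added by the flip
wall_before_s, wall_after_s = 59.9, 57.6

# Share of the new-CFG cost that unpinning the legacy CFG gives back.
recovered_fraction = legacy_cpu_s / new_cfg_cpu_s

# Relative wall-clock saving on the django run.
wall_saving_fraction = (wall_before_s - wall_after_s) / wall_before_s

# Total analysis CPU implied by each "seconds, percent" pair. The two
# estimates should roughly agree if the quoted percentages are consistent.
total_from_legacy_s = legacy_cpu_s / 0.013
total_from_new_s = new_cfg_cpu_s / 0.053

print(f"recovered share of new-CFG cost: {recovered_fraction:.1%}")
print(f"wall-clock saving: {wall_saving_fraction:.1%}")
print(f"implied total CPU: {total_from_legacy_s:.0f}s vs {total_from_new_s:.0f}s")
```

These are single-project numbers from one local run, so they only bound expectations; the DCA results remain the figure to trust.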

Caveats before this could merge

  • This is a global stage change. Security queries win; the points-to/quality queries (which genuinely use legacy ESSA) will still compute it, but now in their own stage(s) rather than sharing the AST stage. We need to confirm that this causes no regression for them.
  • A cleaner design would pin legacy CFG/ESSA to an explicit dedicated stage instead of leaving placement to the optimizer.

The legacy CFG (`Flow.qll`) and legacy ESSA (`Essa`/`SsaCompute`/
`SsaDefinitions`) were pinned into the always-on `Stages::AST` cached stage
via `Stages::AST::ref()` and the matching `backref()` disjuncts. Because a
cached stage is materialised as a unit once any of its predicates is demanded
(and every query demands e.g. `Expr.toString()`), this forced the legacy
CFG/ESSA to be computed for *every* query -- including the security/dataflow
queries, which after the shared-CFG dataflow flip no longer depend on the
legacy CFG at all.

Since `Stages::AST::ref()` is `1 = 1`, removing it is result-preserving; it
only changes stage scheduling. After this change the legacy CFG/ESSA is no
longer materialised for queries that do not genuinely reference it. Verified
on the full `python-security-extended` suite and on django: legacy CFG/ESSA
families materialised drop from ~165 to 0 with byte-identical results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions bot added the Python label Jul 2, 2026