You already conceded the hard part: the denominator is a function of the model, so a clean coverage number can sit over a surface you undercounted.
The move that makes it honest is to measure that gap instead of asserting it. You cannot see the true surface. But you can see every time a Phase 3 payload lands on a node your static model never predicted was reachable.
Call it a surprise rate. It is the empirical proxy for how wrong the denominator is. A low surprise rate earns the coverage number. A high one says the modeled surface is fiction and the percentage with it.
Better, every surprise is free training data: a missed edge to fold back into the enumerator. The surface model gets falsified by its own execution.
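Here is a minimal sketch of that diff, to make the idea concrete. Everything named in it is hypothetical: the `Landing` record, its `source`/`target` fields, and the `surprise_report` helper are illustrative and not taken from your pipeline. It also assumes the rate is taken over total Phase 3 landings; you may prefer a different denominator, such as distinct nodes or a per-run figure.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Landing:
    """One observed Phase 3 payload landing (hypothetical record shape)."""
    source: str  # node the payload was launched from
    target: str  # node the payload actually landed on


def surprise_report(
    predicted_reachable: set[str],
    landings: list[Landing],
) -> tuple[Optional[float], list[tuple[str, str]]]:
    """Diff observed landings against the static model's predicted surface.

    Returns (surprise_rate, missed_edges). The rate is None when there were
    no landings, since an empty run says nothing about the denominator.
    """
    # A surprise is any landing on a node the static model never predicted.
    surprises = [l for l in landings if l.target not in predicted_reachable]

    # Assumption: rate = surprising landings / all landings.
    rate = len(surprises) / len(landings) if landings else None

    # Deduplicated edges to feed back into the enumerator.
    missed_edges = sorted({(l.source, l.target) for l in surprises})
    return rate, missed_edges


if __name__ == "__main__":
    predicted = {"a", "b"}
    observed = [
        Landing("a", "b"),
        Landing("b", "c"),  # "c" was never predicted reachable
        Landing("b", "c"),
        Landing("a", "a"),
    ]
    rate, missed = surprise_report(predicted, observed)
    print(rate, missed)
```

Persisting `missed_edges` between runs is what keeps the signal from being dropped: each entry is a candidate edge for the enumerator to explain or absorb before the next coverage number is reported.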
Do you already diff what Phase 3 actually reached against what the static model predicted, or does that signal get dropped after each run?