Details
- Reviewers
asuth - Group Reviewers
dom-worker-reviewers - Commits
- Restricted Diffusion Commit
rMOZILLACENTRALc0922a11dfc9: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker… - Bugzilla Bug ID
- 1752120
Diff Detail
- Repository
- rMOZILLACENTRAL mozilla-central
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
This alone makes break some tests: https://treeherder.mozilla.org/logviewer?job_id=367398337&repo=try&lineNumber=2804
I assume we now run into a nullptr problem when WorkerGlobalScope objects are released.
dom/workers/WorkerScope.cpp | ||
---|---|---|
191–192 | It seems to happen regularly even in our tests, so while we would want to assert that we have gone away before the WorkerPrivate died, we want to be resilient here for now. |
That is because we schedule GC/CC here, https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2187.
What we can do is to run GC/CC before we make it as nullptr. Then we can have no de-ref checking on GC/CC functions.
Can you elaborate where we would trigger this GC/CC? Until we keep the reference data->mScope it would not get collected and once we nulled it we cannot null the mWorkerPrivate anymore as it can go away immediately? Am I missing something?
Wait, you are saying to just trigger the GC/CC before we leave DoRunLoop and not null out the mWorkerPrivate ? That might have some effect
dom/workers/WorkerPrivate.cpp | ||
---|---|---|
3072–3076 | I mean we can do https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2178-2197 here. what it should be RefPtr<WorkerGlobalScope> scope = data->mScope.forget(); ClearDebuggerEventQueue(); scope->mWorkerPrivate = nullptr; I have not yet tested in this way, but the idea is trying to call GC/CC before we set Scope->mWorkerPrivate as nullptr to support GC/CC can collect corresponding objects. Then after this WorkerGlobalScope should be invalid anymore. If there is something still associated on it we will hit assertion when we schedule another GC/CC at https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2187 |
dom/workers/WorkerPrivate.cpp | ||
---|---|---|
3072–3076 | And we can remove de-ref checking in cycle-collect related functions. |
dom/workers/WorkerPrivate.cpp | ||
---|---|---|
3072–3076 | Thanks for the detailed proposal. I just understood where my mental model was incomplete - if we hold the last strong reference to WorkerGlobalScope the traverse functions will not be called at all when we drop it. Still I am not sure we should remove the de-ref checking there. I'd just assert in addition. If this happens in release, nothing really bad is happening, IMHO, we just do the cleanup in the wrong order. |
dom/workers/WorkerPrivate.cpp | ||
---|---|---|
3085 | @edenchuang Was there a reason you had this before ClearMainEventQueue in your proposal? It feels more sound to me to do it afterwards, but I do not have any expertise here. |
Try: https://treeherder.mozilla.org/jobs?repo=try&revision=0990f6a91b8118c3b7bd80062c2e27fddacdfaa9
If we land this, at least we should be safe against UAF on WorkerPrivate through WorkerGlobalScopeobjects, and such attempts would result in a nullptr access. To be clear, we have no evidence or sign that such UAF really occur in the wild.
We will want a follow up bug to examine the order of cleanup during worker shutdown.
dom/workers/WorkerScope.cpp | ||
---|---|---|
192 | We cannot assert here, as this triggers regularly: https://treeherder.mozilla.org/jobs?repo=try&revision=fb129137d27337003618592b9791bed234989271 |
We cannot always expect to have a global scope object when the worker ends.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=77472a3e5839b27104c5509ee500ebf6d23def26
I think I know why the previous proposal doesn't work.
The reason is using RefPtrs to store WorkerGlobalScope. And it makes GC/CC not release WorkerGlobalScope as expected.
But if we don't use raw pointers instead of RefPtrs, WorkerGlobalScope will be invalid and can not be accessed after GC/CC in normal cases. Then means we can not call WorkerGlobalScope->mWorkerPrivate = nullptr;
I think this current patch could cause some memory leaks if at the moment we still have some TimeOut saved in the WorkerPrivate.
Because WorkerGlobalScope->mWorkerPrivate is set as null before GC/CC traverse to it.
Hmm, I'd have expected the GC to at least remove all other references to WorkerGlobalScope such that we free it when we null it. Such that after we drop it, there should be no reference left. Is there any WorkerPrivate member that might keep it alive indirectly such that the WorkerPrivate needs to die first?
dom/workers/WorkerPrivate.cpp | ||
---|---|---|
3086–3094 | @edenchuang If we just unlink the timeouts here, do we actually need the GC and clear queues here? We would just GC the global scope whenever suitable without problems. |
Restating:
Once we get into the killing state of DoRunLoop, we do:
- Cancel the timeouts
- Unlink the timeouts
- Unroot the global scope objects and null out their mWorkerPrivate pointers to avoid further accesses
This way the GC/CC of WorkerGlobalScopeBase objects can happen anytime.
- If it happens before we get into the killing state (which probably means we did never even run the worker), we unlink the timeouts there
- Otherwise we will find mWorkerPrivate nullptr and can just ignore it.
It feels like there should be a way to entirely remove the traverse/unlink on mWorkerPrivate from WorkerGlobalScopeBase, but that seems not obvious (to me) and goes beyond this patch, probably.
FWIW, this try run looks good: https://treeherder.mozilla.org/jobs?revision=ca4e6c92710f8571c191d44a3acd518c557758de&repo=try
I am okay with the patch, but we probably need a test for timeout lives during shutdown to make sure we are not leaking.
This looks quite good. I was looking at whether we need to call nsIGlobalObject::UnlinkObjectsInGlobal but it doesn't look like we do. Even a late global object cleanup should probably end up okay based on current logic.
Thank you!
This revision requires a Testing Policy Project Tag to be set before landing. Please apply one of testing-approved, testing-exception-unchanged, testing-exception-ui, testing-exception-elsewhere, testing-exception-other. Tip: this Firefox add-on makes it easy!
Re: the testing flags, I will leave that to @edenchuang as it sounds like he has something in mind for checking if there's potentially some kind of leak happening, but it's not immediately clear to me what that is. (Would that be setting a really long timeout then causing the worker to shutdown, possibly via self.close()?).
As discussed with @edenchuang I added a new variant of the (existing) test_clearTimouts that does not explicitely close the worker. The test produces:
0:45.95 EXPECTED-FAIL The author of the test has indicated that flaky timeouts are expected. Reason: untriaged ... 0 INFO TEST-START | Shutdown 1 INFO Passed: 3 2 INFO Failed: 0 3 INFO Todo: 1 4 INFO Mode: e10s 5 INFO SimpleTest FINISHED
I am not 100% sure if this is the expected result.