Page MenuHomePhabricator

Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers
ClosedPublic

Authored by jstutte on Feb 10 2022, 2:00 PM.
Referenced Files
F20710521: D138442.1747196207.diff
Tue, May 13, 4:16 AM
Unknown Object (File)
Sun, May 4, 8:02 AM
Unknown Object (File)
Thu, May 1, 4:34 AM
Unknown Object (File)
Thu, Apr 17, 7:08 AM
Unknown Object (File)
Wed, Apr 16, 7:46 PM
Unknown Object (File)
Apr 9 2025, 7:41 PM
Unknown Object (File)
Apr 1 2025, 5:26 AM
Unknown Object (File)
Mar 1 2025, 4:43 AM

Diff Detail

Repository
rMOZILLACENTRAL mozilla-central
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
phab-bot changed the visibility from "Custom Policy" to "Custom Policy".
phab-bot changed the edit policy from "Custom Policy" to "Custom Policy".
phab-bot added a project: Restricted Project.

This alone makes break some tests: https://treeherder.mozilla.org/logviewer?job_id=367398337&repo=try&lineNumber=2804

I assume we now run into a nullptr problem when WorkerGlobalScope objects are released.

jstutte updated this revision to Diff 538288.
jstutte updated this revision to Diff 538293.
dom/workers/WorkerScope.cpp
191–192

It seems to happen regularly even in our tests, so while we would want to assert that we have gone away before the WorkerPrivate died, we want to be resilient here for now.

That is because we schedule GC/CC here, https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2187.
What we can do is to run GC/CC before we make it as nullptr. Then we can have no de-ref checking on GC/CC functions.

That is because we schedule GC/CC here, https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2187.
What we can do is to run GC/CC before we make it as nullptr. Then we can have no de-ref checking on GC/CC functions.

Can you elaborate where we would trigger this GC/CC? Until we keep the reference data->mScope it would not get collected and once we nulled it we cannot null the mWorkerPrivate anymore as it can go away immediately? Am I missing something?

Wait, you are saying to just trigger the GC/CC before we leave DoRunLoop and not null out the mWorkerPrivate ? That might have some effect

dom/workers/WorkerPrivate.cpp
3072–3076

I mean we can do https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2178-2197 here.

what it should be

RefPtr<WorkerGlobalScope> scope = data->mScope.forget();
RefPtr<WorkerDebuggerGlobalScope> debugScope = data->mDebuggerScope.forget();

ClearDebuggerEventQueue();
JS_GC(cx, JS::GCReason::WORKER_SHUTDOWN);
ClearMainEventQueue(WorkerPrivate::WorkerRan);

scope->mWorkerPrivate = nullptr;
debugScope->mWorkerPrivate = nullptr;

I have not yet tested in this way, but the idea is trying to call GC/CC before we set Scope->mWorkerPrivate as nullptr to support GC/CC can collect corresponding objects.

Then after this WorkerGlobalScope should be invalid anymore. If there is something still associated on it we will hit assertion when we schedule another GC/CC at https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2187

dom/workers/WorkerPrivate.cpp
3072–3076

And we can remove de-ref checking in cycle-collect related functions.

dom/workers/WorkerPrivate.cpp
3072–3076

Thanks for the detailed proposal. I just understood where my mental model was incomplete - if we hold the last strong reference to WorkerGlobalScope the traverse functions will not be called at all when we drop it.

Still I am not sure we should remove the de-ref checking there. I'd just assert in addition. If this happens in release, nothing really bad is happening, IMHO, we just do the cleanup in the wrong order.

jstutte updated this revision to Diff 538398.
dom/workers/WorkerPrivate.cpp
3085

@edenchuang Was there a reason you had this before ClearMainEventQueue in your proposal? It feels more sound to me to do it afterwards, but I do not have any expertise here.

jstutte retitled this revision from Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers to WIP: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends..
jstutte updated this revision to Diff 538661.
jstutte retitled this revision from WIP: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. to Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers.

Try: https://treeherder.mozilla.org/jobs?repo=try&revision=0990f6a91b8118c3b7bd80062c2e27fddacdfaa9

If we land this, at least we should be safe against UAF on WorkerPrivate through WorkerGlobalScopeobjects, and such attempts would result in a nullptr access. To be clear, we have no evidence or sign that such UAF really occur in the wild.

We will want a follow up bug to examine the order of cleanup during worker shutdown.

dom/workers/WorkerScope.cpp
192

I think I know why the previous proposal doesn't work.
The reason is using RefPtrs to store WorkerGlobalScope. And it makes GC/CC not release WorkerGlobalScope as expected.
But if we don't use raw pointers instead of RefPtrs, WorkerGlobalScope will be invalid and can not be accessed after GC/CC in normal cases. Then means we can not call WorkerGlobalScope->mWorkerPrivate = nullptr;

I think this current patch could cause some memory leaks if at the moment we still have some TimeOut saved in the WorkerPrivate.
Because WorkerGlobalScope->mWorkerPrivate is set as null before GC/CC traverse to it.

Yeah, that sounds plausible. Can we just call UnlinkTimeouts(); here?

The reason is using RefPtrs to store WorkerGlobalScope. And it makes GC/CC not release WorkerGlobalScope as expected.

Hmm, I'd have expected the GC to at least remove all other references to WorkerGlobalScope such that we free it when we null it. Such that after we drop it, there should be no reference left. Is there any WorkerPrivate member that might keep it alive indirectly such that the WorkerPrivate needs to die first?

Adjusted comments to look less alarming.

dom/workers/WorkerPrivate.cpp
3086–3094

@edenchuang If we just unlink the timeouts here, do we actually need the GC and clear queues here? We would just GC the global scope whenever suitable without problems.

Restating:

Once we get into the killing state of DoRunLoop, we do:

  • Cancel the timeouts
  • Unlink the timeouts
  • Unroot the global scope objects and null out their mWorkerPrivate pointers to avoid further accesses

This way the GC/CC of WorkerGlobalScopeBase objects can happen anytime.

  • If it happens before we get into the killing state (which probably means we did never even run the worker), we unlink the timeouts there
  • Otherwise we will find mWorkerPrivate nullptr and can just ignore it.

It feels like there should be a way to entirely remove the traverse/unlink on mWorkerPrivate from WorkerGlobalScopeBase, but that seems not obvious (to me) and goes beyond this patch, probably.

I am okay with the patch, but we probably need a test for timeout lives during shutdown to make sure we are not leaking.

This looks quite good. I was looking at whether we need to call nsIGlobalObject::UnlinkObjectsInGlobal but it doesn't look like we do. Even a late global object cleanup should probably end up okay based on current logic.

Thank you!

This revision is now accepted and ready to land.Feb 16 2022, 5:12 AM

This revision requires a Testing Policy Project Tag to be set before landing. Please apply one of testing-approved, testing-exception-unchanged, testing-exception-ui, testing-exception-elsewhere, testing-exception-other. Tip: this Firefox add-on makes it easy!

Re: the testing flags, I will leave that to @edenchuang as it sounds like he has something in mind for checking if there's potentially some kind of leak happening, but it's not immediately clear to me what that is. (Would that be setting a really long timeout then causing the worker to shutdown, possibly via self.close()?).

As discussed with @edenchuang I added a new variant of the (existing) test_clearTimouts that does not explicitely close the worker. The test produces:

0:45.95 EXPECTED-FAIL The author of the test has indicated that flaky timeouts are expected.  Reason: untriaged
...
0 INFO TEST-START | Shutdown
1 INFO Passed:  3
2 INFO Failed:  0
3 INFO Todo:    1
4 INFO Mode:    e10s
5 INFO SimpleTest FINISHED

I am not 100% sure if this is the expected result.

Just wondering, does this fully replace D137792 or is it also needed?

Just wondering, does this fully replace D137792 or is it also needed?

No, it just makes the consequences of not having it less scary.

phab-bot changed the edit policy from "Custom Policy" to "Custom Policy".
phab-bot added a project: Restricted Project.
jstutte added a commit: Restricted Diffusion Commit.Mar 8 2022, 10:22 AM
phab-bot changed the visibility from "Custom Policy" to "Public (No Login Required)".Aug 28 2022, 5:23 AM
phab-bot changed the edit policy from "Custom Policy" to "Restricted Project (Project)".
phab-bot removed projects: Restricted Project, Restricted Project, secure-revision.