dom/workers/WorkerScope.cpp
191–192	It seems to happen regularly even in our tests, so while we would want to assert that we have gone away before the `WorkerPrivate` died, we want to be resilient here for now.

That is because we schedule GC/CC here, https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2187.
What we can do is run GC/CC before we set it to nullptr. Then we do not need de-ref checking in the GC/CC functions.

In D138442#4506731, @edenchuang wrote:

That is because we schedule GC/CC here, https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2187.
What we can do is run GC/CC before we set it to nullptr. Then we do not need de-ref checking in the GC/CC functions.

Can you elaborate on where we would trigger this GC/CC? As long as we keep the reference data->mScope, it would not get collected, and once we have nulled it, we cannot null the mWorkerPrivate anymore, as it can go away immediately. Am I missing something?

Wait, you are saying to just trigger the GC/CC before we leave DoRunLoop and not null out the mWorkerPrivate? That might have some effect.

edenchuang added inline comments.Feb 10 2022, 5:32 PM

dom/workers/WorkerPrivate.cpp
3072–3076	I mean we can do https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2178-2197 here. It should be:

    RefPtr<WorkerGlobalScope> scope = data->mScope.forget();
    RefPtr<WorkerDebuggerGlobalScope> debugScope = data->mDebuggerScope.forget();
    ClearDebuggerEventQueue();
    JS_GC(cx, JS::GCReason::WORKER_SHUTDOWN);
    ClearMainEventQueue(WorkerPrivate::WorkerRan);
    scope->mWorkerPrivate = nullptr;
    debugScope->mWorkerPrivate = nullptr;

I have not yet tested it this way, but the idea is to call GC/CC before we set Scope->mWorkerPrivate to nullptr, so that GC/CC can collect the corresponding objects. After this, the WorkerGlobalScope should no longer be valid. If something is still associated with it, we will hit an assertion when we schedule another GC/CC at https://searchfox.org/mozilla-central/rev/38652b98c6dd3bf42403eeb8c5305902b9a6e938/dom/workers/RuntimeService.cpp#2187

edenchuang added inline comments.Feb 10 2022, 5:35 PM

dom/workers/WorkerPrivate.cpp
3072–3076	And we can remove de-ref checking in cycle-collect related functions.

jstutte added inline comments.Feb 10 2022, 6:05 PM

dom/workers/WorkerPrivate.cpp
3072–3076	Thanks for the detailed proposal. I just understood where my mental model was incomplete - if we hold the last strong reference to `WorkerGlobalScope` the traverse functions will not be called at all when we drop it. Still I am not sure we should remove the de-ref checking there. I'd just assert in addition. If this happens in release, nothing really bad is happening, IMHO, we just do the cleanup in the wrong order.

jstutte planned changes to this revision.Feb 10 2022, 6:08 PM

jstutte requested review of this revision.Feb 10 2022, 6:32 PM

jstutte updated this revision to Diff 538398.

Harbormaster completed remote builds in B396596: Diff 538398.Feb 10 2022, 6:32 PM

Next try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb279a53a11d8dcfc24e3f3f999bc030150473f7

jstutte added inline comments.Feb 10 2022, 6:50 PM

dom/workers/WorkerPrivate.cpp
3085	@edenchuang Was there a reason you had this before `ClearMainEventQueue` in your proposal? It feels more sound to me to do it afterwards, but I do not have any expertise here.

jstutte planned changes to this revision.Feb 10 2022, 7:00 PM

jstutte updated this revision to Diff 538501.Feb 10 2022, 9:58 PM

jstutte retitled this revision from Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers to WIP: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends..

Harbormaster completed remote builds in B396675: Diff 538501.Feb 10 2022, 9:58 PM

jstutte planned changes to this revision.Feb 10 2022, 9:58 PM

jstutte requested review of this revision.Feb 11 2022, 7:26 AM

jstutte updated this revision to Diff 538661.

jstutte retitled this revision from WIP: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. to Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers.

Harbormaster completed remote builds in B396823: Diff 538661.Feb 11 2022, 7:26 AM

Try: https://treeherder.mozilla.org/jobs?repo=try&revision=0990f6a91b8118c3b7bd80062c2e27fddacdfaa9

If we land this, at least we should be safe against UAF on WorkerPrivate through WorkerGlobalScope objects, and such attempts would result in a nullptr access. To be clear, we have no evidence or sign that such UAFs really occur in the wild.
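
To illustrate the safety argument above, here is a minimal sketch in standard C++. It is not Gecko code: `Owner` and `Scope` are hypothetical stand-ins for `WorkerPrivate` and `WorkerGlobalScopeBase`, and `std::shared_ptr`/`std::unique_ptr` stand in for Gecko's reference counting. The point is only that a back-pointer cleared before the owner dies turns a potential dangling access into a checked nullptr.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for WorkerPrivate: owns state that the scope reads.
struct Owner {
  std::string name = "worker";
};

// Hypothetical stand-in for WorkerGlobalScopeBase: keeps a raw back-pointer.
class Scope {
 public:
  explicit Scope(Owner* aOwner) : mOwner(aOwner) {}

  // Called when the owner ends; later accesses see nullptr.
  void NoteOwnerGone() { mOwner = nullptr; }

  // Every access goes through a null check instead of a blind dereference.
  std::string OwnerNameOr(const std::string& aFallback) const {
    return mOwner ? mOwner->name : aFallback;
  }

 private:
  Owner* mOwner;  // raw, non-owning
};

// The scope outlives the owner, as a late-collected global might.
inline std::string NameAfterOwnerDied() {
  auto owner = std::make_unique<Owner>();
  auto scope = std::make_shared<Scope>(owner.get());
  scope->NoteOwnerGone();  // null the back-pointer first ...
  owner.reset();           // ... then let the owner go away
  return scope->OwnerNameOr("<gone>");
}
```

Without the `NoteOwnerGone()` call, the same access would read freed memory, which is the case the patch is meant to rule out.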

We will want a follow up bug to examine the order of cleanup during worker shutdown.

jstutte added inline comments.Feb 11 2022, 7:43 AM

dom/workers/WorkerScope.cpp
192	We cannot assert here, as this triggers regularly: https://treeherder.mozilla.org/jobs?repo=try&revision=fb129137d27337003618592b9791bed234989271

jstutte updated this revision to Diff 538668.Feb 11 2022, 8:28 AM

Harbormaster completed remote builds in B396827: Diff 538668.Feb 11 2022, 8:28 AM

We cannot always expect to have a global scope object when the worker ends.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=77472a3e5839b27104c5509ee500ebf6d23def26

I think I know why the previous proposal doesn't work.
The reason is that we use RefPtrs to hold the WorkerGlobalScope, which keeps GC/CC from releasing it as expected.
But if we use raw pointers instead of RefPtrs, the WorkerGlobalScope will be invalid and cannot be accessed after GC/CC in normal cases. That means we cannot call WorkerGlobalScope->mWorkerPrivate = nullptr;
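
The lifetime trade-off described here can be sketched in standard C++. This is only an analogy under stated assumptions: `std::shared_ptr` stands in for `RefPtr`, `Scope` is a hypothetical placeholder type, and no real GC/CC is modeled. A `std::weak_ptr` is used purely to observe whether the object is still alive.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Scope {};  // hypothetical stand-in for WorkerGlobalScope

// A local strong reference keeps the object alive even after the
// original holder gave it up (similar to forget() into a local RefPtr).
inline bool AliveWhileStronglyHeld() {
  auto root = std::make_shared<Scope>();
  std::weak_ptr<Scope> observer = root;            // watches the lifetime only
  std::shared_ptr<Scope> local = std::move(root);  // root is now empty
  return !observer.expired();                      // still alive via `local`
}

// With only a raw pointer left, the object is destroyed as soon as the
// last strong reference goes away, so the raw pointer must not be used.
inline bool AliveAfterOnlyRawPointerLeft() {
  auto root = std::make_shared<Scope>();
  std::weak_ptr<Scope> observer = root;
  Scope* raw = root.get();  // non-owning
  root.reset();             // last strong reference dropped
  (void)raw;                // dangling now; dereferencing would be a UAF
  return !observer.expired();
}
```

In the first case the object cannot be released while the local reference exists; in the second it is released, but nothing can safely be written through the remaining pointer.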

I think the current patch could cause memory leaks if we still have some timeouts saved in the WorkerPrivate at that moment,
because WorkerGlobalScope->mWorkerPrivate is set to null before GC/CC traverses it.

Yeah, that sounds plausible. Can we just call UnlinkTimeouts(); here?

In D138442#4509156, @edenchuang wrote:

The reason is that we use RefPtrs to hold the WorkerGlobalScope, which keeps GC/CC from releasing it as expected.

Hmm, I'd have expected the GC to at least remove all other references to WorkerGlobalScope, so that we free it when we null our reference and no reference is left after we drop it. Is there any WorkerPrivate member that might keep it alive indirectly, such that the WorkerPrivate needs to die first?

jstutte updated this revision to Diff 538684.Feb 11 2022, 9:52 AM

Harbormaster completed remote builds in B396839: Diff 538684.Feb 11 2022, 9:52 AM

jstutte updated this revision to Diff 538694.Feb 11 2022, 10:00 AM

Harbormaster completed remote builds in B396844: Diff 538694.Feb 11 2022, 10:00 AM

jstutte updated this revision to Diff 538697.Feb 11 2022, 10:07 AM

Harbormaster completed remote builds in B396845: Diff 538697.Feb 11 2022, 10:07 AM

Adjusted comments to look less alarming.

jstutte added inline comments.Feb 11 2022, 10:51 AM

dom/workers/WorkerPrivate.cpp
3086–3094	@edenchuang If we just unlink the timeouts here, do we actually need the GC and clear queues here? We would just GC the global scope whenever suitable without problems.

jstutte updated this revision to Diff 538742.Feb 11 2022, 12:10 PM

Harbormaster completed remote builds in B396889: Diff 538742.Feb 11 2022, 12:10 PM

Restating:

Once we get into the killing state of DoRunLoop, we do:

1. Cancel the timeouts
2. Unlink the timeouts
3. Unroot the global scope objects and null out their mWorkerPrivate pointers to avoid further accesses

This way the GC/CC of WorkerGlobalScopeBase objects can happen anytime.

- If it happens before we get into the killing state (which probably means we never even ran the worker), we unlink the timeouts there.
- Otherwise we will find mWorkerPrivate to be nullptr and can just ignore it.
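
The sequence restated above can be modeled with a small sketch in standard C++. It is not the patch itself: `Worker`, `Scope`, `Timeout`, `EnterKillingState` and `PeekScope` are hypothetical names, `std::shared_ptr` stands in for rooting and reference counting, and `Scope::Unlink()` only imitates a collection that may run at an arbitrary time.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Timeout { bool canceled = false; };

class Worker;  // hypothetical stand-in for WorkerPrivate

// Hypothetical stand-in for WorkerGlobalScopeBase.
class Scope {
 public:
  explicit Scope(Worker* aWorker) : mWorker(aWorker) {}
  void ClearWorker() { mWorker = nullptr; }
  bool HasWorker() const { return mWorker != nullptr; }
  void Unlink();  // imitates GC/CC unlinking the scope at any time

 private:
  Worker* mWorker;  // raw back-pointer, nulled when the worker ends
};

class Worker {
 public:
  Worker() : mScope(std::make_shared<Scope>(this)) {}

  void AddTimeout() { mTimeouts.push_back(std::make_shared<Timeout>()); }
  void CancelAllTimeouts() { for (auto& t : mTimeouts) t->canceled = true; }
  void UnlinkTimeouts() { mTimeouts.clear(); }
  std::size_t TimeoutCount() const { return mTimeouts.size(); }
  std::shared_ptr<Scope> PeekScope() const { return mScope; }

  // The three steps of the killing state, in the order listed above.
  std::shared_ptr<Scope> EnterKillingState() {
    CancelAllTimeouts();                               // 1. cancel
    UnlinkTimeouts();                                  // 2. unlink
    std::shared_ptr<Scope> scope = std::move(mScope);  // 3. unroot ...
    if (scope) scope->ClearWorker();                   //    ... and null out
    return scope;  // returned only so the sketch can inspect it afterwards
  }

 private:
  std::vector<std::shared_ptr<Timeout>> mTimeouts;
  std::shared_ptr<Scope> mScope;  // may be empty if no scope was created
};

inline void Scope::Unlink() {
  if (!mWorker) return;       // worker already ended: just ignore
  mWorker->UnlinkTimeouts();  // collected before the killing state
  mWorker = nullptr;
}
```

Both orders end with no timeouts held by the worker: either the killing state drops them and a later `Unlink()` is a no-op, or an early `Unlink()` drops them itself.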

It feels like there should be a way to entirely remove the traverse/unlink on mWorkerPrivate from WorkerGlobalScopeBase, but that seems not obvious (to me) and goes beyond this patch, probably.

FWIW, this try run looks good: https://treeherder.mozilla.org/jobs?revision=ca4e6c92710f8571c191d44a3acd518c557758de&repo=try

I am okay with the patch, but we probably need a test for timeouts that are still alive during shutdown to make sure we are not leaking.

This looks quite good. I was looking at whether we need to call nsIGlobalObject::UnlinkObjectsInGlobal but it doesn't look like we do. Even a late global object cleanup should probably end up okay based on current logic.

Thank you!

This revision is now accepted and ready to land.Feb 16 2022, 5:12 AM

This revision requires a Testing Policy Project Tag to be set before landing. Please apply one of testing-approved, testing-exception-unchanged, testing-exception-ui, testing-exception-elsewhere, testing-exception-other.

Re: the testing flags, I will leave that to @edenchuang as it sounds like he has something in mind for checking if there's potentially some kind of leak happening, but it's not immediately clear to me what that is. (Would that be setting a really long timeout and then causing the worker to shut down, possibly via self.close()?)

jstutte updated this revision to Diff 540342.Feb 16 2022, 12:43 PM

Harbormaster completed remote builds in B398205: Diff 540342.Feb 16 2022, 12:43 PM

As discussed with @edenchuang, I added a new variant of the (existing) test_clearTimeouts that does not explicitly close the worker. The test produces:

0:45.95 EXPECTED-FAIL The author of the test has indicated that flaky timeouts are expected.  Reason: untriaged
...
0 INFO TEST-START | Shutdown
1 INFO Passed:  3
2 INFO Failed:  0
3 INFO Todo:    1
4 INFO Mode:    e10s
5 INFO SimpleTest FINISHED

I am not 100% sure if this is the expected result.

Just wondering, does this fully replace D137792 or is it also needed?

In D138442#4522436, @saschanaz wrote:

Just wondering, does this fully replace D137792 or is it also needed?

No, it just makes the consequences of not having it less scary.

The test makes sense, thanks!

Closed by commit rMOZILLACENTRALc0922a11dfc9: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker… (authored by jstutte).Feb 16 2022, 5:04 PM

This revision was automatically updated to reflect the committed changes.

jstutte added a commit: rMOZILLACENTRALc0922a11dfc9: Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker….

phab-bot changed the visibility from "Custom Policy" to "Custom Policy".Feb 16 2022, 10:00 PM

phab-bot changed the edit policy from "Custom Policy" to "Custom Policy".

phab-bot added a project: Restricted Project.

phab-bot added a subscriber: pascalc.Feb 17 2022, 5:02 PM

jstutte added a commit: Restricted Diffusion Commit.Mar 8 2022, 10:22 AM

phab-bot removed a subscriber: dveditz.Mar 17 2022, 6:20 PM

saschanaz mentioned this in D137792: Bug 1764921 - Delay full worker shutdown until LockManagerChild is destructed r=asuth.Mar 21 2022, 5:49 PM

jstutte mentioned this in D141507: Bug 1756172: Unroot the global scopes only after the last worker event ran. r?#dom-worker-reviewers.Mar 21 2022, 6:26 PM

jstutte mentioned this in rMOZILLACENTRAL867ebd62b5eb: Bug 1756172: Unroot the global scopes only after the last worker event ran..Mar 25 2022, 3:41 PM

phab-bot added a subscriber: RyanVM.Apr 1 2022, 5:33 PM

phab-bot added a subscriber: Wolfbeast.Apr 8 2022, 7:16 AM

phab-bot changed the visibility from "Custom Policy" to "Public (No Login Required)".Aug 28 2022, 5:23 AM

phab-bot changed the edit policy from "Custom Policy" to "Restricted Project (Project)".

phab-bot removed projects: Restricted Project, Restricted Project, secure-revision.

Bug 1752120: Null out the mWorkerPrivate on WorkerGlobalScopeBase when a worker ends. r?#dom-worker-reviewers

Diff 540467

dom/workers/WorkerPrivate.cpp

dom/workers/WorkerScope.cpp

dom/workers/test/clearTimeoutsImplicit_worker.js

dom/workers/test/mochitest.ini

dom/workers/test/test_clearTimeoutsImplicit.html
