Page MenuHomePhabricator

Bug 1667245 - collect telemetry when we would have shown the slow script modal dialog for parent process hangs, r?mccr8!,esmyth!
ClosedPublic

Authored by Gijs on Sep 29 2020, 2:13 PM.

Diff Detail

Repository
rMOZILLACENTRAL mozilla-central
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

phab-bot changed the visibility from "Custom Policy" to "Public (No Login Required)".
phab-bot changed the edit policy from "Custom Policy" to "Restricted Project (Project)".
phab-bot removed a project: secure-revision.

@esmyth not looking for a review of the code here, just making you aware of what I'm implementing.

Specifically, I'm collecting data when the hang resolves, ie when control returns to the event loop after being "stuck" in a JS task. At that point, I'm checking if we've spent more than 10s on something (which would have triggered the dialog) and if so, recording the hang duration and that it's a browser script (nothing else should run in the parent, in theory). I'm only collecting data if e10s is turned on (because otherwise these could be content hangs as they'd be in the same process).

The other keys from the events dictionary:

end_reason: Why the warning was hidden (user action, the process becoming responsive again, the browser quitting, etc.)
wait_count: How many times the user elected to wait.
n_tab_deselect: How many times the user switched away from a tab affected by this hang.
uptime: How long the notification was up (ms).

don't apply to the browser case - there's no warning/notification so the user cannot change tabs or explicitly select to "wait". Those items won't be in the dictionary for these events - I assume that's OK.

js/xpconnect/src/XPCJSContext.cpp
807

I cribbed this from the other C++ telemetry event consumer I found at https://searchfox.org/mozilla-central/rev/f27594d62e7f1d57626889255ce6a3071d67209f/dom/security/nsContentSecurityUtils.cpp#583-586 . Unsure if we need an atomic here? Also, the code there did:

if (!atomic.exchange(true)) {
  atomic = true;
}

which looked strange / wrong. AFAICT exchange should already set the value to true? So I dropped the atomic = true bit. LMK if I'm misunderstanding how this is supposed to work...

js/xpconnect/src/XPCJSContext.cpp
1410

Also, I think this actually does something subtly different from using NowLoRes() ? I got strange results when comparing mozilla::TimeStamp() and the mSlowScriptCheckpoint -- all the other assignments use NowLoRes, see https://searchfox.org/mozilla-central/search?q=mSlowScriptCheckpoint%20%3D . I think the assignment here normally gets clobbered by the one in BeforeProcessTask. Anyway, I switched it to NowLoRes which is consistent with the other sites.

js/xpconnect/src/XPCJSContext.cpp
1434

@esmyth Oh right, and the cut-off I'm using is 10s -- which is what we use for content process hangs; I could switch to 20s (or any other number really) given that we used to use 20s before we disabled the dialog altogether in D84396 .

esmyth added inline comments.
js/xpconnect/src/XPCJSContext.cpp
1434

I'd prefer to use 10s for the cutoff. I don't know the reasoning behind the original 20s threshold, but 10s seems more appropriate given the thresholds for user perceptions of time. If we decide later that we only care about 20s hangs, we can always filter on the telemetry, but at least we'll have it.

mccr8 requested changes to this revision.Sep 29 2020, 10:30 PM

I'll give the disclaimer that I've never looked at this code before, so my review is based on reading over code right now. Let me know if I'm wrong about anything here.

Also, I have some vague idea that PAC JS runs in the parent, and is content JS. I'm not sure if that matters or not.

I haven't looked into what this telemetry event thing is, as I've kind of hit my limit for reading new code for the moment. I'll look at it in my next review. Hopefully there's nothing tricky going on.

There's no testing here. Is there some kind of test we could write? If not, what should the exemption be under the new testing policy?

js/xpconnect/src/XPCJSContext.cpp
807

This is only ever set from XPCJSContext, so it should only ever be set on the main thread so this shouldn't need to be atomic. The CSP stuff is getting called from channels, so maybe there's a way it can happen on multiple threads. (I see what you mean about the exchange, but I'm not going to try to think too deeply about it because I think we don't need it here.)

1410

I don't understand this change. TimeStamp() is the null TimeStamp. It is completely different than NowLoRes(). Isn't the idea that we're done running JS, so we clear out the checkpoint state to the initial state? When were you comparing mozilla::TimeStamp() and mSlowScriptCheckpoint?

1432

How can mSlowScriptCheckpoint be null here? Is there some way we can call AfterProcessTask without calling BeforeProcessTask beforehand? Also, with your change below mSlowScriptCheckpoint will only be null when we first create the XPCJSContext.

1434

Please define and use a constant instead of having the literal 10.

1434

I don't think this check is strong enough. It needs to somehow use the same mechanism as mSlowScriptSecondHalf to avoid recording telemetry when the user puts their computer to sleep while JS is running. Note that you can't just check that variable because dom.max_chrome_script_run_time is zero so mSlowScriptSecondHalf will never be true here if we were running chrome JS. Maybe InterruptCallback could be modified to set mSlowScriptSecondHalf in the case of chrome JS, but hard code that we never want to actually limit chrome script run time? The pref name wouldn't really be accurate in that case.

1438

Please use an |auto| on the left, given that the RHS makes the type clear.

This revision now requires changes to proceed.Sep 29 2020, 10:30 PM
Gijs marked an inline comment as done.Oct 2 2020, 3:41 PM

I'll give the disclaimer that I've never looked at this code before, so my review is based on reading over code right now. Let me know if I'm wrong about anything here.

Also, I have some vague idea that PAC JS runs in the parent, and is content JS. I'm not sure if that matters or not.

It sort of does in that it'll use the content preferences to still show the awful modal dialog thingy, because the subject principal for pac script isn't going to be system (well, at least I sure hope not).

I haven't looked into what this telemetry event thing is, as I've kind of hit my limit for reading new code for the moment. I'll look at it in my next review. Hopefully there's nothing tricky going on.

There's no testing here. Is there some kind of test we could write? If not, what should the exemption be under the new testing policy?

I think the unchanged one works as this only affects telemetry, not users directly. I could try to add some testing, though in that case I'd want to make the 10s cutoff a pref so we don't have to actually hang the parent process for 10s (plus we'd have to deal with timing issues)...

js/xpconnect/src/XPCJSContext.cpp
1410

I didn't realize that the default constructor of mozilla::TimeStamp() just creates a null instance; this is different from how JS Date works...

1432

The InterruptCallback null-checks, which is why I did the same, out of an abundance of caution.

Gijs marked an inline comment as done.
Gijs marked 3 inline comments as done.Oct 2 2020, 7:18 PM

I'll give the disclaimer that I've never looked at this code before, so my review is based on reading over code right now. Let me know if I'm wrong about anything here.

Also, I have some vague idea that PAC JS runs in the parent, and is content JS. I'm not sure if that matters or not.

It sort of does in that it'll use the content preferences to still show the awful modal dialog thingy, because the subject principal for pac script isn't going to be system (well, at least I sure hope not).

It looks like (cf. https://mozilla.slack.com/archives/C495VS94P/p1601650904064100 ) the pac stuff does not use these hooks, and also has no principals.

However, I tested with a custom sandbox in the parent process initialized with a null principal, and using sandbox.eval("[while busy loop for 10s]") and the current patch now correctly categorizes that as "content" URI type hang in the parent process - thanks for flagging that up.

I think the unchanged one works as this only affects telemetry, not users directly. I could try to add some testing, though in that case I'd want to make the 10s cutoff a pref so we don't have to actually hang the parent process for 10s (plus we'd have to deal with timing issues)...

I think testing telemetry isn't too common, but given that I think that the patch as submitted wouldn't work in at least the "computer sleep case" it would be nice to have some testing. I'm not sure how to write such a test without some kind of invasive machinery, so maybe it isn't worth landing an automated test, but I think it is at least worth thinking about.

js/xpconnect/src/XPCJSContext.cpp
1432

The InterruptCallback has to deal with any place where we can run JS, so we may not have run BeforeProcessTask, but I think we should be able to rely on AfterProcessTask only running if BeforeProcessTask has run.

Is this ready for re-review? I saw that the red x was still there, but it is showing up in my review queue so I'm not sure.

Gijs planned changes to this revision.Oct 6 2020, 11:37 PM

Is this ready for re-review? I saw that the red x was still there, but it is showing up in my review queue so I'm not sure.

It was, but then we had the discussion around tests. I have a test ready, though it's a bit ugly -- but I need to remove some printfs and manually check your point around machine sleep stuff, and it's late so I'll do it tomorrow. I'll take this out of your queue for now.

I've added a test with relatively minor changes. I also tried to manually test that things work when going to sleep. At least on macOS, the machine refuses to sleep while we're in a parent process slow script busy-loop (it goes to sleep once the loop finishes) so depending on how you want to define things, 'it worked' or I couldn't test this...

mccr8 added a project: testing-approved.

Thanks for writing the test.

I guess it makes sense that the OS detects a hang and refuses to sleep. I've seen similar things on Windows. Hopefully this code isn't needed, and then it doesn't matter if it is right or wrong.

js/xpconnect/src/XPCJSContext.cpp
575

Please use a static const variable instead of a macro.

695

Please name the test here.

This revision is now accepted and ready to land.Oct 7 2020, 6:33 PM