Monday, March 22, 2021

on automatic alerting

A few years ago I read the book "Thinking, Fast and Slow" by Daniel Kahneman. One of the things that struck a chord with me there was the story of how the smallest counties show both the highest and the lowest rates of disease, simply because their samples are small. This is very much the same as what we see with automatic alerting: we set up an alert for, say, request processing latency, and at night there is only one request in a given time period, it happens to have an anomalously high latency, and it triggers an alert. We then paper over this with kludges like "don't alert if there are fewer than 20 points".
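For concreteness, the kludge looks roughly like this (a sketch of my own; the function name, the units, and everything but the 20-point cutoff are illustrative, not from any real system):

```python
from statistics import median

def naive_latency_alert(latencies, threshold_ms, min_points=20):
    """The kludge: suppress the check when the window is too sparse,
    so a single slow 3 a.m. request cannot page anyone on its own."""
    if len(latencies) < min_points:
        return False  # too few points in this window: stay silent
    return median(latencies) > threshold_ms
```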

It looked like the number of points needs to be taken into account, with fewer points treated in a more relaxed way. But I couldn't figure out how to put it together.

Recently I've been reading some books on probability, and it turns out the solution is well-researched: we have to treat the points we see as a sample from a larger, full set of points. So if we get 1000 hits per unit of time during the day and only 50 hits at night, we can see these 50 hits as a sample of the full traffic, with the people sleeping serving as a filter. And then instead of asking "is the median over N" we should ask a different question: given the sample, what is the probability (i.e. confidence) that the median of the full set of data would have been over N? That should solve the issue with false alerts pretty nicely, and automatically. I wonder why nobody I've heard of has done this already.
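One standard tool that answers exactly this question is the sign test: under the hypothesis that the true median is at or below N, each observed point exceeds N with probability at most one half, so the count of exceedances in the sample behaves like fair-coin flips. If the observed count would be very surprising for a fair coin, we can be confident the true median really is above N. Here is a minimal sketch of that idea (my own function names and thresholds, using scipy's binomial distribution):

```python
from scipy.stats import binom

def median_exceeds(latencies, threshold_ms, confidence=0.99):
    """Sign test: given a sample of latencies, decide whether the
    median of the *full* population plausibly exceeds threshold_ms."""
    n = len(latencies)
    k = sum(1 for x in latencies if x > threshold_ms)
    # P(X >= k) for X ~ Binomial(n, 0.5); sf(k - 1) == P(X > k - 1).
    p_value = binom.sf(k - 1, n, 0.5)
    return p_value < 1 - confidence

# One slow request out of 50 at night: p-value is ~1, no alert.
night = [40] * 49 + [5000]
assert not median_exceeds(night, threshold_ms=100)

# Most of the window slow: the test fires even on just 50 points.
bad_night = [40] * 10 + [500] * 40
assert median_exceeds(bad_night, threshold_ms=100)
```

Note how this degrades gracefully: the same formula handles the 1000-point daytime window and the 50-point nighttime window, and it simply demands a more lopsided sample before alerting when there are fewer points, with no hand-tuned minimum-point cutoff at all.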
