Wednesday, May 15, 2024

unusual aggregations

I've been doing a sizable amount of SQL querying, and it turns out that modern systems have some new and interesting aggregation functions. For example, MAX_BY and MIN_BY return the value of one expression from the row where another expression is maximal or minimal. And there are even more creative functions if your system supports nested fields, array fields, or map fields.
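
A quick illustration of the MAX_BY / MIN_BY style, as found in systems like Trino, Spark, or DuckDB (the orders table and its columns here are made up):

SELECT
  customer_id,
  MAX_BY(item, price) AS priciest_item,  -- item from the row with the highest price
  MIN_BY(item, price) AS cheapest_item   -- item from the row with the lowest price
FROM orders
GROUP BY customer_id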

The other really nice bit of syntax is using the indexes of the output fields in the GROUP BY clause instead of the actual expressions - this avoids the need to write the same expressions twice. It could be made even better by allowing the names of the output fields instead of indexes.
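
For example (again with a made-up table), the ordinal form saves repeating the expression, and some dialects already accept the output names too:

SELECT
  DATE_TRUNC('month', order_ts) AS month,
  status,
  COUNT(*) AS cnt
FROM orders
GROUP BY 1, 2   -- instead of GROUP BY DATE_TRUNC('month', order_ts), status
-- and in some dialects even: GROUP BY month, status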

There are a few more aggregation functions that would be nice to have.

One is being able to specify aggregation within a group vs aggregation across all records. A frequent problem is to find the percentage of a group's record count relative to the whole set. This would be easy to express as (COUNT(*) * 100 / COUNT_GLOBAL(*)). And this could be generalized to nested aggregations (similarly to how Triceps allows the nested indexes) by specifying the nesting level. For example, if we have 4 fields in a GROUP BY clause, the generalized form COUNT_G(4, *) would be equivalent to COUNT(*), COUNT_G(0, *) would be the global count, and the intermediate values would aggregate over the larger groups formed by fewer fields.
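
For comparison, the way this usually gets written today is with window functions over the grouped result; a sketch with made-up table and column names, marking which window corresponds to which proposed nesting level:

SELECT
  region,
  product,
  COUNT(*) AS cnt,
  -- analog of the proposed COUNT_G(0, *): the count over the whole set
  COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS pct_of_total,
  -- analog of COUNT_G(1, *): the count over the larger group by region only
  COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY region) AS pct_of_region
FROM sales
GROUP BY region, product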

It would also be nice to have some simpler syntax for specifying two-pass querying. It can be done with joins, but writing those joins manually in plain SQL is very annoying. I'm not sure yet what a good syntax would be; for now I just have an idea of the semantics.

One example of that would be grouping by percentiles. On the first pass you'd compute the percentiles of some expression; on the second pass you'd group the records by where they fit between these percentiles. So if you compute the percentiles with a step of 10%, you'd get ten records with the summaries of those buckets. Well, if you store the records you work on (as in a CEP system), it doesn't even have to be two passes: it can easily be a single pass where you decide which bucket a record fits into and then adjust all the buckets accordingly.
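
Something close to this can already be written with a window function: NTILE() splits the rows into equal-sized buckets, which is roughly the same as grouping by 10% percentile boundaries (the table and column names here are made up):

WITH bucketed AS (
  SELECT t.*, NTILE(10) OVER (ORDER BY t.field) AS bucket
  FROM input_table t
)
SELECT
  bucket,
  COUNT(*) AS cnt,
  MIN(field) AS low,
  MAX(field) AS high,
  AVG(field) AS avg_field
FROM bucketed
GROUP BY bucket
ORDER BY bucket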

Another example of two-pass querying would be bringing the CEP windows into plain SQL, for the case where you want to identify some interesting records and then do an aggregation on the other related records that preceded them within a time interval. Again, if the records come in time order and are cached for that interval, it can be done in a single pass, but even if that is not available, two passes would do it in any circumstances. Very, very useful for time series analysis.
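
Here is roughly how the two-pass version can be written today as a self-join; the events table, its columns, and the 5-minute interval are all assumptions for illustration:

WITH interesting AS (
  SELECT id, ts
  FROM events
  WHERE is_interesting
)
SELECT
  i.id,
  COUNT(*) AS preceding_count,
  AVG(e.value) AS preceding_avg
FROM interesting i
JOIN events e
  ON e.ts >= i.ts - INTERVAL '5 minutes'
 AND e.ts < i.ts
GROUP BY i.id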

Or if you think about two-pass aggregation as being done with a map-reduce aggregator, the first pass would seed the mapping function and instantiate the reducers, while the second pass would feed the records into these reducers.

Note also that if you write such a join manually, you end up doing essentially a transposition of a normal query, where instead of writing a separate expression per output column, you write a row per value. Something like this (in pseudo-SQL, since remembering the exact syntax boggles me, I can only copy-paste fragments but not freely write them):

WITH input AS (
  SELECT ... -- the records to be analyzed
),
-- this is the transposition that propagates through
-- the rest of the query
percentiles (low, high) AS (
  VALUES (0, 20), (20, 40), (40, 60), (60, 80), (80, 100)
),
boundaries AS (
  SELECT
    p.high AS percent,
    PERCENTILE_CONT(p.low / 100.0) WITHIN GROUP (ORDER BY i.field) AS low,
    PERCENTILE_CONT(p.high / 100.0) WITHIN GROUP (ORDER BY i.field) AS high
  FROM percentiles p, input i
  GROUP BY p.low, p.high
)
SELECT
  boundaries.percent,
  ... -- whatever aggregation
FROM boundaries, input
WHERE input.field >= boundaries.low
  AND input.field < boundaries.high
GROUP BY boundaries.percent

It would be nice to have an easier way to do such transpositions too, even aside from the two-pass queries.

Another idea for processing time series is the ability to build a group from a set of records with sequential IDs or timestamps. Suppose we have time in slices, where an event might or might not be happening in each slice, and we want to consider an uninterrupted sequence of slices with events as one event. It's easy to build a join that would put together two consecutive events. But how about a hundred events? This would be easy to do with a group-building function that tracks the first and last timestamp in the group, checks whether a new record is consecutive, and merges the groups if needed. Note that this is not really an aggregation function. It's a new concept, a group-building function that replaces the fixed rule of all the records in a group having the same value of the grouping expressions. It would also work for building a window around an interesting record (with or without merging the close interesting records into the same window), where the window becomes the group.
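
The usual workaround today is the "gaps and islands" pattern: mark the start of each uninterrupted run with LAG(), then turn the marks into group IDs with a running sum. A sketch, assuming integer slice numbers and a made-up table, rather than the proposed group-building function:

WITH marked AS (
  SELECT
    slice_no,
    -- 0 if this slice continues the previous run, 1 if it starts a new one
    CASE WHEN slice_no - LAG(slice_no) OVER (ORDER BY slice_no) = 1
         THEN 0 ELSE 1 END AS new_group
  FROM event_slices
),
grouped AS (
  SELECT
    slice_no,
    SUM(new_group) OVER (ORDER BY slice_no) AS group_id
  FROM marked
)
SELECT
  group_id,
  MIN(slice_no) AS first_slice,
  MAX(slice_no) AS last_slice,
  COUNT(*) AS slice_count
FROM grouped
GROUP BY group_id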
