Calculating Percentiles on Streaming Data Part 4: JavaScript Library

This is part 4 of my series on calculating percentiles on streaming data.

I have created a reusable JavaScript library containing my implementations of the streaming percentile algorithms discussed in this blog post series, and published it to GitHub and NPM. Here’s what using it looks like:

var sp = require('streaming-percentiles');

// epsilon is allowable error.  As epsilon becomes smaller, the
// accuracy of the approximations improves, but the class consumes
// more memory.
var epsilon = 0.1;
var gk = new sp.GK(epsilon);
for (var i = 0; i < 1000; ++i)
    gk.insert(Math.random());
var p50 = gk.quantile(0.5); // Approx. median
var p95 = gk.quantile(0.95); // Approx. 95th percentile

You can find it on GitHub and on NPM under the package name streaming-percentiles.

Calculating Percentiles on Streaming Data Part 3: Visualizing Greenwald-Khanna

This is part 3 of my series on calculating percentiles on streaming data.

In an effort to better understand the Greenwald-Khanna [GK01] algorithm, I created a series of visualizations of the cumulative distribution functions of a randomly-generated, normally-distributed data set with \mu = 0 and \sigma = 1, as the number of random numbers n increases from 1 to 1,000.

The way to read these visualizations is to find the percentile you are looking for on the y-axis, then trace horizontally to find the vertical line on the chart which intersects this location, then read the value from the x-axis.

[Figures: CDF visualizations of the data set: exact, and Greenwald-Khanna with \epsilon = 0.1, \epsilon = 0.05, and \epsilon = 0.01.]

From these visualizations, it is clear how the “resolution” of Greenwald-Khanna increases as \epsilon decreases, and how the COMPRESS operation keeps the number of elements in the summary data set \mathsf{S} relatively small as n increases.

References

  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.

Calculating Percentiles on Streaming Data Part 2: Notes on Implementing Greenwald-Khanna

This is part 2 of my series on calculating percentiles on streaming data.

The most famous algorithm for calculating percentiles on streaming data appears to be Greenwald-Khanna [GK01]. I spent a few days implementing the Greenwald-Khanna algorithm from the paper and I discovered a few things I wanted to share.

Insert Operation

The insert operation is defined in [GK01] as follows:

INSERT(v)
Find the smallest i, such that v_{i-1} \leq v < v_i, and insert the tuple t_x = (v_x, g_x, \Delta_x) = (v, 1, \lfloor 2 \epsilon n \rfloor), between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the maximum observation seen, then insert (v, 1, 0).

However, when I implemented this operation, I noticed via testing that there were certain queries I could not fulfill. For example, consider applying Greenwald-Khanna with \epsilon = 0.1 to the sequence of values \{11, 20, 18, 5, 12, 6, 3, 2\}, and then apply QUANTILE(\phi = 0.5). This means that r = \lceil \phi n \rceil = \lceil 0.5 \times 8 \rceil = 4 and \epsilon n = 0.1 \times 8 = 0.8. There is no i that satisfies both r - r_{min}(v_i) \leq \epsilon n and r_{max}(v_i) - r \leq \epsilon n, as you can see below:

| i | t_i | r_{min} | r_{max} | r - r_{min} | r_{max} - r | r - r_{min} \leq \epsilon n? | r_{max} - r \leq \epsilon n? |
|---|---------|---|---|----|----|---|---|
| 0 | (2,1,0) | 1 | 1 | 3 | -3 | F | T |
| 1 | (3,1,0) | 2 | 2 | 2 | -2 | F | T |
| 2 | (5,1,0) | 3 | 3 | 1 | -1 | F | T |
| 3 | (6,1,1) | 4 | 5 | 0 | 1 | T | F |
| 4 | (11,1,0) | 5 | 5 | -1 | 1 | T | F |
| 5 | (12,1,0) | 6 | 6 | -2 | 2 | T | F |
| 6 | (18,1,0) | 7 | 7 | -3 | 3 | T | F |
| 7 | (20,1,0) | 8 | 8 | -4 | 4 | T | F |
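For reference, here is a minimal JavaScript sketch of the QUANTILE(\phi) lookup that produced the table above. It assumes the summary is an array of (v, g, \Delta) tuples sorted by v; the representation and names are mine, not code from [GK01]:

// Sketch of QUANTILE(phi): scan for a tuple whose rank bounds are
// both within epsilon * n of the desired rank r = ceil(phi * n).
// summary.tuples is an array of { v, g, delta } sorted by v.
function quantile(summary, epsilon, phi) {
    var n = summary.n;
    var r = Math.ceil(phi * n);
    var rmin = 0;
    for (var i = 0; i < summary.tuples.length; ++i) {
        var t = summary.tuples[i];
        rmin += t.g;                // r_min(v_i) = sum of g_j for j <= i
        var rmax = rmin + t.delta;  // r_max(v_i) = r_min(v_i) + delta_i
        if (r - rmin <= epsilon * n && rmax - r <= epsilon * n)
            return t.v;
    }
    return undefined; // the failure mode described above: no tuple qualifies
}

Running this against the summary in the table returns undefined, which is exactly the problem.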

I believe the fundamental problem is that the definition of INSERT(v) fails to maintain the key invariant \max_i(g_i + \Delta_i) \leq 2 \epsilon n. To correct this issue, the Greenwald-Khanna INSERT(v) operation must be modified as follows:

INSERT(v)*
Find the smallest i, such that v_{i-1} \leq v < v_i, and insert the tuple t_x = (v_x, g_x, \Delta_x) = (v, 1, \lfloor 2 \epsilon n \rfloor - 1), between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the maximum observation seen, then insert (v, 1, 0). Also, the first 1/(2 \epsilon) elements must be inserted with \Delta_i = 0.

I found the above modification from Prof. Chandra Chekuri’s lecture notes for his class CS 598CSC: Algorithms for Big Data. I believe this modification will maintain the above invariant after 1/(2 \epsilon) insertions.
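For concreteness, here is a minimal sketch of INSERT(v)* in JavaScript, under the same assumed representation as above (an array of { v, g, delta } tuples plus a running count n); it is illustrative, not the paper’s code:

// Sketch of the modified INSERT(v)*: new minimum/maximum observations
// and the first ceil(1 / (2 * epsilon)) observations get delta = 0;
// everything else gets floor(2 * epsilon * n) - 1.
function insert(summary, epsilon, v) {
    var tuples = summary.tuples;

    // Find the smallest i such that v_{i-1} <= v < v_i.
    var i = 0;
    while (i < tuples.length && tuples[i].v <= v)
        ++i;

    var delta;
    if (i === 0 || i === tuples.length) {
        delta = 0; // v is the new minimum or maximum observation
    } else if (summary.n < Math.ceil(1 / (2 * epsilon))) {
        delta = 0; // one of the first 1/(2*epsilon) observations
    } else {
        delta = Math.floor(2 * epsilon * summary.n) - 1; // the modified rule
    }

    tuples.splice(i, 0, { v: v, g: 1, delta: delta });
    ++summary.n;
}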

Band Construction

Banding in [GK01] refers to the grouping together of possible values of \Delta for purposes of merging in the COMPRESS operation. The paper defines banding as the following:

BAND(\Delta, 2 \epsilon n)
For \alpha from 1 to \lceil \log 2 \epsilon n \rceil, we let p = \lfloor 2 \epsilon n \rfloor and we define \mathrm{band}_\alpha to be the set of all \Delta such that p - 2^\alpha - (p \mod 2^\alpha) < \Delta \leq p - 2^{\alpha - 1} - (p \mod 2^{\alpha - 1}). We define \mathrm{band}_0 to simply be p. As a special case, we consider the first 1/2\epsilon observations, with \Delta = 0, to be in a band of their own. We will denote by \mathrm{band}(t_i, n) the band of \Delta_i at time n, and by \mathrm{band}_\alpha(n) all tuples (or equivalently, the \Delta values associated with these tuples) that have a band value of \alpha.

Here are some things I found when implementing banding:

  • It is important to note that in the above \log refers to the base 2 logarithm of a number.
  • I found it useful to construct an array, bands, at the beginning of the COMPRESS operation, such that bands[\Delta] is the band for the provided \Delta.
  • I used a very large constant to denote the band of \Delta = 0, as this made the tree-building operation simpler.
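Here is a minimal sketch of the bands-array construction described in the last two bullets. The function name and the use of Number.MAX_SAFE_INTEGER as the “very large constant” are my own choices:

// Sketch of precomputing bands[delta] for 0 <= delta <= p, where
// p = floor(2 * epsilon * n), following the BAND definition above.
function constructBands(epsilon, n) {
    var p = Math.floor(2 * epsilon * n);
    var bands = new Array(p + 1);
    bands[p] = 0; // band_0 consists solely of delta = p
    bands[0] = Number.MAX_SAFE_INTEGER; // delta = 0 is a band of its own
    for (var alpha = 1; alpha <= Math.ceil(Math.log2(2 * epsilon * n)); ++alpha) {
        // band_alpha: p - 2^alpha - (p mod 2^alpha) < delta
        //             <= p - 2^(alpha-1) - (p mod 2^(alpha-1))
        var lower = p - Math.pow(2, alpha) - (p % Math.pow(2, alpha));
        var upper = p - Math.pow(2, alpha - 1) - (p % Math.pow(2, alpha - 1));
        for (var delta = lower + 1; delta <= upper; ++delta) {
            if (delta > 0 && delta < p)
                bands[delta] = alpha;
        }
    }
    return bands;
}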

Compress

Compression in [GK01] is defined as follows:

COMPRESS()
for i from s - 2 to 0 do …

However, with this definition, it is possible for the first (0th) tuple to be deleted, which would then violate the following invariant of the summary data structure: “We ensure that, at all times, the maximum and the minimum values are part of the summary.”

Therefore, I modified COMPRESS() to go from s - 2 to 1 so that the first tuple is never deleted and the invariant is maintained.
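In code, the change is only to the loop’s lower bound. A skeletal sketch, with the merge logic elided just as it is in the excerpt above:

// Sketch of the modified COMPRESS() loop bounds: stopping at i = 1
// instead of i = 0 means tuples[0] (the minimum) is never merged away.
function compress(summary, epsilon) {
    var tuples = summary.tuples;
    for (var i = tuples.length - 2; i >= 1; --i) {
        // ... merge tuples[i] into tuples[i + 1] when the band and
        // g_i + g_{i+1} + delta_{i+1} < 2 * epsilon * n conditions
        // from [GK01] hold ...
    }
}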

References

  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.

Calculating Percentiles on Streaming Data Part 1: Introduction

This is part 1 of my series on calculating percentiles on streaming data.

Suppose that you are dealing with a system which processes one million requests per second, and you’d like to calculate the median response time over the last 24 hours.

The naive approach would be to store every response time, sort them all, and then return the value in the middle. Unfortunately, this approach would require manipulating 1,000,000 * 60 * 60 * 24 = 86.4 billion values, almost certainly too many to fit into RAM and thus rather unwieldy to work with. This raises the question: is it possible to compute quantiles without storing every observation?
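For contrast with the approximate algorithms discussed below, here is a sketch of the naive exact approach in JavaScript (names are mine):

// Naive exact quantile: store every observation, sort, and index.
// Requires O(n) memory, which is exactly what we cannot afford here.
function exactQuantile(values, phi) {
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var r = Math.max(1, Math.ceil(phi * sorted.length)); // rank, 1-based
    return sorted[r - 1];
}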

Munro and Paterson [MP80] proved that \Omega(n) space is required to compute the exact median of n values in a single pass. However, if we’re allowed to compute an approximation, there is a family of algorithms that can process the data in a single pass and thus do not need to store every observation. Notable algorithms with this property include:

  1. Manku, Rajagopalan, and Lindsay [MRL98] — single-pass, but an upper bound on n must be known a priori; seems generally superseded by [GK01]
  2. Manku, Rajagopalan, and Lindsay [MRL99] — randomized algorithm
  3. Greenwald-Khanna [GK01] — single-pass, improves on the space bounds of [MRL98], and removes the requirement that n is known in advance
  4. Gilbert, Kotidis, Muthukrishnan, and Strauss [GKMS02] — randomized algorithm, supports deletes as well as inserts
  5. Cormode and Muthukrishnan [CM04] — improves on the space bounds of [GKMS02]
  6. Cormode, Korn, Muthukrishnan, and Srivastava [CKMS05] — improves [GK01] by better handling skewed distributions when finding targeted quantiles
  7. Ted Dunning’s t-digest

The above list is intended to be representative, not exhaustive.

In this blog post series, I will explore various methods for calculating percentiles on streaming data.

References

  • [CKMS05] G. Cormode, F. Korn, S. Muthukrishnan and D. Srivastava. Effective Computation of Biased Quantiles over Data Streams. In Proceedings of the 21st International Conference on Data Engineering, pages 20-31, 2005.
  • [CM04] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 2004.
  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.
  • [GKMS02] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of 28th International Conference on Very Large Data Bases, pages 454–465, 2002.
  • [MP80] J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:315–323, 1980.
  • [MRL98] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of ACM SIGMOD, pages 426–435, 1998.
  • [MRL99] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proceedings of ACM SIGMOD, volume 28(2) of SIGMOD Record, pages 251–262, 1999.

Visualizing Latency Part 4: Official D3 Latency Heatmap Page

This post is part 4 of my series about visualizing latency, which is very useful for debugging certain classes of performance problems.

Allow me to wrap up my visualizing latency post series by noting that my official D3 latency heatmap repository is at https://github.com/sengelha/d3-latency-heatmap/. Monitor this repository for future developments to the D3 latency heatmap chart.

Visualizing Latency Part 3: Rendering Event Data

This post is part 3 of my series about visualizing latency, which is very useful for debugging certain classes of performance problems.

Now that I have introduced the D3 latency heatmap chart component and explained what binning is, I can discuss the primary use case of the chart: rendering event data.

What is event data?

First, I must explain what I mean by event data. For a fuller treatment, please read Analytics For Hackers: How To Think About Event Data, but allow me to summarize: Event data describes actions performed by entities. It has three key pieces of information: action, timestamp, and state. It is typically rich, denormalized, nested, schemaless, append-only, and frequently extremely large. Some examples of event data include system log records, financial market trades, crime records, or user activities within an application.

When I created the D3 latency heatmap chart component, my primary use case was to be able to visualize the latency of a queue-based report generation system. This system logs every single report generation event, along with important data such as start time and duration, to a table inside a SQL server. I imagine there are thousands (or millions) of systems storing event data into SQL tables — and there’s absolutely nothing wrong with that — but event data is also frequently stored in append-only files on filesystems, in stream processing systems like Apache Kafka, or in distributed databases like Apache Cassandra.
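To make this concrete, here is roughly what one of those report generation events might look like as a JavaScript object. The field names and values are hypothetical:

// A hypothetical report-generation event: an action, a timestamp,
// and a bag of state describing what happened.
var event = {
    action: 'report.generated',
    timestamp: '2018-03-01T14:23:05Z',  // start time
    state: {
        reportId: 12345,
        userId: 'u-67890',
        durationMs: 8452                // how long generation took
    }
};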

When rendering event data, the key decisions are around binning:

  1. What sizes should the bins have?
  2. How should binning be implemented?
  3. Where should binning be performed?

What sizes should the bins have?

This question was discussed in my blog post where I explained what binning is. The short answer is “it depends on your chart size and data distribution.” As you create your chart, be prepared to try several different bin sizes.

How should binning be implemented?

Let’s explore a few common alternatives for implementing binning.

SQL

The process of binning is conceptually equivalent to the process of a SQL GROUP BY, which leads to one possible implementation: in SQL itself.

Consider a table of events with the following (slightly unrealistic) definition:

CREATE TABLE dbo.events(
    id INT IDENTITY NOT NULL PRIMARY KEY,
    dt datetime NOT NULL,
    val float NOT NULL
);

Let’s say we want to bin the x-axis (time) by day, and the y-axis (value) into buckets of 10 (i.e. a bin for values from 0-10, another from 10-20, etc.) Here’s one way to write that query in SQL:

SELECT CAST(dt AS DATE) AS xval,
       FLOOR(val / 10) * 10 AS yval,
       COUNT(*) AS count
FROM dbo.events
GROUP BY CAST(dt AS DATE), FLOOR(val / 10) * 10

Given the above query, changing the size of the bins is simple and straightforward.

How fast is this query? I measured the above query on my development laptop in Microsoft SQL Server 2016 with a table with 1 billion records on it, as well as various candidate optimizations, below:

| Test Case | Duration (seconds) | Throughput (records/second) | Comments |
|---|---|---|---|
| Base case | 10.23 | 97.7 million | |
| Use of COUNT(1) instead of COUNT(*) | (statistically indistinguishable) | (statistically indistinguishable) | As expected, COUNT(1) has no performance impact. |
| Data sorted by dt then val (CREATE CLUSTERED INDEX IX_events_dt_val ON dbo.events(dt, val)) | (statistically indistinguishable) | (statistically indistinguishable) | I was hoping to eliminate the need for sorts in the query, but perhaps the query optimizer can’t tell that the GROUP BY expressions are order-preserving. |
| Columnstore table (CREATE CLUSTERED COLUMNSTORE INDEX) | 3.238 | 308.8 million | Reads are much faster, but writes might be slower. |
| Memory-optimized table (CREATE TABLE … WITH (MEMORY_OPTIMIZED=ON)) | (insufficient memory) | N/A | Apparently 16GB allocated to SQL Server isn’t enough. |
| Memory-optimized table with 10,000,000 records | 1.052 | 9.5 million | Throughput is suspiciously low, but memory-optimized tables are optimized for OLTP, not analytics. |

Other possibilities which I did not test include:

  • Different GROUP BY expressions to see if they are more efficient or the optimizer can determine that they are order-preserving
  • An event table that stores date and time in separate columns, as a data warehouse might
  • Memory optimized tables with columnstore indexes
  • Memory optimized tables with natively compiled queries

Conclusion: Don’t spend too much time on query optimization, but consider storing your events in a columnstore table rather than a rowstore table.

Imperative (Traditional) Programming Languages

Another approach to binning is to do it in an imperative programming language (e.g. JavaScript, C#, or Java). Here are a few potential avenues to explore:

  1. Accumulate counts in a dictionary/map/hash table keyed by (xval, yval) (see the sketch below).
  2. Use something like LINQ and express the query in a form similar to the SQL query above.
  3. If the data is already properly ordered, and your group-by expressions are order-preserving, use this information to process the data piece by piece rather than keeping everything in RAM.

I’m personally intrigued by #3, as I doubt most general-purpose analytical systems or libraries use this optimization technique.
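As a sketch of approach #1, here is what dictionary-based binning might look like in JavaScript. The event shape and bin sizes mirror the SQL example above; all names are illustrative:

// Sketch of binning events by (day, value bucket of 10) in a plain
// object keyed by "xval|yval", then flattening the counts into
// { xval, yval, count } rows like the SQL query above produces.
function binEvents(events) {
    var counts = {};
    events.forEach(function (e) {
        var xval = e.date.toISOString().slice(0, 10); // bin time by day
        var yval = Math.floor(e.value / 10) * 10;     // bin value into 10s
        var key = xval + '|' + yval;
        counts[key] = (counts[key] || 0) + 1;
    });
    return Object.keys(counts).map(function (key) {
        var parts = key.split('|');
        return { xval: parts[0], yval: Number(parts[1]), count: counts[key] };
    });
}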

Stream Processing Systems

Another possible place where binning could be performed is within a stream processing system. Imagine a near-realtime D3 latency chart rendering binned data from an Apache Flink cluster that continuously aggregates and bins raw event data from a Kafka event bus, or from an Amazon Kinesis Data Analytics SQL query.

Where should binning be performed?

My recommendations are as follows:

  1. If you’re dealing with streaming data, do it in the streaming analytics system (Apache Flink, etc.)
  2. If your data set is small, and you want to support responsive dynamic rebinning of the data (e.g. a dropdown where the user can select whether they want to bin the x-axis by day, month, quarter, or year), pull the raw event data into the web browser and perform the binning there. Pay careful attention to the amount of data you are transferring over the network and the amount of RAM you require. Consider presorting the events by date when retrieving the data from your data store.
  3. If your data set is medium-sized, or you want to use third-party libraries, pull the raw event data into the web server and perform the binning there. Network bandwidth and RAM usage are far less of a concern here than within a web browser, but you must still mind them. Consider presorting the events by date when retrieving the data from your data store.
  4. If your data set is very large, consider binning inside the data storage system (e.g. the SQL server) as part of data retrieval (a.k.a. move compute to data not data to compute).

What do I mean by “small”? The answer is, of course, “it depends”, but here’s how I try to break it down. Let’s say we want to be able to render the chart in 5 seconds or less, and that we’ll allocate 2.5 seconds to downloading the data and 2.5 seconds to the JavaScript execution and rendering. If we estimate the target (95th percentile) user has an Internet connection of 1MB/sec, we can transfer no more than 2.5MB of data. If we transfer the data compressed, and the compression achieves a 10:1 ratio, this is 25MB of raw data, which I imagine shouldn’t be a problem to store in RAM. If a single event is 100 bytes uncompressed, we can transfer no more than 250,000 events to the browser.

Naturally, the above recommendations do not take into account all possible considerations. For example, if you pull the raw event data into a web browser, you may now have to solve a new problem: How to keep the data in the web browser up-to-date? This problem doesn’t exist if you perform the binning where the data resides. On the other hand, if you bin inside your data storage system, what additional I/O, CPU, or cache pressure will you add to the server, and how will this interact with the existing utilization profile of the storage system? As with everything, it’s all about tradeoffs.

Visualizing Latency Part 2: What is Binning?

This post is part 2 of my series about visualizing latency, which is very useful for debugging certain classes of performance problems.

As mentioned in Brendan Gregg’s Latency Heat Maps page, a latency heat map is a visualization where each column of data is a histogram of the observations for that time interval.

[Figure: example latency heat map from Brendan Gregg’s Latency Heat Maps page.]

As with histograms, the key decision that needs to be made when using a latency heat map is how to bin the data. Binning is the process of dividing the entire range of values into a series of intervals and then counting how many values fall into each interval. That said, there is no “best” number of bins, and different bin sizes can reveal different features of the data. Ultimately, the number of bins depends on the distribution of your data set as well as the size of the rendering area.

With latency heatmaps, binning often must be performed twice: once for the x-axis (time) and once for the y-axis (interval of observed values).

Allow me to demonstrate this visually. Here is an Excel file with the historical daily stock price of GE, courtesy of Yahoo! Finance. I have rendered the close prices with the D3 latency heatmap chart using four different binning strategies:

[Figures: four heatmaps of the GE close prices, with the x-axis binned by year-month or by year and the y-axis divided into 12 or 30 vertical bins.]

As you can see, each chart shows a slightly different perspective.

You may find you need to experiment with multiple binning strategies until you arrive at a latency heatmap chart with the appropriate level of detail for your use case.