Calculating Percentiles on Streaming Data Part 7: Cormode-Korn-Muthukrishnan-Srivastava

This is part 7 of my series on calculating percentiles on streaming data.

In 2005, Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava published a paper called Effective Computation of Biased Quantiles over Data Streams [CKMS05].  This paper took the Greenwald-Khanna algorithm [GK01] and made the following notable changes:

  1. Generalized the algorithm to work with arbitrary targeted quantiles, where a targeted quantile is a combination of a quantile \phi and a maximum allowable error \epsilon.  The algorithm maintains the summary data structure \mathsf{S} so that each targeted quantile can be answered within its allowable error.
  2. Simplified the Compress() operation by removing the concept of banding.

The paper also demonstrated that the new, more generalized algorithm could handle the following cases with ease:

  1. The uniform quantile case, where each quantile is maintained with identical allowable error \epsilon.
  2. The low-biased quantile case, where lower quantiles (i.e. as \phi approaches 0) are maintained with higher precision (lower maximum allowable error) than higher quantiles.
  3. The high-biased quantile case, where higher quantiles (i.e. as \phi approaches 1) are maintained with higher precision (lower maximum allowable error) than lower quantiles.

Visualizing the Behavior of CKMS

Similar to my post about Visualizing Greenwald-Khanna, let’s visualize the above three cases with CKMS.

Uniform Quantiles

[Visualization: CKMS with uniform quantiles]

Low-Biased Quantiles

[Visualization: CKMS with low-biased quantiles]

High-Biased Quantiles

[Visualization: CKMS with high-biased quantiles]

It’s quite intuitive from these visualizations how the low-biased quantiles case keeps more (and finer-grained) measurements at low percentile values, while the high-biased quantiles case does the same at high percentile values.

Understanding the CKMS Algorithm

Much of the below will only make sense if you have read the paper, so I suggest trying to read [CKMS05] first, and then returning here.

The key to understanding the CKMS algorithm is understanding the invariant function f(r_i, n).  I found that the best way to understand f(r_i, n) is to understand the targeted quantiles invariant function (Definition 5 in [CKMS05]) and the error function f(\phi n, n) / n charts (Figure 2 in [CKMS05]).
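
To make the shape of f(r_i, n) concrete before diving into the special cases, here is a small sketch of how the targeted-quantiles invariant is typically computed: each target (\phi_j, \epsilon_j) contributes a V-shaped error allowance that bottoms out at 2 \epsilon_j n when the rank equals \phi_j n, and the overall allowance at a rank is the most restrictive (smallest) of these.  This is my own illustrative sketch, consistent with Definition 5 of [CKMS05] and with the plots below; the target struct, the invariant() function, and the floor at 1 are names and choices of mine, not the library's API, so consult the paper for the precise statement.

#include <algorithm>
#include <limits>
#include <vector>

// One targeted quantile: maintain quantile phi to within error epsilon.
struct target { double phi; double epsilon; };

// Sketch of the invariant function f(r, n) for a set of targets T.  Each
// target contributes a V-shaped allowance with minimum 2*epsilon_j*n at
// r == phi_j*n; the summary must satisfy g_i + Delta_i <= f(r_i, n), so the
// binding constraint at each rank is the smallest allowance over all targets.
double invariant(double r, double n, const std::vector<target>& T) {
    double f = std::numeric_limits<double>::max();
    for (const auto& t : T) {
        double fj = (r < t.phi * n)
            ? 2 * t.epsilon * (n - r) / (1 - t.phi)   // ranks below the target
            : 2 * t.epsilon * r / t.phi;              // ranks at or above the target
        f = std::min(f, fj);
    }
    return std::max(f, 1.0);  // practical floor so adjacent tuples can always merge
}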

Uniform Quantiles Invariant Function

Consider the uniform quantile case, where we want all quantiles to be maintained with identical allowable error \epsilon.  In this case, the targeted quantile tuples set will become \mathsf{T} = \{(\frac{1}{n}, \epsilon), (\frac{2}{n}, \epsilon), \ldots, (\frac{n-1}{n}, \epsilon), (1, \epsilon)\}.

Given the above definition, below is a visualization of the error function f(\phi n, n)/n computed for \epsilon = 0.1 as n increases from 3 to 10:

[Animation: error function for uniform quantiles, \epsilon = 0.1, n = 3 to 10]

Visually, the solid polyline can be interpreted as the amount of error allowed at a given value of \phi; the higher the line, the more error allowed at that particular percentile.

Notice how the solid polyline is converging towards a horizontal line at f(\phi n, n)/n = 0.2, which exactly corresponds to the GK invariant function f(r_i, n) = 2 \epsilon n, as found in [GK01].
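
As a quick sanity check using the sketch above: at a target rank r = j, the term contributed by the target (\frac{j}{n}, \epsilon) itself is \frac{2 \epsilon r}{j/n} = 2 \epsilon n, and every other target contributes at least that much, so f(r, n) = 2 \epsilon n exactly at the target ranks; between targets, the polyline approaches the same value as the targets become denser, which is why the curve flattens out as n grows.  (This is my back-of-the-envelope reasoning; see [CKMS05] for the formal argument.)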

Low-Biased Quantiles Invariant Function

Now consider the low-biased quantile case, where we want lower quantiles to have higher precision than higher quantiles.  In this case, the targeted quantile tuples set will become \mathsf{T} = \{(\frac{1}{n}, \frac{\epsilon}{n}), (\frac{2}{n}, \frac{2\epsilon}{n}), \ldots, (\frac{n-1}{n}, \frac{(n-1)\epsilon}{n}), (1, \epsilon)\}.

Given the above definition, below is a visualization of the error function f(\phi n, n)/n computed for \epsilon = 0.1 as n increases from 3 to 10:

[Animation: error function for low-biased quantiles, \epsilon = 0.1, n = 3 to 10]

Notice how the solid polyline allows less error at lower percentiles and is converging towards the line f(\phi n, n)/n = 0.2\phi, which exactly corresponds to the low-biased quantiles invariant function f(r_i, n) = 2 \epsilon r_i, as found in Definition 3 of [CKMS05].

High-Biased Quantiles Invariant Function

Now consider the high-biased quantile case, where we want higher quantiles to have higher precision than lower quantiles.  In this case, the targeted quantile tuples set will become \mathsf{T} = \{(\frac{1}{n}, \epsilon), (\frac{2}{n}, \frac{(n-1)\epsilon}{n}), \ldots, (\frac{n-1}{n}, \frac{2\epsilon}{n}), (1, \frac{\epsilon}{n})\}.

Given the above definition, below is a visualization of the error function f(\phi n, n)/n computed for \epsilon = 0.1 as n increases from 3 to 10:

[Animation: error function for high-biased quantiles, \epsilon = 0.1, n = 3 to 10]

Notice how the solid polyline allows less error at higher percentiles and is converging towards the line f(\phi n, n)/n = 0.2(1 - \phi), which corresponds to the high-biased quantiles invariant function f(r_i, n) = 2 \epsilon (n - r_i) (not found in the [CKMS05] paper).
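
To tie the three cases together, here is a small numeric check that evaluates the invariant sketch from the previous section against the three target sets above; the expected limiting values 2 \epsilon n, 2 \epsilon r, and 2 \epsilon (n - r) are printed alongside.  As before, this is illustrative code of my own, not part of the library.

#include <cstdio>
#include <vector>
// Reuses the illustrative target struct and invariant() sketch from the
// "Understanding the CKMS Algorithm" section above.

int main() {
    const double eps = 0.1;
    const int n = 1000;
    std::vector<target> uniform, low_biased, high_biased;
    for (int j = 1; j <= n; ++j) {
        double phi = (double)j / n;
        uniform.push_back({phi, eps});                       // identical error everywhere
        low_biased.push_back({phi, eps * j / n});            // tighter as phi -> 0
        high_biased.push_back({phi, eps * (n - j + 1) / n}); // tighter as phi -> 1
    }
    for (double phi : {0.1, 0.5, 0.9}) {
        double r = phi * n;
        std::printf("phi=%.1f  uniform=%6.1f (~%.0f)  low=%6.1f (~%.0f)  high=%6.1f (~%.0f)\n",
                    phi,
                    invariant(r, n, uniform),     2 * eps * n,
                    invariant(r, n, low_biased),  2 * eps * r,
                    invariant(r, n, high_biased), 2 * eps * (n - r));
    }
    return 0;
}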

Availability in Streaming Percentiles Library

All four variants (uniform quantile, low-biased quantile, high-biased quantile, and targeted quantile) of the CKMS algorithm are available in my streaming percentiles library.  Here’s an example of what it’s like to use:

#include <stmpct/ckms_hbq.hpp> // High-biased quantiles

ckms_hbq c(0.01);
for (int i = 0; i < 1000; ++i)
    c.insert(rand());
double p50 = c.quantile(0.5); // Approx. median
double p95 = c.quantile(0.95); // Approx. 95th percentile

Or, from JavaScript:

var sp = require('streamingPercentiles.v1.min.js');
var c = new sp.CKMS_HBQ(0.1);
for (var i = 0; i < 1000; ++i)
    c.insert(Math.random());
var p50 = c.quantile(0.5); // Approx. median
var p95 = c.quantile(0.95); // Approx. 95th percentile

For more information, visit the streaming percentiles library GitHub page.

References

  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.
  • [CKMS05] G. Cormode, F. Korn, S. Muthukrishnan and D. Srivastava. Effective Computation of Biased Quantiles over Data Streams. In Proceedings of the 21st International Conference on Data Engineering, pages 20-31, 2005.

Calculating Percentiles on Streaming Data Part 6: Building a C++ and JavaScript Library from a Single Codebase

This is part 6 of my series on calculating percentiles on streaming data.

For the past 10 days or so, I’ve been working on the build process of my C++ and JavaScript streaming analytics libraries. Using the magic of Emscripten, I have been able to combine both libraries into a single, C++ codebase, from which I can compile both the C++ and JavaScript versions of the library. Furthermore, I was able to do this without breaking backwards compatibility of the JavaScript library. I also made a number of other improvements to the compilation process, such as providing precompiled binaries and supporting both shared and static libraries.

This blog post will discuss how the consolidated build process works. If you’d like to follow along using the source code, it’s available at https://github.com/sengelha/streaming-percentiles-cpp.

Overview of Build System

As the streaming analytics library is intended to work across multiple operating systems, I use CMake for the core of the build system. There are plenty of good blog articles about using CMake for generating cross-platform C++ libraries, so I don’t need to say much more about that here.

As invoking the full build process properly requires multiple steps, I created the scripts build.sh (for Linux / Mac OS X) and build.bat (for Windows) to simplify the process. These scripts are responsible for performing the build, running the unit tests, and creating the binary packages. The entire build is performed in a directory outside the source tree, which is considered a CMake best practice. The scripts support various command-line options (e.g. build.sh --release for a release build, build.sh --clean for a clean build) to give a developer fine-grained control over the build process; use --help to view usage information.

Shared and Static Libraries

The CMake project supports building both static and shared versions of the streaming analytics library. For Linux and Mac OS X, it’s as simple as having two CMake targets: add_library(xxx_shared SHARED ${SRC}) and add_library(xxx_static STATIC ${SRC}).

However, for Windows, this isn’t quite enough. In order to properly build a Windows shared library, all exposed classes, functions, etc. must be marked with __declspec(dllexport). There’s an entire CMake Wiki page on how to do it. I haven’t gotten around to doing that yet, so shared library support on Windows is currently disabled. To avoid a future file name conflict for Windows builds, I gave the static version of the library a different name than the shared version of the library (stmpcts.lib instead of stmpct.lib).
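
For reference, the usual fix is a small export-macro header along the following lines; this is a generic sketch of the well-known pattern, not code from the repository (the STMPCT_API and related macro names are placeholders I made up):

// stmpct/export.hpp (hypothetical) -- the standard __declspec export-macro pattern
#pragma once

#if defined(_WIN32)
#  if defined(STMPCT_BUILDING_SHARED)   // defined by CMake when compiling the DLL itself
#    define STMPCT_API __declspec(dllexport)
#  elif defined(STMPCT_USING_SHARED)    // defined by consumers linking against the DLL
#    define STMPCT_API __declspec(dllimport)
#  else
#    define STMPCT_API                  // static library: no decoration needed
#  endif
#else
#  define STMPCT_API                    // non-Windows builds: no decoration needed
#endif

// Every public symbol would then be marked, e.g.:
//   class STMPCT_API gk { ... };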

I also found that on non-x86 systems, even static versions of the library need to be built with position-independent code, which I enabled with set (CMAKE_POSITION_INDEPENDENT_CODE TRUE).

C++ Unit Testing

For C++ unit tests, I decided upon the Boost.Test library. All C++ unit tests are compiled into a single executable which is then tested using CTest.

In order to test both the static and shared versions of the library, this executable is actually built twice: once linked against the static version of the library, the other linked against the shared version.
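
For a flavor of what these tests look like, here is a minimal Boost.Test case in that style; it is an illustrative sketch rather than a test from the repository, and it assumes the stmpct::gk API shown in part 5 of this series.  (It uses the header-only variant of Boost.Test so the example stays self-contained.)

#define BOOST_TEST_MODULE streaming_percentiles_tests
#include <boost/test/included/unit_test.hpp>
#include <stmpct/gk.hpp>

// Insert the values 1..1000 and check that the approximate median lands
// within the epsilon rank bound of the true median.
BOOST_AUTO_TEST_CASE(gk_median_within_epsilon)
{
    double epsilon = 0.1;
    stmpct::gk g(epsilon);
    for (int i = 1; i <= 1000; ++i)
        g.insert(i);
    double p50 = g.quantile(0.5);
    BOOST_CHECK_GE(p50, (0.5 - epsilon) * 1000);
    BOOST_CHECK_LE(p50, (0.5 + epsilon) * 1000);
}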

Using Emscripten from CMake to Cross-Compile C++ to JavaScript

Emscripten is some incredibly powerful wizardry. Basically, you can point it at nearly any C++ code and it will cross-compile it to JavaScript.

Connecting Emscripten to CMake is quite easy. First you need a script to detect the emscripten compiler:

# FindEmscripten.cmake
# - try to find emscripten binary
#
# Variables you might use in your CMakeLists.txt:
#  EMSCRIPTEN_FOUND

find_program(EMSCRIPTEN_CPP_BINARY
             NAMES em++)
mark_as_advanced(EMSCRIPTEN_CPP_BINARY)

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(emscripten
    DEFAULT_MSG
    EMSCRIPTEN_CPP_BINARY)

Then it’s as simple as the following:

set(CMAKE_CXX_COMPILER em++)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --bind -std=c++11 -O3 --memory-init-file 0")
add_executable(xxx.js ${SRC})

I added a few more niceties, such as:

  • Used embind to naturally expose the classes in JavaScript (a sketch of such a binding follows this list)
  • Added JavaScript unit tests using Mocha and unit.js
  • Added an uglify-js step to the build process to create a minified version of the built JavaScript
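
For illustration, an embind declaration for the GK class might look roughly like the following; the actual bindings live in the js/ directory and may differ, and the method signatures here are inferred from the usage examples elsewhere in this series.

#include <emscripten/bind.h>
#include <stmpct/gk.hpp>

// Expose the C++ gk class to JavaScript as "GK" (illustrative sketch only).
EMSCRIPTEN_BINDINGS(streaming_percentiles) {
    emscripten::class_<stmpct::gk>("GK")
        .constructor<double>()                         // new sp.GK(epsilon)
        .function("insert", &stmpct::gk::insert)       // gk.insert(value)
        .function("quantile", &stmpct::gk::quantile);  // gk.quantile(phi)
}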

For more details, see the source code, particularly the js/ directory.

Packaging

For distribution convenience, I use CPack for creating the final packages of compiled code. Using CPack makes doing this nice and easy; simply set some variables (e.g. set(CPACK_PACKAGE_NAME "streaming_percentiles")), add include(CPack), and mark the files which should be included in the package with install(...).

For Windows, I prefer generating a ZIP package instead of a MSI; enabling this is as easy as set(CPACK_GENERATOR "ZIP").

Wrapping Up

There are a few more tricks in the codebase, like combining add_custom_command(OUTPUT xyz) with add_custom_target(xxx ALL DEPENDS xyz) to avoid excessive rebuilds, that you can discover for yourself. Just check out the source code for the project at https://github.com/sengelha/streaming-percentiles-cpp/.

If you’re only interested in the library itself, you can download pre-built binaries for Mac OS X / Linux / Windows / JavaScript from https://github.com/sengelha/streaming-percentiles-cpp/releases.

Calculating Percentiles on Streaming Data Part 5: C++ Library

This is part 5 of my series on calculating percentiles on streaming data.

I have created a reusable C++ library which contains my implementation of the streaming percentile algorithms found within this blog series and published it to GitHub. Here’s what using it looks like:

#include <stmpct/gk.hpp>

using namespace stmpct;

double epsilon = 0.1;
gk g(epsilon);
for (int i = 0; i < 1000; ++i)
    g.insert(rand());
double p50 = g.quantile(0.5); // Approx. median
double p95 = g.quantile(0.95); // Approx. 95th percentile

You can find it here:

Calculating Percentiles on Streaming Data Part 4: JavaScript Library

This is part 4 of my series on calculating percentiles on streaming data.

I have created a reusable JavaScript library which contains my implementation of the streaming percentile algorithms found within this blog series and published it to GitHub and NPM. Here’s what using it looks like:

var sp = require('streaming-percentiles');

// epsilon is allowable error.  As epsilon becomes smaller, the
// accuracy of the approximations improves, but the class consumes
// more memory.
var epsilon = 0.1;
var gk = new sp.GK(epsilon);
for (var i = 0; i < 1000; ++i)
    gk.insert(Math.random());
var p50 = gk.quantile(0.5); // Approx. median
var p95 = gk.quantile(0.95); // Approx. 95th percentile

You can find it here:

Calculating Percentiles on Streaming Data Part 3: Visualizing Greenwald-Khanna

This is part 3 of my series on calculating percentiles on streaming data.

In an effort to better understand the Greenwald-Khanna [GK01] algorithm, I created a series of visualizations of the cumulative distribution functions of a randomly-generated, normally-distributed data set with \mu = 0 and \sigma = 1, as the number of random numbers n increases from 1 to 1,000.

The way to read these visualizations is to find the percentile you are looking for on the y-axis, then trace horizontally to find the vertical line on the chart which intersects this location, then read the value from the x-axis.

[Visualizations: exact CDF; Greenwald-Khanna \epsilon = 0.1; Greenwald-Khanna \epsilon = 0.05; Greenwald-Khanna \epsilon = 0.01]

From these visualizations, it is quite intuitive and clear how the “resolution” of Greenwald-Khanna increases as \epsilon decreases, and how the compress operation keeps the number of elements in the summary data set \mathsf{S} relatively small as n increases.

References

  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.

Calculating Percentiles on Streaming Data Part 2: Notes on Implementing Greenwald-Khanna


This is part 2 of my series on calculating percentiles on streaming data.

The most famous algorithm for calculating percentiles on streaming data appears to be Greenwald-Khanna [GK01]. I spent a few days implementing the Greenwald-Khanna algorithm from the paper and I discovered a few things I wanted to share.

Insert Operation

The insert operation is defined in [GK01] as follows:

INSERT(v)
Find the smallest i, such that v_{i-1} \leq v < v_i, and insert the tuple t_x = (v_x, g_x, \Delta_x) = (v, 1, \lfloor 2 \epsilon n \rfloor), between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the maximum observation seen, then insert (v, 1, 0).

However, when I implemented this operation, I noticed via testing that there were certain queries I could not fulfill. For example, consider applying Greenwald-Khanna with \epsilon = 0.1 to the sequence of values \{11, 20, 18, 5, 12, 6, 3, 2\}, and then apply QUANTILE(\phi = 0.5). This means that r = \lceil \phi n \rceil = \lceil 0.5 \times 8 \rceil = 4 and \epsilon n = 0.1 \times 8 = 0.8. There is no i that satisfies both r - r_{min}(v_i) \leq \epsilon n and r_{max}(v_i) - r \leq \epsilon n, as you can see below:

i | t_i | r_{min} | r_{max} | r - r_{min} | r_{max} - r | r - r_{min} \leq \epsilon n ? | r_{max} - r \leq \epsilon n ?
--|----------|---|---|----|----|---|---
0 | (2,1,0)  | 1 | 1 | 3  | -3 | F | T
1 | (3,1,0)  | 2 | 2 | 2  | -2 | F | T
2 | (5,1,0)  | 3 | 3 | 1  | -1 | F | T
3 | (6,1,1)  | 4 | 5 | 0  | 1  | T | F
4 | (11,1,0) | 5 | 5 | -1 | 1  | T | F
5 | (12,1,0) | 6 | 6 | -2 | 2  | T | F
6 | (18,1,0) | 7 | 7 | -3 | 3  | T | F
7 | (20,1,0) | 8 | 8 | -4 | 4  | T | F

I believe the fundamental problem is that the definition of INSERT(v) fails to maintain the key invariant \textrm{max}_i(g_i + \Delta_i) \leq 2 \epsilon n. To correct this issue, the Greenwald-Khanna INSERT(v) operation must be modified as follows:

INSERT(v)*
Find the smallest i, such that v_{i-1} \leq v < v_i, and insert the tuple t_x = (v_x, g_x, \Delta_x) = (v, 1, \lfloor 2 \epsilon n \rfloor - 1), between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the maximum observation seen, then insert (v, 1, 0). Also, the first 1/(2 \epsilon) elements must be inserted with \Delta_i = 0.

I found the above modification in Prof. Chandra Chekuri’s lecture notes for his class CS 598CSC: Algorithms for Big Data. I believe this modification maintains the above invariant after the first 1/(2 \epsilon) insertions.
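
To make the correction concrete, here is a minimal sketch of the modified insert in C++, assuming the summary is kept as a sorted vector of (v, g, \Delta) tuples; the tuple_t and gk_summary names, and the exact handling of ties at the extremes, are illustrative choices of mine rather than a faithful excerpt of any implementation.

#include <algorithm>
#include <cmath>
#include <vector>

struct tuple_t { double v; int g; int delta; };

struct gk_summary {
    std::vector<tuple_t> S;   // kept sorted by v
    double epsilon;
    long long n = 0;          // observations seen so far

    void insert(double v) {
        // New minimum/maximum observations, and the first 1/(2*epsilon)
        // observations, are inserted with delta = 0; everything else gets
        // floor(2*epsilon*n) - 1 (note the "- 1" relative to [GK01]).
        bool warming_up = n < 1.0 / (2 * epsilon);
        bool new_extreme = S.empty() || v < S.front().v || v > S.back().v;
        int delta = (warming_up || new_extreme)
            ? 0
            : (int)std::floor(2 * epsilon * n) - 1;

        auto it = std::find_if(S.begin(), S.end(),
                               [v](const tuple_t& t) { return v < t.v; });
        S.insert(it, {v, 1, delta});
        ++n;
    }
};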

Band Construction

Banding in [GK01] refers to the grouping together of possible values of \Delta for purposes of merging in the COMPRESS operation. The paper defines banding as the following:

BAND(\Delta, 2 \epsilon n)
For \alpha from 1 to \lceil \log(2 \epsilon n) \rceil, we let p = \lfloor 2 \epsilon n \rfloor and we define \mathrm{band}_\alpha to be the set of all \Delta such that p - 2^\alpha - (p \mod 2^\alpha) < \Delta \leq p - 2^{\alpha - 1} - (p \mod 2^{\alpha - 1}). We define \mathrm{band}_0 to simply be p. As a special case, we consider the first 1/(2\epsilon) observations, with \Delta = 0, to be in a band of their own. We will denote by \mathrm{band}(t_i, n) the band of \Delta_i at time n, and by \mathrm{band}_\alpha(n) all tuples (or equivalently, the \Delta values associated with these tuples) that have a band value of \alpha.

Here are some things I found when implementing banding:

  • It is important to note that in the above \log refers to the base 2 logarithm of a number.
  • I found it useful to construct an array, bands, at the beginning of the COMPRESS operation, such that bands[\Delta] is the band for the provided \Delta (see the sketch after this list).
  • I used a very large constant to denote the band of \Delta = 0, as this made the tree-building operation simpler.
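
Putting those notes together, here is a sketch of how the bands array might be precomputed at the start of COMPRESS(); it follows the BAND definition quoted above, uses a large sentinel for \Delta = 0, and skips over the edge cases a real implementation would need to handle.

#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// bands[delta] = band of delta, for delta in [0, floor(2*epsilon*n)].
std::vector<int> build_bands(double epsilon, long long n) {
    int p = (int)std::floor(2 * epsilon * n);
    std::vector<int> bands(p + 1, 0);
    bands[p] = 0;                                   // band_0 is exactly {p}
    int alpha_max = (int)std::ceil(std::log2(2 * epsilon * n));
    for (int alpha = 1; alpha <= alpha_max; ++alpha) {
        long long two_a  = 1LL << alpha;
        long long two_a1 = 1LL << (alpha - 1);
        long long lo = p - two_a  - (p % two_a);    // exclusive lower bound
        long long hi = p - two_a1 - (p % two_a1);   // inclusive upper bound
        for (long long d = std::max(lo + 1, 1LL); d <= hi; ++d)
            bands[d] = alpha;
    }
    bands[0] = std::numeric_limits<int>::max();     // delta == 0: a band of its own
    return bands;
}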

Compress

Compression in [GK01] is defined as follows:

COMPRESS()
for i from s - 2 to 0 do …

However, with this definition, it is possible for the first (0th) tuple to be deleted, which would then violate the following invariant of the summary data structure: “We ensure that, at all times, the maximum and the minimum values are part of the summary.”

Therefore, I modified COMPRESS() to go from s - 2 to 1 so that the first tuple is never deleted and the invariant is maintained.
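
For completeness, here is a simplified sketch of the adjusted compression loop, reusing the tuple_t type and bands array from the sketches above; it merges a tuple into its successor when the band ordering and the capacity condition allow, and it never visits i = 0, so the minimum always stays in the summary. The full algorithm also folds in a tuple's band-tree descendants (the g^* term in [GK01]), which this sketch omits.

void compress(std::vector<tuple_t>& S, double epsilon, long long n,
              const std::vector<int>& bands) {
    for (int i = (int)S.size() - 2; i >= 1; --i) {   // stop at 1, not 0
        if (bands[S[i].delta] <= bands[S[i + 1].delta] &&
            S[i].g + S[i + 1].g + S[i + 1].delta < 2 * epsilon * n) {
            S[i + 1].g += S[i].g;   // the successor absorbs the deleted tuple's count
            S.erase(S.begin() + i);
        }
    }
}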

References

  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.

Calculating Percentiles on Streaming Data Part 1: Introduction


This is part 1 of my series on calculating percentiles on streaming data.

Suppose that you are dealing with a system which processes one million requests per second, and you’d like to calculate the median (50th percentile) response time over the last 24 hours.

The naive approach would be to store every response time, sort them all, and then return the value in the middle. Unfortunately, this approach would require manipulating 1,000,000 * 60 * 60 * 24 = 86.4 billion values (roughly 700 GB at 8 bytes per value), which is almost certainly too many to fit into RAM and thus rather unwieldy to work with. This raises the question: is it possible to compute quantiles without storing every observation?

Munro and Paterson [MP80] proved that \Omega(n) space is required to exactly compute the median of n values. However, if we’re allowed to compute an approximation, there is a family of algorithms which can process the data in a single pass, and thus can be used without storing every observation. Notable algorithms with this property include:

  1. Manku, Rajagopalan, and Lindsay [MRL98] — single-pass, but an upper bound on n must be known a priori; seems generally superseded by [GK01]
  2. Manku, Rajagopalan, and Lindsay [MRL99] — randomized algorithm
  3. Greenwald-Khanna [GK01] — single-pass, improves on the space bounds of [MRL98], and removes the requirement that n is known in advance
  4. Gilbert, Kotidis, Muthukrishnan, and Strauss [GKMS02] — randomized algorithm, supports deletes as well as inserts
  5. Cormode and Muthukrishnan [CM04] — improves on the space bounds of [GKMS02]
  6. Cormode, Korn, Muthukrishnan, and Srivastava [CKMS05] — improves [GK01] by better handling skewed distributions when finding targeted quantiles
  7. Ted Dunning’s t-digest

The above list is intended to be representative, not exhaustive.

In this blog post series, I will explore various methods for calculating percentiles on streaming data.

References

  • [CKMS05] G. Cormode, F. Korn, S. Muthukrishnan and D. Srivastava. Effective Computation of Biased Quantiles over Data Streams. In Proceedings of the 21st International Conference on Data Engineering, pages 20-31, 2005.
  • [CM04] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 2004. In press.
  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.
  • [GKMS02] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of 28th International Conference on Very Large Data Bases, pages 454–465, 2002.
  • [MP80] J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:315–323, 1980.
  • [MRL98] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of ACM SIGMOD, pages 426–435, 1998.
  • [MRL99] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proceedings of ACM SIGMOD, volume 28(2) of SIGMOD Record, pages 251–262, 1999.