Calculating Percentiles on Streaming Data Part 3: Visualizing Greenwald-Khanna

This is part 3 of my series on calculating percentiles on streaming data.

In an effort to better understand the Greenwald-Khanna [GK01] algorithm, I created a series of visualizations of the cumulative distribution functions of a randomly-generated, normally-distributed data set with \mu = 0 and \sigma = 1, as the number of random numbers n increases from 1 to 1,000.

The way to read these visualizations is to find the percentile you are looking for on the y-axis, then trace horizontally to find the vertical line on the chart which intersects this location, then read the value from the x-axis.

Exact:
Greenwald-Khanna \epsilon = 0.1:
Greenwald-Khanna \epsilon = 0.05:
Greenwald-Khanna \epsilon = 0.01:

From these visualizations, it is quite intuitive and clear how the “resolution” of Greenwald-Khanna increases as \epsilon decreases, and how the compress operation keeps the number of elements in the summary data set \mathsf{S} relatively small as n increases.

References

  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.

One thought on “Calculating Percentiles on Streaming Data Part 3: Visualizing Greenwald-Khanna

  1. Pingback: Calculating Percentiles on Streaming Data Part 7: Cormode-Korn-Muthukrishnan-Srivastava | Steven Engelhardt, CFA, AIF

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s