# Calculating Percentiles on Streaming Data Part 2: Notes on Implementing Greenwald-Khanna

March 7, 2018

This is part 2 of my series on calculating percentiles on streaming data.

The most famous algorithm for calculating percentiles on streaming data appears to be Greenwald-Khanna [GK01]. I spent a few days implementing the Greenwald-Khanna algorithm from the paper and I discovered a few things I wanted to share.

#### Insert Operation

The insert operation is defined in [GK01] as follows:

- INSERT(v)
- Find the smallest i, such that v_{i-1} ≤ v < v_i, and insert the tuple t_x = (v, g_x = 1, Δ_x = ⌊2εn⌋), between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the maximum observation seen, then insert (v, 1, 0).

However, when I implemented this operation, I noticed via testing that there were certain queries I could not fulfill. For example, consider applying Greenwald-Khanna with ε = 0.1 to a sequence of eight values, producing the summary below, and then applying **QUANTILE(φ = 0.5)**. This means that r = ⌈φn⌉ = 4 and εn = 0.8. There is no i that satisfies both r − r_min(v_i) ≤ εn and r_max(v_i) − r ≤ εn, as you can see below:

| i | (v_i, g_i, Δ_i) | r_min | r_max | r − r_min | r_max − r | r − r_min ≤ εn | r_max − r ≤ εn |
|---|---|---|---|---|---|---|---|
| 0 | (2, 1, 0) | 1 | 1 | 3 | −3 | F | T |
| 1 | (3, 1, 0) | 2 | 2 | 2 | −2 | F | T |
| 2 | (5, 1, 0) | 3 | 3 | 1 | −1 | F | T |
| 3 | (6, 1, 1) | 4 | 5 | 0 | 1 | T | F |
| 4 | (11, 1, 0) | 5 | 5 | −1 | 1 | T | F |
| 5 | (12, 1, 0) | 6 | 6 | −2 | 2 | T | F |
| 6 | (18, 1, 0) | 7 | 7 | −3 | 3 | T | F |
| 7 | (20, 1, 0) | 8 | 8 | −4 | 4 | T | F |
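The failing query can be checked mechanically from the summary tuples. The following Python snippet (assuming εn = 0.8 and r = 4, i.e. ε = 0.1 and n = 8, which match the values in the table) recomputes both conditions for every tuple:

```python
# Summary tuples (v, g, delta) from the table above.
tuples = [(2, 1, 0), (3, 1, 0), (5, 1, 0), (6, 1, 1),
          (11, 1, 0), (12, 1, 0), (18, 1, 0), (20, 1, 0)]
eps_n = 0.8   # epsilon * n
r = 4         # ceil(phi * n) for phi = 0.5, n = 8

satisfiable = []
r_min = 0
for v, g, delta in tuples:
    r_min += g              # r_min(v_i) is the sum of g_j for j <= i
    r_max = r_min + delta   # r_max(v_i) = r_min(v_i) + delta_i
    satisfiable.append(r - r_min <= eps_n and r_max - r <= eps_n)

# No tuple satisfies both conditions, so the query cannot be answered.
print(any(satisfiable))   # prints False
```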

I believe the fundamental problem is that the definition of **INSERT(v)** fails to maintain the key invariant g_i + Δ_i ≤ 2εn. To correct this issue, the Greenwald-Khanna **INSERT(v)** operation must be modified as follows:

- INSERT(v)*
- Find the smallest i, such that v_{i-1} ≤ v < v_i, and insert the tuple t_x = (v, g_x = 1, Δ_x = ⌊2εn⌋ − 1), between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the maximum observation seen, then insert (v, 1, 0). Also, the first ⌈1/(2ε)⌉ elements must be inserted with Δ_i = 0.

I found the above modification in Prof. Chandra Chekuri’s lecture notes for his class CS 598CSC: Algorithms for Big Data. I believe this modification maintains the above invariant after insertions.
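As a sketch of how the modified rule might look in code (Python; the class layout and names are my own, and COMPRESS is omitted, so the summary simply grows by one tuple per observation rather than staying space-efficient):

```python
import math

class GKSummary:
    """Sketch of a GK summary supporting only the modified INSERT."""

    def __init__(self, epsilon):
        self.eps = epsilon
        self.n = 0    # number of observations seen so far
        self.S = []   # summary: list of [v, g, delta], sorted by v

    def insert(self, v):
        if (self.n < math.ceil(1.0 / (2 * self.eps))
                or not self.S or v < self.S[0][0] or v >= self.S[-1][0]):
            # The first ceil(1/(2*eps)) elements, and any new minimum or
            # maximum, are inserted with delta = 0.
            delta = 0
        else:
            delta = math.floor(2 * self.eps * self.n) - 1
        # Find the smallest i such that v < v_i and insert before it.
        i = 0
        while i < len(self.S) and self.S[i][0] <= v:
            i += 1
        self.S.insert(i, [v, 1, delta])
        self.n += 1
```

For example, with ε = 0.25, inserting 1, 10, 2, 9, 5 yields the sorted summary values [1, 2, 5, 9, 10], where only the last, interior insertion (5) receives a nonzero Δ.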

#### Band Construction

Banding in [GK01] refers to the grouping together of possible values of Δ for purposes of merging in the **COMPRESS** operation. The paper defines banding as the following:

- BAND(Δ, 2εn)
- For α from 1 to ⌈log(2εn)⌉, we let p = ⌊2εn⌋ and we define band_α to be the set of all Δ such that p − 2^α − (p mod 2^α) < Δ ≤ p − 2^{α−1} − (p mod 2^{α−1}). We define band_0 to simply be p. As a special case, we consider the first 1/(2ε) observations, with Δ = 0, to be in a band of their own. We will denote by band(t_i, n) the band of Δ_i at time n, and by band_α(n) all tuples (or equivalently, the Δ values associated with these tuples) that have a band value of α.

Here are some things I found when implementing banding:

- It is important to note that log in the above refers to the base-2 logarithm of a number.
- I found it useful to construct an array, bands, at the beginning of the **COMPRESS** operation, such that bands[Δ] is the band for the provided Δ.
- I used a very large constant to denote the band of Δ = 0, as this made the tree-building operation simpler.
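As a sketch of that array construction (Python; the function name is mine, and p = ⌊2εn⌋ is assumed to be at least 1):

```python
import math

def build_bands(p):
    """bands[delta] -> band of delta, for p = floor(2 * eps * n), p >= 1."""
    LARGE = 1 << 30        # stand-in band for delta == 0 (see note above)
    bands = [0] * (p + 1)
    for alpha in range(1, math.ceil(math.log2(p)) + 1):
        # band_alpha: p - 2^a - (p mod 2^a) < delta
        #             <= p - 2^(a-1) - (p mod 2^(a-1))
        lo = p - 2 ** alpha - (p % 2 ** alpha)
        hi = p - 2 ** (alpha - 1) - (p % 2 ** (alpha - 1))
        for delta in range(max(lo, 0) + 1, hi + 1):
            bands[delta] = alpha
    bands[p] = 0           # band_0 contains exactly delta == p
    bands[0] = LARGE       # the special band of the first observations
    return bands
```

For p = 7, this assigns Δ ∈ {5, 6} to band 1 and Δ ∈ {1, 2, 3, 4} to band 2, with Δ = 7 in band 0 and Δ = 0 in the special large band.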

#### Compress

Compression in [GK01] is defined as follows:

- COMPRESS()
- for i from s − 2 to 0 do …

However, with this definition, it is possible for the first (0th) tuple to be deleted, which would then violate the following invariant of the summary data structure: “We ensure that, at all times, the maximum and the minimum values are part of the summary.”

Therefore, I modified **COMPRESS()** to go from s − 2 to 1 so that the first tuple is never deleted and the invariant is maintained.
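A simplified sketch of that modified loop (Python; unlike the paper’s full COMPRESS, it merges only adjacent pairs of tuples rather than a tuple together with all of its descendants, and it takes the precomputed bands array described earlier as input):

```python
def compress(S, eps, n, bands):
    """S is the summary as a list of [v, g, delta]; bands[delta] gives the
    band of delta. Iterates from s - 2 down to 1 so S[0] is never deleted."""
    i = len(S) - 2
    while i >= 1:
        v, g, d = S[i]
        v_next, g_next, d_next = S[i + 1]
        # Merge t_i into t_{i+1} when the band and capacity conditions hold.
        if bands[d] <= bands[d_next] and g + g_next + d_next < 2 * eps * n:
            S[i + 1] = [v_next, g + g_next, d_next]
            del S[i]
        i -= 1
    return S
```

For instance, with ε = 0.1 and n = 20 (so 2εn = 4), the two interior tuples of [[1, 1, 0], [3, 1, 2], [5, 1, 2], [9, 1, 0]] both merge rightward, leaving the minimum tuple untouched.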

#### References

- [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In
*Proceedings of ACM SIGMOD*, pages 58–66, 2001.