Calculating Percentiles on Streaming Data Part 7: Cormode-Korn-Muthukrishnan-Srivastava

This is part 7 of my series on calculating percentiles on streaming data.

In 2005, Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava published a paper called Effective Computation of Biased Quantiles over Data Streams [CKMS05].  This paper took the Greenwald-Khanna algorithm [GK01] and made the following notable changes:

  1. Generalized the algorithm to work with arbitrary targeted quantiles, where a targeted quantile is a combination of a quantile \phi and a maximum allowable error \epsilon.  The algorithm maintains a summary data structure \mathsf{S} from which each targeted quantile can be answered within its allowable error.
  2. Simplified the Compress() operation by removing the concept of banding.

The paper also demonstrated that the new, more generalized algorithm could handle the following cases with ease:

  1. The uniform quantile case, where each quantile is maintained with identical allowable error \epsilon.
  2. The low-biased quantile case, where lower quantiles (i.e. as \phi approaches 0) are maintained with higher precision (lower maximum allowable error) than higher quantiles.
  3. The high-biased quantile case, where higher quantiles (i.e. as \phi approaches 1) are maintained with higher precision (lower maximum allowable error) than lower quantiles.

Visualizing the Behavior of CKMS

Similar to my post about Visualizing Greenwald-Khanna, let’s visualize the above three cases with CKMS.

Uniform Quantiles

[Figure: CKMS summary, uniform quantiles, final state]

Low-Biased Quantiles

[Figure: CKMS summary, low-biased quantiles, final state]

High-Biased Quantiles

[Figure: CKMS summary, high-biased quantiles, final state]

It’s quite visually intuitive how the low-biased quantiles case keeps more, finer-grained measurements at low percentile values, and the high-biased quantiles case keeps more, finer-grained measurements at high percentile values.

Understanding the CKMS Algorithm

Much of the below will only make sense if you have read the paper, so I suggest trying to read [CKMS05] first, and then returning here.

The key to understanding the CKMS algorithm is understanding the invariant function f(r_i, n).  I found that the best way to understand f(r_i, n) is to understand the targeted quantiles invariant function (Definition 5 in [CKMS05]) and the error function f(\phi n, n) / n charts (Figure 2 in [CKMS05]).

Uniform Quantiles Invariant Function

Consider the uniform quantile case, where we want all quantiles to be maintained with identical allowable error \epsilon.  In this case, the targeted quantile tuples set will become \mathsf{T} = \{(\frac{1}{n}, \epsilon), (\frac{2}{n}, \epsilon), \ldots, (\frac{n-1}{n}, \epsilon), (1, \epsilon)\}.

Given the above definition, below is a visualization of the error function f(\phi n, n)/n computed for \epsilon = 0.1 as n increases from 3 to 10:

[Animation: uniform quantiles error function as n increases from 3 to 10]

Visually, the solid polyline can be interpreted as the amount of error allowed at a given value of \phi; the higher the line, the more error allowed at that particular percentile.

Notice how the solid polyline is converging towards a horizontal line at f(\phi n, n)/n = 0.2, which exactly corresponds to the GK invariant function f(r_i, n) = 2 \epsilon n, as found in [GK01].

Low-Biased Quantiles Invariant Function

Now consider the low-biased quantile case, where we want lower quantiles to have higher precision than higher quantiles.  In this case, the targeted quantile tuples set will become \mathsf{T} = \{(\frac{1}{n}, \frac{\epsilon}{n}), (\frac{2}{n}, \frac{2\epsilon}{n}), \ldots, (\frac{n-1}{n}, \frac{(n-1)\epsilon}{n}), (1, \epsilon)\}.

Given the above definition, below is a visualization of the error function f(\phi n, n)/n computed for \epsilon = 0.1 as n increases from 3 to 10:

[Animation: low-biased quantiles error function as n increases from 3 to 10]

Notice how the solid polyline allows less error at lower percentiles and is converging towards the line f(\phi n, n)/n = 0.2\phi, which exactly corresponds to the low-biased quantiles invariant function f(r_i, n) = 2 \epsilon r_i, as found in Definition 3 of [CKMS05].

High-Biased Quantiles Invariant Function

Now consider the high-biased quantile case, where we want higher quantiles to have higher precision than lower quantiles.  In this case, the targeted quantile tuples set will become \mathsf{T} = \{(\frac{1}{n}, \epsilon), (\frac{2}{n}, \frac{(n-1)\epsilon}{n}), \ldots, (\frac{n-1}{n}, \frac{2\epsilon}{n}), (1, \frac{\epsilon}{n})\}.

Given the above definition, below is a visualization of the error function f(\phi n, n)/n computed for \epsilon = 0.1 as n increases from 3 to 10:

[Animation: high-biased quantiles error function as n increases from 3 to 10]

Notice how the solid polyline allows less error at higher percentiles and is converging towards the line f(\phi n, n)/n = 0.2(1 - \phi), which corresponds to the high-biased quantiles invariant function f(r_i, n) = 2 \epsilon (n - r_i) (not found in the [CKMS05] paper).
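To make the three cases concrete, here is a minimal sketch of the three invariant functions discussed above (the function names are my own, not from the paper or any reference implementation). eps is the maximum allowable error, r is an item's rank, and n is the number of observations so far:

```cpp
// Sketch of the three CKMS invariant functions; names are hypothetical.
double f_uniform(double r, double n, double eps)     { return 2 * eps * n; }        // f(r_i, n) = 2*eps*n
double f_low_biased(double r, double n, double eps)  { return 2 * eps * r; }        // f(r_i, n) = 2*eps*r_i
double f_high_biased(double r, double n, double eps) { return 2 * eps * (n - r); }  // f(r_i, n) = 2*eps*(n - r_i)
```

Dividing each by n recovers the limiting lines in the charts above: 2\epsilon, 2\epsilon\phi, and 2\epsilon(1 - \phi) respectively, with \phi = r / n.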

Availability in Streaming Percentiles Library

All four variants (uniform quantile, low-biased quantile, high-biased quantile, and targeted quantile) of the CKMS algorithm are available in my streaming percentiles library.  Here’s an example of what it’s like to use:

#include <stmpct/ckms_hbq.hpp> // High-biased quantiles

ckms_hbq c(0.01);
for (int i = 0; i < 1000; ++i)
    c.insert(rand());
double p50 = c.quantile(0.5); // Approx. median
double p95 = c.quantile(0.95); // Approx. 95th percentile

Or, from JavaScript:

var sp = require('streamingPercentiles.v1.min.js');
var c = new sp.CKMS_HBQ(0.1);
for (var i = 0; i < 1000; ++i)
    c.insert(Math.random());
var p50 = c.quantile(0.5); // Approx. median
var p95 = c.quantile(0.95); // Approx. 95th percentile

For more information, visit the streaming percentiles library GitHub page.

References

  • [GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of ACM SIGMOD, pages 58–66, 2001.
  • [CKMS05] G. Cormode, F. Korn, S. Muthukrishnan and D. Srivastava. Effective Computation of Biased Quantiles over Data Streams. In Proceedings of the 21st International Conference on Data Engineering, pages 20-31, 2005.

Escaping Strings in XPath 1.0

XPath is a language for selecting nodes from an XML document. XPath is used extensively in XSLT and other XML technologies. I also vastly prefer using XPath (e.g. with XPathNavigator) over the XML DOM when manipulating XML in a non-streaming fashion.

In XPath, strings must be delimited by either single or double quotes. Given a quote character used to delimit a string, one can’t represent that same quote character within the string. This means that if you decide to use single quotes to delimit your XPath string, you can’t represent the string O’Reilly; use double quotes, and you can’t represent “Hello”.

However, given a quote delimiter, you can represent the other quote character. We can use this observation along with the concat XPath function to devise a general quoting rule for XPath strings. It’s easiest to show this via a series of examples:

Original String          Quoted XPath String
-----------------------  ---------------------------------------
a                        'a' (or "a")
O'Reilly                 "O'Reilly"
"Hello"                  '"Hello"'
"Hello, Mr. O'Reilly"    concat('"Hello, Mr. O', "'Reilly", '"')

Below is a piece of C++ code which implements these quotation rules:

#include <sstream>
#include <string>

std::string
QuoteXPathString(const std::string& xpath)
{
    // If we don't have any single or double-quote characters, quote the
    // expression in single quotes.
    std::string::size_type pos = xpath.find_first_of("'\"");
    if (pos == std::string::npos)
        return "'" + xpath + "'";

    // If we cannot find the alternate quotation character, quote the
    // expression in the alternate quotation character.
    char chOther = (xpath[pos] == '"' ? '\'' : '"');
    pos = xpath.find(chOther, pos + 1);
    if (pos == std::string::npos)
        return chOther + xpath + chOther;

    // The string has both quotation characters.  We need to use concat()
    // to form the string.
    std::stringstream ss;
    ss << "concat("
       << chOther
       << xpath.substr(0, pos)
       << chOther;
    do {
        chOther = (xpath[pos] == '"' ? '\'' : '"');
        std::string::size_type pos2 = xpath.find(chOther, pos + 1);
        ss << ','
           << chOther
           << xpath.substr(pos, pos2 - pos)
           << chOther;
        pos = pos2;
    } while (pos != std::string::npos);
    ss << ")";

    return ss.str();
}

Usage looks like:

std::string lastName = …; // May come from user input
std::string xpath = "//Customer[LastName = " +
    QuoteXPathString(lastName) + "]";

STL Objects and Win32 Module Boundaries

Let’s say you have the following function:

void AppendChar(std::string& s, char ch)
{
    s += ch;
}

What happens if this function is exported as an ordinal function from a DLL (not an inlined piece of code inside a header) and you call it from an EXE?

It works most of the time. When it doesn’t, it corrupts your heap and causes a spectacular mess.

In Windows you must free memory with the same allocator that allocated it. However, your EXE may not share the same allocator as the DLL. Perhaps the two modules are linked against different versions of libc, or perhaps one of the modules is using a static version of libc. If your EXE and DLL do not share an allocator and if AppendChar resizes the string s, you will almost certainly cause a heap corruption.

The STL performs a lot of reallocations behind the scenes for you; this is one of its major benefits. Unfortunately, if you are writing a general-purpose DLL these behind-the-scene allocations are deadly. You cannot know or dictate what version of libc your clients will use.

Therefore, I reiterate my previous recommendation:

Avoid passing STL objects as parameters across DLL boundaries.
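One common workaround, sketched here purely as an illustration (the function name and signature are my own, not from any particular library), is to keep the module boundary to a plain C ABI and let the caller own every allocation:

```cpp
#include <cstddef>
#include <cstring>

// Sketch: instead of exporting AppendChar(std::string&, char) from the DLL,
// export a C-style function that writes into a caller-owned buffer.  No
// memory is allocated or freed across the module boundary, so allocator
// mismatches cannot occur.
extern "C" int AppendCharC(char* buf, std::size_t bufSize, char ch)
{
    std::size_t len = std::strlen(buf);
    if (len + 2 > bufSize)  // need room for ch and the null terminator
        return -1;          // caller's buffer is too small
    buf[len] = ch;
    buf[len + 1] = '\0';
    return 0;
}
```

The EXE side can still use std::string internally; it simply copies into and out of a local buffer when calling across the boundary.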

Converting C++ Member Functions into Function Objects

Let’s say you have a C++ function that takes a function object as a parameter and calls it:

template <typename _Fn>
void call_functor(_Fn fn)
{
    fn();
}

Now let’s say you want to pass a class’s member function to call_functor() above, as in:

#include <iostream>

class C
{
public:
    void foo() { std::cout << "foo()\n"; }
};
     
C c;
call_functor(/* What do I put here? c.foo and &C::foo don’t work */);

The STL has a pointer-to-member function adapter called std::mem_fun() which almost gets us there. Unfortunately, it doesn’t quite meet our needs because it requires us to pass a pointer to an instance of C, as in:

C c;
std::mem_fun(&C::foo)(&c); // The &c is the problem

However, we could use std::mem_fun() if we could figure out a way to create a new functor around std::mem_fun() with &c bound as its first parameter.  Unfortunately, we cannot use the STL binders (std::bind1st and std::bind2nd) here: they only adapt binary functions into unary ones, and cannot turn a unary function object into the nullary one that call_functor() requires.

In the general case, you should use Boost’s very powerful binding library bind. However, let’s write our own simple binder for expository purposes.

First, we need a function, bind(), that returns a function object which binds a parameter, p1, to a unary function object, func. We’ll call the returned function object binder.

template <typename _Func, typename _P1>
inline binder<_Func, _P1> bind(_Func func, _P1 p1)
{
    return binder<_Func, _P1>(func, p1);
}

The class binder should store func and p1 and have an operator() which calls func with p1 as its parameter. For simplicity we’ll assume func returns void:

template <typename _Func, typename _P1>
class binder
{
public:
    binder(_Func func, _P1 p1) :
        func_(func), p1_(p1) {}
    void operator()() const { return func_(p1_); }
     
private:
    _Func func_; // The functor to apply
    _P1 p1_; // The first parameter
};

We can now solve the initial problem by combining our bind() with std::mem_fun():

call_functor(bind(std::mem_fun(&C::foo), &c));

We can make usage a little more convenient by introducing a macro:

#define mem_fun_functor(c, memFn) bind(std::mem_fun(memFn), &c)
     
call_functor(mem_fun_functor(c, &C::foo));

There’s plenty of room for improvements, but it’s amazing what you can do with a little template trickery.

STL Map Use

What’s wrong with the following code?

template<typename T1, typename T2>
struct my_pair
{
    typedef T1 first_type;
    typedef T2 second_type;

    my_pair() : first(T1()), second(T2()) {}
    my_pair(const T1& v1, const T2& v2) : first(v1), second(v2) {}

    T1 first;
    T2 second;
};

template<typename T1, typename T2>
inline bool operator<
    (
    const my_pair<T1, T2>& x,
    const my_pair<T1, T2>& y
    )
{
    return (x.first < y.first || x.second < y.second);
}

void f()
{
    typedef my_pair<..., ...> key_type;
    typedef ... value_type;
    typedef std::map<key_type, value_type> map_type;

    map_type map;
    // Use map
}

Answer: my_pair cannot be used as a key for an STL map because its operator< violates the requirements of a strict weak ordering. More specifically, the operator is not asymmetric: there are values for which both x < y and y < x are true. Consider the following:

typedef my_pair<int, int> P;
P p1(3, 5);
P p2(2, 6);

Both p1 < p2 and p2 < p1 evaluate to true.
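This is easy to check directly. Here is a self-contained repro, with the broken comparison from above written as a free function over a concrete pair of ints:

```cpp
// The broken comparison from above, on a concrete pair of ints.
struct IntPair { int first; int second; };

bool broken_less(const IntPair& x, const IntPair& y)
{
    return (x.first < y.first || x.second < y.second);
}
```

With p1 = (3, 5) and p2 = (2, 6): broken_less(p1, p2) is true because 5 < 6, and broken_less(p2, p1) is also true because 2 < 3.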

If you use my_pair as a key into a map you will not get any compiler errors or warnings, but you will likely see bizarre runtime behavior. For example, keys that were successfully insert()ed may not be found by find().

A correct version of operator< (which places minimal burden on T1 and T2 — they are only required to implement operator<) is:

template<typename T1, typename T2>
inline bool operator<
    (
    const my_pair<T1, T2>& x,
    const my_pair<T1, T2>& y
    )
{
    return
        (
        x.first < y.first ||
        (!(y.first < x.first) && x.second < y.second)
        );
}
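Re-running the same check against the corrected comparison (again sketched as a free function over a concrete pair of ints) confirms it behaves as a strict weak ordering should:

```cpp
// The corrected lexicographic comparison, on a concrete pair of ints.
struct IntPair { int first; int second; };

bool fixed_less(const IntPair& x, const IntPair& y)
{
    return x.first < y.first ||
           (!(y.first < x.first) && x.second < y.second);
}
```

Now exactly one of fixed_less(p1, p2) and fixed_less(p2, p1) is true for p1 = (3, 5) and p2 = (2, 6), and fixed_less(p1, p1) is false (irreflexivity).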

STL Vector Use

I recently wrote a piece of code that looked something like the following:

static const int NUM_TOTAL_VALUES = ...;
typedef ... T;

// Create vec and reserve NUM_TOTAL_VALUES spaces for later insertion
std::vector<T> vec(NUM_TOTAL_VALUES);

// Insert values into vec
for (int i = 0; i != NUM_TOTAL_VALUES; ++i)
    vec.push_back(...);

// vec should now have NUM_TOTAL_VALUES values in it (but doesn't!)

What’s wrong with this code?

The constructor vector(size_type _Count); does more than just allocate enough space to store _Count items — it also inserts _Count (default constructed) items into the vector. To reserve space without actually inserting values, use reserve():

static const int NUM_TOTAL_VALUES = ...;
typedef ... T;

std::vector<T> vec;
vec.reserve(NUM_TOTAL_VALUES);

for (int i = 0; i != NUM_TOTAL_VALUES; ++i)
    vec.push_back(...);

// vec now has NUM_TOTAL_VALUES values in it, as intended.
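The difference is easy to observe through size(). Here is a minimal illustration (the helper function is my own, for demonstration only):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Returns {size of a vector built with the sized constructor,
//          size of a vector after reserve() only}.
std::pair<std::size_t, std::size_t> vector_sizes()
{
    std::vector<int> sized(5);   // inserts five default-constructed (zero) ints
    std::vector<int> reserved;
    reserved.reserve(5);         // allocates capacity only; size stays 0
    return { sized.size(), reserved.size() };
}
```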

Managed Wrappers and Hidden Interdependencies

Let’s say you have the following unmanaged code:

#pragma unmanaged

class Stream { ... }; // Conceptual stream class

class StreamWriter
{
public:
    StreamWriter(Stream* pStream) : m_pStream(pStream) {}
    ~StreamWriter() { /* Use m_pStream in some way */ }

    ...
private:
    Stream* m_pStream;
};

void f()
{
    Stream stream;
    StreamWriter streamWriter(&stream);

    // Use streamWriter

    // streamWriter is destroyed
    // stream is destroyed
}

Note that StreamWriter’s destructor uses m_pStream (perhaps by flushing the stream). This means that the order of destruction is important — StreamWriter must be destroyed before its underlying Stream is.

Now let’s try to write and use some simple managed C++ wrappers for these classes:

#pragma managed

public __gc class ManagedStream
{
public:
    ManagedStream() : m_pStream(new Stream) {}

    // NOTE: This is a finalizer, not a deterministic destructor
    ~ManagedStream() { delete m_pStream; }

public private: // Make accessible by ManagedStreamWriter
    Stream __nogc* m_pStream;
};

public __gc class ManagedStreamWriter
{
public:
    ManagedStreamWriter(ManagedStream* pStream) :
        m_pStreamWriter(new StreamWriter(pStream->m_pStream)) {}

    // NOTE: This is a finalizer, not a deterministic destructor
    ~ManagedStreamWriter() { delete m_pStreamWriter; }

private:
    StreamWriter __nogc* m_pStreamWriter;
};

void f()
{
    ManagedStream stream = __gc new ManagedStream();
    ManagedStreamWriter streamWriter = __gc new ManagedStreamWriter(stream);

    // Use streamWriter

    // GC will clean up stream and streamWriter
}

See the problem?

As I (gratuitously) hinted above, we have a problem due to nondeterministic destruction. Since the GC does not define the order in which it will destroy managed objects, we cannot guarantee that the ManagedStreamWriter will be destroyed before the ManagedStream. If the ManagedStream is destroyed first, then its Stream will be deleted before the ManagedStreamWriter’s StreamWriter destructor is called. This means that StreamWriter will be using a deleted pointer, a sure recipe for disaster.

I can think of a few possible solutions to this problem:

  1. Have the managed classes implement IDisposable, and require developers to use it to achieve deterministic destruction. The main downside to this approach is “what if developers forget?”
  2. Recapture the interdependencies among unmanaged classes in the managed wrappers. For example, add a reference in ManagedStreamWriter to the ManagedStream object. This will force the GC to properly order their destruction, and is probably the right way to go.