Wednesday, January 12, 2011

Rounding Error

During regression tests earlier this week, our test team spotted a bug in our rules engine. I spent part of the day today getting to the bottom of it.

The rules engine can be set up to monitor event data as it streams in and to identify event patterns. The engine tries to find a pattern defined by a rule in a stream of events. In this sense, rules are to streams of events as regular expressions are to strings of text.

The test team registered a rule for a simple pattern to look for: event A, followed by event B, followed by event C, all occurring within a two-minute time span. Pretty simple, straightforward stuff. The bug looked suspicious to me from the start because it fell into an area of code that had received considerable attention during our integration tests and that I considered solid.
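To make the rule concrete, here is a minimal Python sketch of that kind of match. It is not our engine's code: the function name `find_sequence`, the `(kind, timestamp)` tuple layout, and the choice to measure the window from event A are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def find_sequence(events, pattern=("A", "B", "C"), window=timedelta(minutes=2)):
    """Return the first events matching `pattern` in order within `window`, or None.

    `events` is a list of (kind, timestamp) tuples (a hypothetical layout).
    """
    # Order events by time before looking for the pattern.
    ordered = sorted(events, key=lambda e: e[1])
    for i, (kind, start) in enumerate(ordered):
        if kind != pattern[0]:
            continue
        matched = [ordered[i]]
        for later in ordered[i + 1:]:
            # Stop once we are past the time window that opened at `start`.
            if later[1] - start > window:
                break
            if later[0] == pattern[len(matched)]:
                matched.append(later)
                if len(matched) == len(pattern):
                    return matched
    return None

t0 = datetime(2011, 1, 12, 10, 0, 0)  # made-up timestamps
in_window = [("A", t0), ("B", t0 + timedelta(seconds=30)), ("C", t0 + timedelta(seconds=90))]
too_late = [("A", t0), ("B", t0 + timedelta(seconds=30)), ("C", t0 + timedelta(seconds=150))]

print(find_sequence(in_window) is not None)  # True
print(find_sequence(too_late))               # None
```

Note that the match depends entirely on the events being sorted correctly first, which is why the sort key matters so much.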

The expected output from the engine, given this rule definition, is a set of detail records that capture the participant events in the sequence, in order. The engine creates a detail record for A, B, and C plus a parent D record. But the test team discovered that in some cases, detail records captured a slightly different sequence, most of the time with events A and B inverted.

After looking over the code in question and taking a closer look at the event data, I found that the problem was a rounding error. At some point we had decided to round down centiseconds from DB2 timestamp values before sorting events, because we considered sub-second precision insignificant for real-time events. But that rounding was precisely the root of the problem today: we were seeing event timestamps that differed only by hundredths of a second. Once rounded down, such events can end up with identical timestamps, and the sort can no longer tell which one came first.
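The effect is easy to reproduce outside the engine. The following Python sketch uses made-up timestamps and a hypothetical helper, `truncate_to_second`, and it assumes the events reach the sort with B ahead of A. Because Python's sort is stable, events with equal keys keep their input order, so the truncated sort leaves A and B inverted.

```python
from datetime import datetime

def truncate_to_second(ts):
    """Drop the fractional part of a timestamp (hypothetical stand-in for our rounding)."""
    return ts.replace(microsecond=0)

# Made-up events: A really happens before B, but only by a fraction of a second,
# and they arrive in the input with B listed first.
events = [
    ("B", datetime(2011, 1, 12, 10, 15, 30, 870000)),
    ("A", datetime(2011, 1, 12, 10, 15, 30, 420000)),
    ("C", datetime(2011, 1, 12, 10, 15, 31, 50000)),
]

# Sorting on the truncated value: A and B tie, so their input order survives.
by_truncated = sorted(events, key=lambda e: truncate_to_second(e[1]))

# Sorting on the full-precision value recovers the true order.
by_full = sorted(events, key=lambda e: e[1])

print([name for name, _ in by_truncated])  # ['B', 'A', 'C']
print([name for name, _ in by_full])       # ['A', 'B', 'C']
```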
