Saturday, January 15, 2011

Algorithm Fail

Part of the work on the rules engine last year involved writing an algorithm to find a pattern of events in an event stream. The problem seemed simple to solve, initially, but the algorithm grew more elaborate as I ran different event streams through it.

The algorithm takes two inputs: an event pattern such as "event A followed by event B, followed by event C", and a linear stream of events, with a type and time of occurence:

A 10:00:02
B 10:00:05
B 10:00:15
C 10:00:20
C 10:00:25
A 10:00:30
A 10:00:35
A 10:00:40
B 10:00:45
C 10:00:50
B 10:00:55
C 10:00:59

The list above represents 12 events that happen within a one minute time span. The algorithm needs to find the events in the stream that match a pattern, within this time span or any other. It also needs to provide an option to find the least recent or the most recent occurrence of each type of event in the pattern.

To find the least recent occurrence of A, B, and C in a stream of events, the simplest approach is to loop once through each of the elements in an array representing the sequence, and within each loop, scan the event stream until there's an event that matches the type in the sequence.
String[] sequence = { "A", "B", "C" };
for (String t: sequence) {
  for (Event e: events) {
      if (e.getType().equals(t)) {
         System.out.println(e);
         break;
      }
  }
}
One pass through the algorithm produces:

A 10:00:02
B 10:00:05
C 10:00:20

This tiny algorithm also finds the most recent occurrence of A, B, and C in a stream of events, if we reverse the order of the event list (sort the list in descending order by timestamp), and give it the reversed event list:
Collections.reverse(events);
With the event list reversed, the algorithm produces:

A 10:00:40
B 10:00:55
C 10:00:59

So far, so good: the algorithm is efficient, doesn't do any more than it needs to, and gives us what we want. But real-time events don't always come in neat sequences. This approach to find the most recent occurrence of A, B, and C breaks down with just a minor variation in the stream, for example, if A happens right before the last C. The following partial stream of events from the last 20 seconds of the minute would look like this:

...
A 10:00:40
B 10:00:45
C 10:00:50
B 10:00:55
A 10:00:56 <--
C 10:00:59

With this minor variation in the stream, the simple algorithm produces:

A 10:00:56
B 10:00:55
C 10:00:59

While the events selected by the algorithm are the most recent of each event type we're looking for, the output is not in sequence. As you can see, the instance of B in the output occurs one second before the instance of A. This is a problem.

The same strategy used to find the least recent occurrence of A, B, and C also breaks down if B is the first event to occur within this time span of events. In ascending order, the stream of events from the first 20 seconds of the minute would then be:

B 10:00:00 <--
A 10:00:02
B 10:00:05
B 10:00:15
C 10:00:20
...

Given this minor variation in the event stream, the algorithm generates:

A 10:00:02
B 10:00:00
C 10:00:20

Again, the output represents the least recent of each event type in the pattern, but the instance of B in the output occurs two seconds prior to the instance of A, so the algorithm is flawed because it loses track of the time sequencing when it moves to the next element in the pattern and breaks the sequence. We need a better strategy.

The next post explains a better, slightly more elaborate strategy.

No comments:

Post a Comment