Monday, January 17, 2011

Algorithm Win

A slightly more elaborate version of the event sequencing algorithm described in my previous post, expands on the tiny algorithm, and does a much better job at finding the most recent occurence of each event type in a pattern, without losing track of the time sequence.

This version of the algorithm uses a temporary data structure, an ArrayList<Event>, to store the result. Use of a temporary data structure allows the algorithm to keep track of the last item added into the result and to compare it to the events in the stream.

The other important distinction between this version of the algorithm and the tiny one is the order in which the pattern itself is processed: in reverse.

The most efficient way to build a sequential list of events that match a pattern, and that represent the most recent occurrences of each event type in the pattern, is to loop once through an array representing the pattern (in reverse), and within each loop, scan the event list (in descending order by timestamp) until there's an event that matches the first element out of the reversed pattern. Like this:
String[] sequence = {"C", "B", "A"};
List result = new ArrayList();
for (String type: sequence) {
   for (Event e: events) {
      if (e.getType().equals(type)) {
         if (result.isEmpty()) {
            result.add(e);
            break;
         } else {
            Event x = result.get(result.size() - 1);
            if (e.compareTo(x) < 0) {
               result.add(e);
               break;
            }
         }
      }
   }
}
The first pass through these two steps will always yield the most recent occurence of the last element in the pattern, the most recent C. The algorithm knows that it has found this particular item when an event matches the type in the current pattern element, and the result list is empty. When these two conditions are true, the result list may grow by one.
if (e.getType().equals(type)) {
   if (result.isEmpty()) {
      result.add(e);
      break;
   }
...
}
The first step then is to loop over the event list until the first occurrence of C is matched. This event is always the first to be added to the result list. It is not neccessary to continue looping over event list at this point looking for any more instances of C because all other instances of C in the list are older and can safely be skipped. Breaking the loop at this point saves the algorithm from performing unnecessary work.

Moving onto the next element in the pattern, the next step is to find the most recent B that is older than the C that is in the result list. If/when this event is matched while scanning the event list, we can safely add it to the result, break the loop, and move onto the next element in the pattern. The key to maintaining the time sequence in the result is simple: don't add an event to the result list unless it's timestamp is earlier than the timestamp of the last event added.
Event x = result.get(result.size() - 1);
if (e.compareTo(x) < 0) {
   result.add(e);
   break;
}
This step guarantees that the most recent B that has an earlier timestamp than the C that is in the result list, will be added to the result. The same step is applied to the next element in the pattern: the most recent A that has an earlier timestamp than the B that is in the result list will be added to the result.

The algorithm continues to move through the elements in the pattern, compares the timestamps of the events in the list to the last element in the result, and breaks (stops) the event list loop when a match is found. These steps are repeated until the end of the pattern is reached.

Even if the last 20 seconds of the minute long event stream look like this:

...
A 10:00:40
B 10:00:45
C 10:00:50
B 10:00:55
A 10:00:56 <--
C 10:00:59

This version of the algorithm generates the expected output:

A 10:00:40
B 10:00:55
C 10:00:59

And the flaw in the tiny version of the algorithm is removed. FTW.

One thing to keep in mind is that not all event streams show a pattern. What if the only event that occurs during the minute long event stream is A?

A 10:00:00
A 10:00:02
A 10:00:05
A 10:00:10
A 10:00:15
A 10:00:20
...

The algorithm will find the most recent A, no problem, but because there is no pattern A > B > C in the stream, an additional check should be added to validate the size of the result against the size of the pattern. The result list is valid, and should be output, only when it has as many elements as the pattern:
if (result.size() == sequence.length) {
   Collections.reverse(result);
   for (Event e: result) System.out.println(e);
}

No comments:

Post a Comment