Showing posts with label algorithms. Show all posts
Showing posts with label algorithms. Show all posts

Tuesday, January 18, 2011

Sequencing Algorithm, Least Recent

One of the problems mentioned in a previous post, remains unresolved. In my last post, I describe a better approach to find the most recent events in an event stream while matching a pattern. This post describes an approach to find the earliest instances of each of the elements of a pattern in a stream of events, or the least recent events.

The algorithm to find the most recent events performs essentially the same steps as the algorithm that finds the least recent ones. The key difference between the two is that the latter reads the event stream in ascending order by timestamp, and processes the elements of the pattern in order.

Like this:
String[] sequence = {"A", "B", "C"};
List<Event> result = new ArrayList<Event>();
for (String type: sequence) {
   for (Event e: events) {
      if (e.getType().equals(type)) {
         if (result.isEmpty()) {
            result.add(e);
            break;
         } else {
            Event x = result.get(result.size() - 1);
            if (e.compareTo(x) > 0) {
               result.add(e);
               break;
            }
         }
      }
   }
}
The first step processes the first element in the pattern, A, and loops over the event list until the first instance of A is matched. This event is always the first one to make it into the result list. No need to continue scanning at this point. Since the algorithm scans the event list in ascending order, all other instances of A are more recent, and should be skipped.

The next step is to find the least recent B that occurs after the A in the result list. If the least recent B has a timestamp greater than the timestamp of the A that is in the result list, then it is "in sequence"  chronologically, and may safely be added to the result:
Event x = result.get(result.size() - 1);
if (e.compareTo(x) > 0) {
   result.add(e);
   break;
}
The next element in the pattern is processed in the same manner: the least recent C that occurs after the B that is in the result list will be added. These steps are repeated until each element of the pattern is processed in order.

If the first 20 seconds of the minute long event stream we've been using as a test case looks like this:

B 10:00:00 <--
A 10:00:02
B 10:00:05
B 10:00:15
C 10:00:20
...

This version of the algorithm generates the expected output:

A 100002
B 100005
C 100020

Monday, January 17, 2011

Algorithm Win

A slightly more elaborate version of the event sequencing algorithm described in my previous post, expands on the tiny algorithm, and does a much better job at finding the most recent occurence of each event type in a pattern, without losing track of the time sequence.

This version of the algorithm uses a temporary data structure, an ArrayList<Event>, to store the result. Use of a temporary data structure allows the algorithm to keep track of the last item added into the result and to compare it to the events in the stream.

The other important distinction between this version of the algorithm and the tiny one is the order in which the pattern itself is processed: in reverse.

The most efficient way to build a sequential list of events that match a pattern, and that represent the most recent occurrences of each event type in the pattern, is to loop once through an array representing the pattern (in reverse), and within each loop, scan the event list (in descending order by timestamp) until there's an event that matches the first element out of the reversed pattern. Like this:
String[] sequence = {"C", "B", "A"};
List result = new ArrayList();
for (String type: sequence) {
   for (Event e: events) {
      if (e.getType().equals(type)) {
         if (result.isEmpty()) {
            result.add(e);
            break;
         } else {
            Event x = result.get(result.size() - 1);
            if (e.compareTo(x) < 0) {
               result.add(e);
               break;
            }
         }
      }
   }
}
The first pass through these two steps will always yield the most recent occurence of the last element in the pattern, the most recent C. The algorithm knows that it has found this particular item when an event matches the type in the current pattern element, and the result list is empty. When these two conditions are true, the result list may grow by one.
if (e.getType().equals(type)) {
   if (result.isEmpty()) {
      result.add(e);
      break;
   }
...
}
The first step then is to loop over the event list until the first occurrence of C is matched. This event is always the first to be added to the result list. It is not neccessary to continue looping over event list at this point looking for any more instances of C because all other instances of C in the list are older and can safely be skipped. Breaking the loop at this point saves the algorithm from performing unnecessary work.

Moving onto the next element in the pattern, the next step is to find the most recent B that is older than the C that is in the result list. If/when this event is matched while scanning the event list, we can safely add it to the result, break the loop, and move onto the next element in the pattern. The key to maintaining the time sequence in the result is simple: don't add an event to the result list unless it's timestamp is earlier than the timestamp of the last event added.
Event x = result.get(result.size() - 1);
if (e.compareTo(x) < 0) {
   result.add(e);
   break;
}
This step guarantees that the most recent B that has an earlier timestamp than the C that is in the result list, will be added to the result. The same step is applied to the next element in the pattern: the most recent A that has an earlier timestamp than the B that is in the result list will be added to the result.

The algorithm continues to move through the elements in the pattern, compares the timestamps of the events in the list to the last element in the result, and breaks (stops) the event list loop when a match is found. These steps are repeated until the end of the pattern is reached.

Even if the last 20 seconds of the minute long event stream look like this:

...
A 10:00:40
B 10:00:45
C 10:00:50
B 10:00:55
A 10:00:56 <--
C 10:00:59

This version of the algorithm generates the expected output:

A 10:00:40
B 10:00:55
C 10:00:59

And the flaw in the tiny version of the algorithm is removed. FTW.

One thing to keep in mind is that not all event streams show a pattern. What if the only event that occurs during the minute long event stream is A?

A 10:00:00
A 10:00:02
A 10:00:05
A 10:00:10
A 10:00:15
A 10:00:20
...

The algorithm will find the most recent A, no problem, but because there is no pattern A > B > C in the stream, an additional check should be added to validate the size of the result against the size of the pattern. The result list is valid, and should be output, only when it has as many elements as the pattern:
if (result.size() == sequence.length) {
   Collections.reverse(result);
   for (Event e: result) System.out.println(e);
}

Saturday, January 15, 2011

Algorithm Fail

Part of the work on the rules engine last year involved writing an algorithm to find a pattern of events in an event stream. The problem seemed simple to solve, initially, but the algorithm grew more elaborate as I ran different event streams through it.

The algorithm takes two inputs: an event pattern such as "event A followed by event B, followed by event C", and a linear stream of events, with a type and time of occurence:

A 10:00:02
B 10:00:05
B 10:00:15
C 10:00:20
C 10:00:25
A 10:00:30
A 10:00:35
A 10:00:40
B 10:00:45
C 10:00:50
B 10:00:55
C 10:00:59

The list above represents 12 events that happen within a one minute time span. The algorithm needs to find the events in the stream that match a pattern, within this time span or any other. It also needs to provide an option to find the least recent or the most recent occurrence of each type of event in the pattern.

To find the least recent occurrence of A, B, and C in a stream of events, the simplest approach is to loop once through each of the elements in an array representing the sequence, and within each loop, scan the event stream until there's an event that matches the type in the sequence.
String[] sequence = { "A", "B", "C" };
for (String t: sequence) {
  for (Event e: events) {
      if (e.getType().equals(t)) {
         System.out.println(e);
         break;
      }
  }
}
One pass through the algorithm produces:

A 10:00:02
B 10:00:05
C 10:00:20

This tiny algorithm also finds the most recent occurrence of A, B, and C in a stream of events, if we reverse the order of the event list (sort the list in descending order by timestamp), and give it the reversed event list:
Collections.reverse(events);
With the event list reversed, the algorithm produces:

A 10:00:40
B 10:00:55
C 10:00:59

So far, so good: the algorithm is efficient, doesn't do any more than it needs to, and gives us what we want. But real-time events don't always come in neat sequences. This approach to find the most recent occurrence of A, B, and C breaks down with just a minor variation in the stream, for example, if A happens right before the last C. The following partial stream of events from the last 20 seconds of the minute would look like this:

...
A 10:00:40
B 10:00:45
C 10:00:50
B 10:00:55
A 10:00:56 <--
C 10:00:59

With this minor variation in the stream, the simple algorithm produces:

A 10:00:56
B 10:00:55
C 10:00:59

While the events selected by the algorithm are the most recent of each event type we're looking for, the output is not in sequence. As you can see, the instance of B in the output occurs one second before the instance of A. This is a problem.

The same strategy used to find the least recent occurrence of A, B, and C also breaks down if B is the first event to occur within this time span of events. In ascending order, the stream of events from the first 20 seconds of the minute would then be:

B 10:00:00 <--
A 10:00:02
B 10:00:05
B 10:00:15
C 10:00:20
...

Given this minor variation in the event stream, the algorithm generates:

A 10:00:02
B 10:00:00
C 10:00:20

Again, the output represents the least recent of each event type in the pattern, but the instance of B in the output occurs two seconds prior to the instance of A, so the algorithm is flawed because it loses track of the time sequencing when it moves to the next element in the pattern and breaks the sequence. We need a better strategy.

The next post explains a better, slightly more elaborate strategy.