29 Nov Business data anomaly detection using stream-based processing and Amazon Kinesis Analytics
Recently we’ve been performing “on the fly” anomaly detection on a stream of business data to look for exceptions that a human monitoring process cannot identify, whether because of the quality/capability required or the sheer volume. Simple thresholding is also too crude to identify the anomalies we are looking for, so a more sophisticated approach is called for. Business scenarios where this might be useful include identifying unusual usage patterns from users that might indicate fraud or inefficient usage of assets (e.g. the business equivalent of “leaving the lights on”).
This exploits a really interesting built-in function in the SQL-like language used by Amazon Kinesis Analytics called RANDOM_CUT_FOREST, which is derived from the work described in this paper. There are cloud-based anomaly detection approaches aimed at operating on batches of data, like these SVM and PCA-based ones from Azure, but we’re specifically interested in stream-based approaches in this post.
Here’s an example of anomalies plotted using Kibana for time-series data against a multi-dimensional feature space – we have all our data from the output Kinesis stream loaded into the new Elasticsearch v5 release. Note that only the most significant feature is plotted on the left-hand axis for simplicity, and that in all these plots the anomaly scores have been multiplied by 100 (shown on the right-hand axis in this plot).
Interpreting Business Data Anomaly Scores
The Random Cut Forest algorithm takes the following parameters:
RANDOM_CUT_FOREST (inputStream, numberOfTrees, subSampleSize, timeDecay, shingleSize)
The anomaly scores have to be interpreted on a case-by-case basis depending on the business data and the parameter configuration, which controls sensitivity, but it’s useful to know what the maximum extent is to ease interpretation. It’s not clear from the documentation, but from this forum discussion you can see that the algorithm produces anomaly scores ranging up to log2(<<subSampleSize parameter>>), so for this example, with a subSampleSize of 10, that’s log2(10) = 3.321928.
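As a quick sanity check on that upper bound, here is a minimal Python sketch. It assumes the log2(subSampleSize) bound reported in the forum discussion holds; the function names are made up for illustration and are not part of Kinesis Analytics.

```python
import math


def max_anomaly_score(sub_sample_size: int) -> float:
    """Upper bound on the anomaly score, assuming it is log2(subSampleSize)."""
    if sub_sample_size < 2:
        raise ValueError("sub_sample_size must be at least 2")
    return math.log2(sub_sample_size)


def scaled_for_plot(score: float, factor: float = 100.0) -> float:
    """Apply the same x100 scaling used for the plots in this post."""
    return score * factor


bound = max_anomaly_score(10)
print(round(bound, 6))                   # 3.321928
print(round(scaled_for_plot(bound), 1))  # 332.2
```

With the scaling applied, the plotted scores for this example should therefore top out at roughly 332.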
Here’s the distribution of anomaly scores that backs up that prediction (remember these values are multiplied by 100). As you’d expect, most items are not anomalous, and there’s a long tail of a few records that are obviously more anomalous.
Limitations to be aware of
Looking at the graph above, you can see a small number of zero anomaly scores. This is because the sliding window algorithm needs a certain amount of historical data to “pump prime” before it can start giving reasonable scores, which makes sense. Looking at this in a time-series view, you can see that all the zero scores occur at the start of a stream processing run.
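If those warm-up zeros get in the way of downstream analysis, one option is to drop the leading run of zero scores before charting or alerting. The following is only a sketch in Python, assuming records arrive in time order as (timestamp, score) pairs; the record shape and the function name are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, float]  # (timestamp, anomaly score), a hypothetical shape


def skip_warmup(records: Iterable[Record]) -> Iterator[Record]:
    """Drop the initial run of zero scores; keep everything after it."""
    warmed_up = False
    for timestamp, score in records:
        if not warmed_up and score == 0.0:
            continue  # still inside the warm-up period
        warmed_up = True
        yield timestamp, score


sample = [(1, 0.0), (2, 0.0), (3, 0.8), (4, 0.0), (5, 1.2)]
print(list(skip_warmup(sample)))  # [(3, 0.8), (4, 0.0), (5, 1.2)]
```

Only the zeros at the start are dropped. A zero score that appears later is passed through, since it is not necessarily a warm-up artefact.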
Another thing to be aware of, which is obvious in the documentation if you read it carefully but is easy to make an incorrect assumption about, is that the algorithm does not support categorical variables. If you pass them in, it doesn’t complain: it just ignores them and returns them unchanged in the output of the SQL expression. Only numeric types are supported, so if you want to include categorical data in your anomaly detection then you need to use a technique like one-hot encoding (there’s a good discussion of alternative approaches here) to convert them to numerical data first. That’s something I’ve not tried out yet to see how this specific algorithm handles it. One issue here is that we’re limited to 30 input variables, and the column count will typically blow up quickly once categorical variables are encoded.
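To illustrate why the 30-variable limit bites, here is a minimal Python sketch of one-hot encoding done before records reach the stream. It is not something I have run against RANDOM_CUT_FOREST; the helper names and the example column counts are hypothetical, and the limit of 30 is simply the figure quoted above.

```python
from typing import List, Sequence

MAX_INPUT_COLUMNS = 30  # input variable limit quoted in this post


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Encode one categorical value as a 0/1 vector, one slot per category."""
    return [1.0 if value == category else 0.0 for category in categories]


def encoded_width(numeric_columns: int, category_sizes: Sequence[int]) -> int:
    """Total numeric columns once every categorical column is one-hot encoded."""
    return numeric_columns + sum(category_sizes)


def fits_limit(numeric_columns: int, category_sizes: Sequence[int]) -> bool:
    """True if the encoded record stays within the input column limit."""
    return encoded_width(numeric_columns, category_sizes) <= MAX_INPUT_COLUMNS


print(one_hot("b", ["a", "b", "c"]))  # [0.0, 1.0, 0.0]
print(encoded_width(4, [12, 20]))     # 36
print(fits_limit(4, [12, 20]))        # False
```

In this made-up case, four numeric features plus two categorical columns with 12 and 20 distinct values already need 36 columns, which is over the limit.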
Charging for the service is by Kinesis Processing Units (KPU) per hour, which at the time of writing was $0.12 per KPU-hour. As with many things cloud, my run costs are pretty trivial, although it obviously depends on how much data you’ve got passing through your Kinesis stream. My application happily runs with a single KPU. One slight grumble at the moment is that you can’t tell how many KPUs have been automatically allocated by the AWS platform at run time. You can only find out retrospectively by looking at your billing data. AWS tell me that it’s “a known issue that we will be addressing as soon as we can”, so hopefully we’ll have a CloudWatch metric for it soon. In practice it’s not a big problem if your workload is relatively steady and/or predictable – and then the costs are probably relatively trivial anyway.
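For a rough feel of what that pricing means, here is a back-of-the-envelope sketch in Python. It assumes a constant number of KPUs and the $0.12 per KPU-hour figure quoted above; the function name and the running periods are illustrative only, and current AWS pricing may differ.

```python
PRICE_PER_KPU_HOUR = 0.12  # USD, the figure quoted in this post


def estimated_cost(kpus: int, hours: float,
                   price_per_kpu_hour: float = PRICE_PER_KPU_HOUR) -> float:
    """Estimated charge in USD for a steady number of KPUs over a period."""
    return kpus * hours * price_per_kpu_hour


# A single KPU running continuously for one day.
print(f"{estimated_cost(1, 24):.2f}")  # 2.88
```

Because the number of allocated KPUs can only be seen retrospectively in the billing data, an estimate like this is only as good as the assumption that the allocation stays constant.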
In the interests of balance, for a Microsoft Azure/Google perspective here are some other links to look at. Everyone’s at it!