If you have built professional websites or mobile apps, you've probably dealt with many different web analytics tools that track user interactions and funnels. They all have their strengths and weaknesses, but one thing they have in common: we found it uncanny how hard those tools make it to access the raw data we allow them to collect. It's all fine and dandy if you are content with their charting tools and strictly defined APIs. Anything more customized, though? Consider yourself lucky if you can wrestle your way out with some Frankensteiny tool such as JQL.
One sunny day, while pulling my hair out over funnels as usual, I came across a video about how game developers track user interactions in video games.
[Embedded video — it is safe to skip the video and continue reading.]
The video mentioned that to perform intricate analysis on how players interact with the in-game environment, game devs need data-collection pipelines that are easy to customize for different types of data. The central piece of such a pipeline is AWS Kinesis Firehose. I decided to give it a try right away to avoid further hair pulling. And my CTO pays the AWS bills, so why not.
We have been using AWS Redshift for a while, and loading data into it is no easy task. But if all you need is to load immutable data into Redshift, you are in luck: as long as you can be sure the data never needs updating once it's populated, Kinesis Firehose makes the import extremely simple. Storing user interaction events is exactly such a use case.
When you set up a Kinesis Firehose delivery stream, you can choose which Redshift cluster to dump the data into. Provide it with the appropriate COPY command and you are mostly done! We are sending JSON records, so the COPY command looks roughly like the following (the table name, bucket, and IAM role here are placeholders; substitute your own):
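```sql
-- MANIFEST: load the files listed in the manifest that Firehose wrote to S3.
-- JSON 'auto': map JSON keys to Redshift columns by name.
COPY user_events
FROM 's3://your-intermediate-bucket/path/to/manifest'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/your-firehose-role'
MANIFEST
JSON 'auto';
```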
As you’d guessed, Kinesis Firehose pipes the data into S3 first, batching them into separate files, and load them into Redshift. Firehose also creates manifest files, so that Redshift knows which file to load. With the manifest files it wont load the same file twice. You can configure the size of the file for each batch.
On the other end of the Firehose, you can choose the source of the data: either a Kinesis stream or direct PUT requests. To let our frontend JavaScript send data to it securely, we set up an AWS Lambda function and an API Gateway in front of our Kinesis Firehose.
We created an AWS Lambda function with an API Gateway trigger. API Gateway is set to accept HTTP POST at a secret route, and the Lambda function is really just out-of-the-box code to feed the POST request into Kinesis Firehose. A minimal sketch (the delivery stream name is a placeholder, and the handler assumes an API Gateway proxy integration):
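```javascript
const AWS = require('aws-sdk'); // already available in the Lambda Node.js runtime

const firehose = new AWS.Firehose();

// Placeholder name; use your own delivery stream.
const DELIVERY_STREAM = 'user-events-to-redshift';

// Add a server-side timestamp, then hand the record to Firehose unchanged.
function deliverToRedshift(data) {
  const record = Object.assign({}, data, { timestamp: new Date().toISOString() });
  return firehose.putRecord({
    DeliveryStreamName: DELIVERY_STREAM,
    Record: { Data: JSON.stringify(record) + '\n' },
  }).promise();
}

exports.handler = async (event) => {
  await deliverToRedshift(JSON.parse(event.body));
  return { statusCode: 200, body: JSON.stringify({ ok: true }) };
};
```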
aws-sdk is available for AWS Lambda, so you don't have to zip and upload it yourself. Inside the deliverToRedshift function, other than adding a timestamp, the data is fed into Firehose without any other changes.
When setting up Kinesis Firehose, you may have noticed that you can also provide an AWS Lambda function to transform source records before delivery. If you do need data transformation, this is the best place for it.
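For reference, a transformation Lambda receives a batch of base64-encoded records and must return each one with a status. Here is a minimal sketch (the processed_at field added below is just an example transformation):

```javascript
// Minimal Firehose transformation Lambda: decode, modify, re-encode each record.
exports.handler = async (event) => {
  const records = event.records.map((record) => {
    const payload = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));
    payload.processed_at = new Date().toISOString(); // example transformation
    return {
      recordId: record.recordId,
      result: 'Ok', // or 'Dropped' / 'ProcessingFailed'
      data: Buffer.from(JSON.stringify(payload) + '\n').toString('base64'),
    };
  });
  return { records };
};
```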
This pipeline loads data into Redshift in near real time, normally with no more than 5 to 10 minutes of delay, depending on your configured batch size. The Firehose console provides some nice charts to monitor delivery rates and raises alarms if something goes wrong.
This article described the general architecture of our new data pipeline for user interaction tracking. It took me about one day to learn and set up, and it has been online without any issues ever since.
By pumping frontend event data into Redshift, you not only regain control of your data, but could also save a good chunk of money by switching away from frontend analytics SaaS*. One last thing worth mentioning: the JS libraries from those analytics SaaS products are prone to being ad-blocked, and if not designed carefully, that can cause collateral damage to the real functionality of the web page. That's not a concern for us anymore!
*You might have to write some SQL queries.