You Can’t Improve What You Don’t Measure. At Battlefy we try our best to make our data as transparent as possible across the whole organization. In this article I’ll describe the data pipeline we currently run to support this goal.
To make our data accessible not only to developers, but also to product managers, designers, sales, and marketing people, we have integrated with an analytics tool, PeriscopeData. It makes creating beautiful data visualizations and auto-updating dashboards dead easy for anyone: all it takes is a few lines of SQL and a couple of clicks. PeriscopeData works with most SQL databases.
In our first ETL iteration, we set up a MySQL server as our analytics database. MongoDB ships with a mongoexport utility that can export MongoDB collections into CSV or JSON files. However, we wanted to do some transformation and cleansing before loading the data into the analytics database. Hence, in this iteration, we created two sets of scripts: the first exports MongoDB’s data into CSV files, either one per collection or describing many-to-many relationships; the second batch-inserts the CSV data into the MySQL database. We run both sets of scripts periodically using Jenkins.
This simple setup worked perfectly at the beginning. As PeriscopeData enabled more people to fiddle with the data, our visualizations and queries grew in both number and complexity. In the meantime our OLTP data became exponentially larger, which we certainly won’t complain about. All of those factors took a toll on our little MySQL server, and eventually we saw more and more timed-out queries. At this point we adopted an easy upgrade: instead of hosting our own MySQL server, we loaded the data into Amazon Aurora. Although the number of timed-out queries dropped significantly at first, we still observed a steady increase as our transactional database grew bigger.
Aside from slow or timed-out queries, we faced a few other challenges. The main one was that the ETL took much longer to complete, mostly because of disk IO for CSV creation and network IO for the batch inserts. And because the Jenkins box also handles a few other housekeeping tasks, the batch-insert scripts would sometimes run out of memory as they loaded those huge CSV files.
After another ETL run failed because of the memory issue, we decided that enough was enough: it was time for a second iteration on our ETL pipeline. During the design phase we tested a few solutions, and we were simply amazed at the speed at which Amazon Redshift can run queries. When I was working with MySQL or even Aurora, I always had a few minutes between starting a SQL query and getting results back, which left me enough time throughout the day to argue in some heated conversations on Hacker News. With Redshift, however, most queries finished before I could even get the webpage to finish loading. Another bonus is that Redshift supports PostgreSQL 8.0’s SQL syntax, which gives us all the sweet, sweet window functions, common table expressions, and more.
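To give one example of what those features buy us, a CTE plus a window function computes a running total of daily signups in a single query. The table and column names here are hypothetical, not our actual schema:

```sql
-- Hypothetical schema: a users table with a created_at timestamp.
WITH daily AS (
  SELECT created_at::date AS day, COUNT(*) AS signups
  FROM users
  GROUP BY 1
)
SELECT day,
       signups,
       SUM(signups) OVER (ORDER BY day ROWS UNBOUNDED PRECEDING) AS running_total
FROM daily
ORDER BY day;
```

Note that Redshift requires the explicit frame clause (ROWS UNBOUNDED PRECEDING) for running aggregates with an ORDER BY in the window.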
Next we had to find the best way to load data into Redshift; PeriscopeData has some nice articles on the topic. The best approach we found was to save JSON files to AWS S3, one per Redshift table, and use the COPY command to load them in. For the easiest configuration, we took advantage of the JSON ‘auto’ option of the COPY command: Redshift tries to match table columns with JSON object keys, so we don’t have to set up the mapping manually. Another useful option is GZIP. When we generate a JSON file we gzip it immediately while uploading to S3, which saves a huge amount of network IO.
Our COPY command looks like this:
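Here is a sketch of such a command; the table name, S3 path, and IAM role below are placeholders, not our real ones:

```sql
-- Hypothetical table, bucket, and role; substitute real names and credentials.
COPY tournaments
FROM 's3://our-etl-bucket/tournaments.json.gz'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/redshift-copy'
JSON 'auto'
GZIP;
```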
The next step, or rather the previous step, is to save all the MongoDB data as gzipped JSON and upload it to S3. To reduce disk IO we simply don’t save the JSON to the server’s disk anymore. Instead, we use MongoDB’s handy stream API. In the MongoDB Node driver, both collection.find() and collection.aggregate() return cursors. collection.find()’s cursor has a documented stream method, and while AggregationCursor doesn’t have it documented yet, it supports the stream method too. The stream method accepts a transform function as a parameter, and we supply our own functions there to transform MongoDB documents into Redshift table rows.
One thing worth mentioning about the transform function: all it needs to return is a string that can later be parsed as JSON objects. That means we can actually transform a single MongoDB document into several Redshift entities/rows. Coupled with MongoDB 3.2’s left outer join ($lookup), this makes constructing many-to-many relationship tables a breeze.
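A minimal sketch of such a transform, assuming a hypothetical tournament document that embeds an array of team ids (our real schema and field names differ):

```javascript
// Hypothetical document shape: { _id: ..., teamIds: [...] }.
// One MongoDB document fans out into several Redshift rows,
// one per tournament-team pair.
function transform(doc) {
  return doc.teamIds
    .map(function (teamId) {
      return JSON.stringify({ tournament_id: doc._id, team_id: teamId });
    })
    .join('\n') + '\n';
}
```

Because COPY with JSON ‘auto’ reads whitespace-separated JSON objects, newline-delimited output like this loads straight in.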
The MongoDB stream method returns a readable stream. We pipe it through zlib to gzip it, then feed the resulting stream into the upload method of AWS’s S3 Node library, which gracefully accepts readable streams too.
With this streaming pipeline in place, we saw some nice improvements in disk and network IO, as well as memory usage. We have been relying on this setup for a while, and so far our ETL needs minimal maintenance.
Furthermore, by replacing CSV with JSON as the intermediate format, we no longer have to do format conversions, since MongoDB documents are essentially JSON documents. Working with JSON rather than CSV also imposes much less cognitive overhead on our team, as we deal with JSON in other parts of our systems on a daily basis.
There are still lots of opportunities for improvement. One of the biggest problems is that on each ETL run we erase all the old data and reload the whole database from scratch; we haven’t found an easy, reasonably priced way to load data into Redshift incrementally. If you have any thoughts on how our ETL could be improved, please share them with us!