At Battlefy we have experienced some remarkable user growth. Month over month, the number of tournaments our organizers run is growing at an accelerating rate, and so is the number of participants in those tournaments.
Quantitatively, the majority of our users are players who participate in tournament events. Rather than a smooth traffic pattern, we see sudden traffic peaks when large events kick off. Because of this usage pattern, servers normally don’t have enough time to spin up automatically once traffic starts rising.
On the other hand, if our website goes down in the middle of a tournament, the event cannot simply pause until the service comes back online. To keep tournaments fair and keep up with players’ schedules, many organizers have to reschedule part of a tournament, or all of it, to another date if our website goes down even briefly. We understand that our service is mission critical for our users.
With those two factors in mind, we have to design our system with additional reliability requirements — this is where performance testing comes into play.
Our overall architecture is three micro frontend client applications talking to an array of backend Node.js services via RESTful API calls. To ensure the scalability of our frontends, we deploy them to third-party CDNs. Our scalability and reliability concerns are mainly about our backend services and how they interact with our various databases.
To find out the performance parameters of our backend configurations, we wanted to record API traffic patterns during major tournament events, and then play them back with specific multipliers.
We set up logging in our backend for API requests and responses, and sent those logs into Sumo Logic. Sumo Logic provides log aggregation searches that tell us which API routes have been called, with what parameters, how many times, and with what response times. Then we just needed a tool that could play back those API calls.
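As a minimal sketch of what that logging could look like, assuming an Express backend (the log field names here are illustrative, not our exact schema): each request produces one JSON line on stdout, which a Sumo Logic collector can ingest and aggregate by route, parameters, status, and response time.

```js
const express = require('express');
const app = express();

// Log one JSON line per request once the response has been sent.
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    console.log(JSON.stringify({
      method: req.method,
      path: req.path,
      query: req.query,
      status: res.statusCode,
      responseTimeMs: Date.now() - start,
    }));
  });
  next();
});
```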
JMeter is one of the most popular stress testing tools, with an exhaustive list of features for nearly any stress testing need. We used it for our first performance testing attempt: we set up the test script via its GUI and ran the test both on a local machine and on an EC2 box. The test server quickly became saturated by network traffic.
This is to be expected, as we were using a single test server against a cluster of staging backend servers that resembles our production setup. The option JMeter offers here is distributed testing, where one JMeter master controls a number of JMeter slave servers.
The challenge with a more complicated JMeter setup is that we have to work with either a GUI, XML, or Java, none of which is easy for our JavaScript team. Furthermore, our stress testing objective is fairly straightforward: all we need is to make a ton of API calls. For us, the endless features JMeter provides are more overhead than help.
While we were playing with JMeter, we stayed open to other tools, and we found a much simpler one that met our requirements better: Bees with Machine Guns. Essentially, it spins up many micro EC2 instances to fire API calls at a load test target. It has a nice, simple command line interface, and you can use it as a Python library for more complex tasks.
Rather than diving deep into Python land from there, we wondered whether we could do something similar in Node.js, or even simpler. AWS Lambda had launched not long before, and Show HN was full of serverless architecture prototypes. We considered load testing a perfect use case for AWS Lambda: the famous cold start problem that plagues those serverless prototypes is irrelevant in a load test setting.
We quickly built a prototype that runs the loadtest library inside AWS Lambda functions, where each invocation makes as many API calls as it can. A single Lambda invocation has a fairly low resource limit, but it’s so easy to simply ask for an order of magnitude more concurrent invocations rather than fine-tuning around those limits. We did hit the concurrent invocation limit, but AWS support helpfully raised it to a number that is more than enough for us.
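A minimal sketch of such a handler, built on the loadtest library’s loadTest call (the event payload shape is our own illustrative convention, not part of the library):

```js
// Sketch of a load-generating Lambda handler using the `loadtest` library.
// Each invocation hammers one API route as hard as its resource limits allow.
const loadtest = require('loadtest');

exports.handler = (event, context, callback) => {
  const options = {
    url: event.url,                    // e.g. 'https://staging.example.com/tournaments'
    method: event.method || 'GET',
    body: event.body,
    contentType: 'application/json',
    headers: event.headers,            // a JWT goes here for authenticated calls
    maxRequests: event.maxRequests,    // how many calls this invocation should make
    concurrency: event.concurrency || 10,
  };
  loadtest.loadTest(options, (error, results) => {
    if (error) return callback(error);
    callback(null, results);           // latency and throughput stats per invocation
  });
};
```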
Our load test application performs the following tasks when it’s executed: it reads a recorded API traffic pattern, fans out the corresponding Lambda invocations scaled by the chosen playback multiplier, and collects the results once the invocations return. A sketch of the fan-out step follows.
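Here is what that fan-out might look like with the AWS SDK for Node.js; the function name and payload shape are hypothetical:

```js
// Sketch of the orchestration step: fire N asynchronous invocations of the
// load test Lambda, one per recorded call, multiplied by the playback factor.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-east-1' });

function fireLoadTest(recordedCalls, multiplier) {
  const invocations = [];
  recordedCalls.forEach((call) => {
    for (let i = 0; i < multiplier; i++) {
      invocations.push(lambda.invoke({
        FunctionName: 'load-test-worker',  // hypothetical function name
        InvocationType: 'Event',           // asynchronous, fire and forget
        Payload: JSON.stringify(call),
      }).promise());
    }
  });
  return Promise.all(invocations);
}
```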
The API calls those Lambda invocations make can use any HTTP method, with query parameters or request bodies. Some of those calls have to be made while the caller is logged in and authenticated. We use JWT tokens for authentication, and a token can be passed around easily.
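For example, an authenticated call in a recorded payload just needs a standard Authorization header; this is a sketch where the route, body, and token source are all illustrative:

```js
// The JWT comes from logging in a test user ahead of time
// (illustrative environment variable).
const jwtToken = process.env.TEST_USER_JWT;

const payload = {
  url: 'https://staging.example.com/api/check-in',  // illustrative route
  method: 'PUT',
  body: JSON.stringify({ checkedIn: true }),        // illustrative body
  headers: { Authorization: `Bearer ${jwtToken}` },
};
```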
By leveraging AWS Lambda, our load test application is tiny, easy to manipulate, and yet powerful enough. The cost of running these performance tests on AWS Lambda is also minimal compared with the alternatives. AWS Lambda tries to keep up with the latest Node.js runtimes, so we can work within our ES6 comfort zone.
With frequent load testing, we understand our architecture’s performance characteristics more deeply, which leads to performance improvements in both our existing features and new ones.
While essential, load testing is only a small piece of Battlefy’s overall reliability blueprint. We are constantly looking for new ways to improve our website’s performance and ultimately bring the best competitive gameplay experience to our players.