Now with GDPR, data leaks can result in fines in the hundreds of millions. Hackers don’t even need to be involved: one common cause is mixing public data with personally identifiable information (PII). This can lead to PII being accidentally leaked when the intent was only to display public data.
This blog post will show how a simple oversight while modelling a problem can lead to data being leaked. We will then show how to fix the design to make the leak impossible.
At Battlefy, we are building esports for competitive players and we want to give them the ability to express themselves. We want to promote people in esports and grow it all for everybody.
Furthermore, Battlefy’s clients are large and well-known companies. We have to ensure the esports events we run uphold our clients’ brands.
These two desires end up butting heads for features as simple as being able to upload a team logo. We want teams to be able to bolster their brand with a logo, but at the same time, we need to prevent trolls from uploading offensive images. These offensive images would hurt our clients’ brands.
We solve this problem by adding an approval process. Our esports operations staff approve each team’s logo before it is displayed publicly.
Here is a system diagram for the team logo approval process. Let’s describe each step in more detail.
… logoState is initialized to pending.
… logoState: 'pending'.
… logoState to approved.
… logoUrl field if logoState is not equal to approved. Failure to implement this logic will result in logoUrl leaking.
… logoUrl field, it loads the team logo. If the logoUrl field is missing, the frontend displays a placeholder image.

Step 8 is the critical step that can lead to a data leak. Notice how easy it would be to forget to implement this logic, and the team pages would still work. Also, notice how every single query in other parts of the system that loads teams must either never load all the fields or implement this same critical piece of logic. This is a landmine waiting for a victim.
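To make the landmine concrete, here is a minimal sketch of the check in step 8, next to a code path that forgets it. The type and function names (TeamDocument, PublicTeam, toPublicTeam, toPublicTeamLeaky) are made up for illustration and are not Battlefy’s actual code.

```typescript
// Hypothetical shape of a team document in the original design.
type LogoState = "pending" | "approved";

interface TeamDocument {
  id: string;
  name: string;
  logoUrl?: string;
  logoState?: LogoState;
}

// Hypothetical shape of what the frontend is supposed to receive.
interface PublicTeam {
  id: string;
  name: string;
  logoUrl?: string;
}

// The critical step: only expose logoUrl once the logo is approved.
// Every code path that returns teams has to remember to do this.
function toPublicTeam(team: TeamDocument): PublicTeam {
  const result: PublicTeam = { id: team.id, name: team.name };
  if (team.logoState === "approved" && team.logoUrl !== undefined) {
    result.logoUrl = team.logoUrl;
  }
  return result;
}

// A forgetful code path. It type-checks, because TeamDocument is
// structurally compatible with PublicTeam, and team pages still render,
// but a pending logoUrl goes straight to the frontend.
function toPublicTeamLeaky(team: TeamDocument): PublicTeam {
  return team;
}
```

Nothing in the compiler or in the rendered page flags the second function, which is why the check is so easy to miss.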
How do we fix this leak? One natural inclination would be to have all the images uploaded to the CDN be private and only make them public upon approval. But this makes it awkward for teams to see their own team logo while it is pending approval. Sure, we could add roles so that teams and admins can always read the images, but this is getting complicated. There is a better way.
The root cause of the problem is that logoUrl and logoState don’t belong on the team document. logoUrl is serving two different purposes. It is simultaneously the pendingLogoUrl and the approvedLogoUrl. pendingLogoUrl should never be returned to the frontend, whereas approvedLogoUrl should always be returned to the frontend.
We can fix this by moving logoUrl and logoState into their own collection. We could easily call this collection team logo review, but considering this problem isn’t a one-off, we can easily generalize this into a feature with just a few additional fields.
This updated system diagram shows how we segregated logoUrl and logoState into a new image review collection. It contains documents with a type field, and for our use case type: 'team logo' means there will be a teamID field.
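As a rough sketch of what such a document could look like, the types below model the image review collection with type as a discriminator. Only the fields named in this post (type, teamID, logoUrl, logoState) come from the design; the type names and helper functions are assumptions for illustration.

```typescript
type ReviewState = "pending" | "approved";

// A review document for a team logo. The `type` field tells us
// which extra fields to expect; 'team logo' implies a teamID.
interface TeamLogoReview {
  type: "team logo";
  teamID: string;
  logoUrl: string; // where the uploaded, not yet public, image lives
  logoState: ReviewState;
}

// Other kinds of image review would be added to this union.
type ImageReview = TeamLogoReview;

// The team document no longer carries logoUrl or logoState,
// so a query that loads whole team documents has nothing to leak.
interface Team {
  id: string;
  name: string;
}

// Hypothetical helper: a new upload always starts out pending.
function createTeamLogoReview(teamID: string, uploadedUrl: string): ImageReview {
  return { type: "team logo", teamID, logoUrl: uploadedUrl, logoState: "pending" };
}

// Hypothetical helper: what the admin tooling would list for approval.
function pendingReviews(reviews: ImageReview[]): ImageReview[] {
  return reviews.filter((review) => review.logoState === "pending");
}
```

Because Team has no logo fields at all, there is no projection to forget.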
Neither the player nor viewer backends ever have a reason to query the image review collection. Only the admin backend uses it to manage the approval process.
Instead of having the frontend rely on an optional logoUrl field on the team document, it simply loads the image at <cdn uri>/team/<team id>/logo.png. The frontend already has the team ID from when it received the list of team documents. If the team logo doesn’t exist in the CDN, the frontend shows a placeholder.
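A minimal sketch of that frontend logic might look like the following. The function names, the existsInCdn callback and the placeholder parameter are assumptions; in a browser the existence check would more likely be an image error handler that swaps in the placeholder.

```typescript
// Build the well-known logo location from the CDN base URI and the team ID.
function teamLogoUrl(cdnUri: string, teamId: string): string {
  const base = cdnUri.replace(/\/+$/, ""); // tolerate a trailing slash
  return `${base}/team/${encodeURIComponent(teamId)}/logo.png`;
}

// Pick what to display. `existsInCdn` is injected so this sketch runs
// without a network; it stands in for "did the image load successfully".
function logoSrc(
  cdnUri: string,
  teamId: string,
  existsInCdn: (url: string) => boolean,
  placeholderUrl: string
): string {
  const url = teamLogoUrl(cdnUri, teamId);
  return existsInCdn(url) ? url : placeholderUrl;
}
```

Note that the team document plays no part here beyond supplying the ID.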
<cdn uri>/team/<team id>/logo.png is populated after the admin approves the image. It is copied from the previously uploaded image.
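The approval step could be sketched as below, with an in-memory Map standing in for the CDN or object storage. The Map, the key layout of the uploaded image and the function names are assumptions for illustration; a real implementation would call the storage provider’s copy operation instead.

```typescript
interface TeamLogoReview {
  type: "team logo";
  teamID: string;
  logoUrl: string; // key of the uploaded image awaiting review
  logoState: "pending" | "approved";
}

// Public location the frontend loads, relative to the CDN root.
function publicLogoKey(teamID: string): string {
  return `team/${teamID}/logo.png`;
}

// Approve a review: copy the uploaded image to the public location and
// return the updated review. `store` is a stand-in for the CDN.
function approveTeamLogo(review: TeamLogoReview, store: Map<string, string>): TeamLogoReview {
  const image = store.get(review.logoUrl);
  if (image === undefined) {
    throw new Error(`uploaded image not found: ${review.logoUrl}`);
  }
  store.set(publicLogoKey(review.teamID), image);
  return { ...review, logoState: "approved" };
}
```

Until this function runs, nothing exists at the public location, so the frontend falls back to its placeholder.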
Comparing the two designs, note how the minor oversight of adding logoUrl and logoState to the team document led to a leak. This decision seems so trivial at the time, yet has dire consequences down the road.
To avoid this trap, one must think critically when designing schemas. Separate the concepts, and hence the collections. Putting everything into one document is sus. Err on the side of having too many collections rather than too few.
This example only covered a very minor leak. While this kind of leak would reflect poorly on Battlefy and could potentially hurt our clients, it would not constitute a GDPR violation.
For cases where PII is being handled, it is not as simple as segregating the data into different collections. Accidental leaks of the logoUrl kind would be prevented, but handling PII requires more auditing and security. One simply does not colocate normal data with PII data.
One solution would be to create a separate database to only house PII and severely limit access to it, even more so than the regular production database. It becomes much easier to handle PII data appropriately when one can put a wall around it.
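As a very rough sketch of that wall, the example below keeps PII in a separate store that can only be read through one narrow, audited method. Everything here is hypothetical: the PlayerProfile and PlayerPii fields, the PiiVault class and its audit log are illustrations of the idea, not a description of Battlefy’s systems, and a real deployment would enforce access at the database and network level as well.

```typescript
// Lives in the regular production database. No PII fields at all.
interface PlayerProfile {
  id: string;
  displayName: string;
}

// Lives in the separate PII database. The email field is a made-up example.
interface PlayerPii {
  playerId: string;
  email: string;
}

// Stand-in for the PII database: every read attempt must state who is
// asking and why, and is recorded in an audit log.
class PiiVault {
  private records = new Map<string, PlayerPii>();
  readonly auditLog: string[] = [];

  put(record: PlayerPii): void {
    this.records.set(record.playerId, record);
  }

  read(playerId: string, accessor: string, reason: string): PlayerPii | undefined {
    if (reason.trim() === "") {
      throw new Error("a reason is required to read PII");
    }
    this.auditLog.push(`${accessor} read ${playerId}: ${reason}`);
    return this.records.get(playerId);
  }
}
```

Code that only works with PlayerProfile cannot leak an email by accident, because it never holds one.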
Do you want to sweat the details on every single database field? You’re in luck, Battlefy is hiring.