The underlying storage engine that powers Trello’s data environment was unhealthy and sluggish, so we invested time into fixing it with the hope that our future selves could then focus on the data and the people who need it. So far it’s working (for example, we reduced the overall data size by 62%), but it won’t work forever.
When Trello’s data engineering team formed in December 2017, it was met with a large backlog of requests: improvements to our existing data pipelines and new sources of information to surface.
However, there were signs that all was not well with Trello’s overall data environment. After some investigation, we decided to invest in the health of the Redshift database that powers our Mode reports before addressing the data request backlog.
Unhealthy Database: Symptoms
Some of the signs that Trello’s Redshift database cluster wasn’t in good health:
- Frequent long-running queries and reports that locked database resources, causing a pileup of work 🚦
- Complex queries that exceeded the default 45-minute time limit, causing Redshift to cancel them 😵
- An influx of requests to “pre-calculate” long-running queries (translation: adding code to our data pipelines and having to update it whenever these queries change) 😱
- Redshift was emitting a lot of query alerts 🚨
Unhealthy Database: Diagnosis
Redshift, How Does it Work?
When you’re running reports or writing queries, Redshift looks like a traditional relational database. Under the hood, however, there are some important differences:
- Redshift stores data by columns and not by rows.
- Redshift runs queries in parallel by dividing data and work among the slices that compose each node in a cluster.
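For example (using a hypothetical `events` table — the column names here are illustrative, not our actual schema), a columnar engine reads only the columns a query names, so an aggregate like this touches two columns instead of every field of every row:

```sql
-- Only the event_type and collector_tstamp columns are read from disk;
-- a row-oriented database would read entire rows to answer the same query.
SELECT event_type, COUNT(*)
FROM events
WHERE collector_tstamp >= '2018-01-01'
GROUP BY event_type;
```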
Once you grok these differences, Amazon’s Redshift best practices start to make sense:
- Distribute data evenly between nodes: maximizes the benefit of parallel work.
- Store related data on the same node: prevents shuffling data between nodes during table joins.
- Compress data: reduces I/O when you can’t avoid shuffling data between nodes.
- Sort for your workload: reduces the amount of data that a query needs to access (for example, sorting by event date means that queries looking for yesterday’s events can ignore the older information).
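As a sketch of what these practices look like in Redshift DDL (again, the table and column names are hypothetical):

```sql
-- Illustrative only: distribution, sorting, and compression declared up front.
CREATE TABLE events (
    event_id         CHAR(36)     ENCODE zstd,  -- compression reduces I/O
    user_id          CHAR(36)     ENCODE zstd,
    event_type       VARCHAR(128) ENCODE zstd,
    collector_tstamp TIMESTAMP    ENCODE zstd
)
DISTKEY (user_id)            -- related rows land on the same node (no shuffling on joins)
SORTKEY (collector_tstamp);  -- date-restricted queries can skip older blocks
```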
After some sleuthing, we determined that we were doing a good job with data distribution (#1 and #2 above). Compression and sorting, however, needed some love.
As of January, 2018:
- Our Redshift cluster had four 16 terabyte (TB) nodes for a total capacity of 64 TB.
- We were at 80% capacity.
- A single table contained 42% of our overall data (atomic.events, which captures every event emitted by Snowplow).
A closer look at that 36 gigabyte (GB) events table:
- It was 65% unsorted. Result: even queries that requested a small subset of events data had to scan all 60 billion rows.
- Many columns were compressed but weren’t taking advantage of newer techniques that would save additional space.
- It was growing rapidly…the table was 42 GB by the time we fixed it.
These findings also applied to many other large events-related tables, resulting in table scans, unnecessary shuffling of data, and slow queries that had a cascading effect on overall Redshift and Mode performance.
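One way to surface these findings is Redshift’s built-in `svv_table_info` system view, which reports each table’s size, row count, and how unsorted it is. A sketch of the kind of check we ran (requires access to system tables):

```sql
-- Find the largest, most-unsorted tables in the cluster.
SELECT "table",
       size     AS size_mb,       -- table size in 1 MB blocks
       tbl_rows,                  -- estimated row count
       unsorted AS pct_unsorted,  -- % of rows in the unsorted region
       pct_used                   -- % of available cluster space used
FROM svv_table_info
ORDER BY size DESC
LIMIT 10;
```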
Unhealthy Database: Cure
Once we understood the underlying problems, the steps for fixing them were straightforward. We code-named this work “Project Roomba” because the database process that re-sorts data is called a VACUUM.
- Give ourselves room to work by doubling the size of the Redshift cluster to eight 16 TB nodes.
- Replace problematic tables with new versions that use high data compression encoding. The process of copying data to the new tables also re-sorts the data, solving both problems.
- Keep tables sorted on an ongoing basis.
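Steps 2 and 3 can be sketched in SQL as a “deep copy” followed by routine maintenance. This is a minimal illustration with hypothetical table and column names, not our actual migration scripts:

```sql
-- Deep copy: rebuild the table with better encodings and fully
-- sorted data, then swap it in place of the original.
BEGIN;
CREATE TABLE events_new (
    event_id         CHAR(36)     ENCODE zstd,
    user_id          CHAR(36)     ENCODE zstd,
    event_type       VARCHAR(128) ENCODE zstd,
    collector_tstamp TIMESTAMP    ENCODE zstd
)
DISTKEY (user_id) SORTKEY (collector_tstamp);

INSERT INTO events_new           -- loading in sort-key order re-sorts the data
SELECT * FROM events
ORDER BY collector_tstamp;

ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_new RENAME TO events;
COMMIT;
DROP TABLE events_old;

-- Ongoing maintenance: periodically re-sort new rows and reclaim space.
VACUUM FULL events;
```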
As of April, 2018, when Project Roomba finished:
- Our Redshift cluster had eight 16 TB nodes for a total capacity of 128 TB.
- Our data size was 62% lower (20 TB, putting us at 16% of total cluster capacity). The extra space is itself a performance-booster, since it ensures adequate overhead for temp tables and other internal query work.
- The events table was 65% smaller at 15 GB.
- All events tables were 100% sorted.
Results and the Road Ahead
For lots of nerdy reasons, Project Roomba took several months. And there’s more to do.
- Our events table now has 85,020,697,646 rows. Implementing Redshift best practices bought us time, but we can’t keep the entire history of Trello events in a single table forever.
- We added a lot of snapshot data to support FY19 OKRs, and it needs some additional tuning.
That said, Project Roomba paid off:
- We no longer have to “kill” long-running queries that are impacting Mode reports.
- We don’t have to add workaround code for long-running queries into the data pipelines (less code, less problems!).
- Redshift is logging fewer query alerts.
- We have heard anecdotal reports of “supercharged” queries.
- We’re spending less data engineering time looking after Redshift and more time on creating and improving data pipelines.