Our last post outlining our tech stack was a mere four months after Trello launched. More than three years and 8 million members later, we’re serving 1300 requests per second. It’s safe to assume a few things have changed.
I know some of you read the last post, but let’s start from the top.
The web client is mostly separate from the server these days. If you’re interested in how this works, Doug’s got you covered. The short story is: if the server sees a GET request it can’t identify, you get a little chunk of html that represents the client.
HAProxy load balances between our web servers, as well as between processes on each server. It leaves connections open for a long time, which allows us to use WebSockets. HAProxy also handles our ssl termination (previously this was done in node).
Between processes on each server, you say? We had some trouble with Node’s cluster module delegating load balancing to the OS, which resulted in most connections going to one or two processes. Instead of using cluster to load balance between 32 workers, we run 16 instances of Trello on each server, each with two workers, and use a local HAProxy to do the balancing between them. Although this has since been fixed in Node, we don’t have a pressing reason to abandon our current setup. In fact, it actually performs better - if a single request bogs down a process, that process’s sibling will gladly handle all of the work delegated to the pair until the first recovers. Yes, this means we’re explicitly turning off the new round-robin behavior in
We use Redis for anything that needs to be shared between servers but not persisted to disk. This is split into two different instances, to allow for different needs.
Our cache server runs with
allkeys-lru enabled. Anything that we’d like to hold on to, but wouldn’t be too upset about losing if we don’t touch it for awhile, goes here. Among other things, our instant update queues reside here.
Our store server is expected to not run out of memory. This is for data that either expires or otherwise isn’t expected to grow without bound. The canonical use-case for this is synchronization between servers.
You may remember that we used to use Redis for PubSub, as well. This recently moved to RabbitMQ - see below.
MongoDB is our “real” database. It writes to disk and everything.
With Mongo, we initially went crazy with denormalization. It turns out, though, that without transactions it is pretty difficult to keep everything in sync. We’ve since moved to a mostly-normalized schema. However, there are a few key areas in which we’ve kept the denormalization for performance reasons. These are in insert-only (or nearly so) but ready-heavy areas, where we can get big gains without giving up consistency.
Although this has tailed off as Trello has stabilized, we picked up a lot of development speed early on from not needing to do database schema migrations. Among other things, this makes bisecting quicker than it would otherwise be, and means that pretty much all updates to the server could be easily rolled back.
We’ve had to shard our Mongo as our dataset has grown. We currently run 19 shards, each containing three nodes - a primary, a hot secondary, and a one-hour-delayed slave. The primary/hot secondary give us resilience to single node failures, while the delay slaves give us extra peace of mind about data-corruption scenarios, and is also where we take LVM snapshots.
We use sharding tags to determine where data should go. This allows us to keep our smaller collections unsharded, and to put collections with heavier access patterns on [shards that only belong to them, but worded better]. For example, four of those shards are responsible for a single collection.
Although we’ve had problems here and there with Mongo, at this point we seem to have it running smoothly. Or, at least, our hardy sysadmins are fixing things quickly enough that I don’t notice the problems.
- started with master/slave setup
- then a replicaset
- sharded replicasets
- 57 nodes
- 19 shards
- shard by collection
- no universal shard key
- mongo tagging - 4 action-only shards
- lvm snapshots of delayed nodes DO WE WANT TO TALK ABOUT THIS?
- issue - readahead set too high killed trello
- good description: http://www.kchodorow.com/blog/2012/05/10/thursday-5-diagnosing-high-readahead/
- mongo (the company) told us what was up
As I mentioned earlier, RabbitMQ is our replacement for Redis PubSub. Unfortunately, Redis PubSub just isn’t meant for the scale that our instant update infrastructure is moving towards. It has also been our biggest single point of failure for awhile, although to Redis’s credit it doesn’t seem to fail.
We have a few requirements that together make this a bit complicated:
- Fairly high message throughput - we currently peak in the high hundreds to low thousands per second.
- High subscribe/unsubscribe rate - WebSockets are constantly connecting/disconnecting. In order to avoid overloading the web servers with messages they don’t care about, we need proper routing.
- Lack of a shard key - each client opens a single socket and requests updates to potentially a dozen or more boards. There’s no way to ensure that each web server only has to deal with a known subset of the messages.
We’ve only had this running for a few weeks now, so I won’t go into too much detail. Look for a future post with specifics about how we’ve solved this.
- talk about use of queues for e2b, BC exports, etc?
By early 2014, there was a lot of data in Trello. MongoDB was handling a bare-bones search over card titles and descriptions, but the lack of a fulltext search over comments and checklists was becoming a real issue. We wanted to implement search in such a way that searching for a term in a card title combined with a term contained in a checklist on the card and a term in a comment would find the card: that is, we wanted to treat the card as the document level.
There were basically 3 choices: roll our own, use Apache Solr, or use ElasticSearch. Fog Creek was having good results with ElasticSearch and it was clearly moving forward, so we went with ES.
We had to decide how to do a combined search over a Card, its Checklists, and its Actions (which are modeled separately in our database). We could either use nested documents (and incur a higher indexing cost because the entire card parent and all subdocs would have to be reindexed for each new comment or checkitem on a card) or use parent-child relationships (and incur a much higher cost per search and memory usage). Since our update rate was very large and our search rate very small, we went with parent-child. We are waiting to see what happens with https://github.com/elastic/elasticsearch/issues/6107, and pending performance gain from that may move to nested docs.
We use an ElasticSearch river (a data-slurping java plugin to ElasticSearch) to tail the MongoDB oplogs for document updates and insert or update the corresponding values in ES. Rivers are being deprecated in ES 1.5. We agree that rivers are a bad idea and their deprecation is appropriate. However, we like the approach of taking updates from the MongoDB oplog, and will almost certainly move to a separate service that tails oplogs and writes to ES. [Update with talk about MongoStream]
Overall, ElasticSearch has been a good tool to offer the search experience we want to provide; however, our operational costs have been on the high side, due to the fairly new and still-buggy river software and possibly due to our choice to use parent-child relationships.
The Hammer is another microservice. It adds artificial load to our infrastructure, and shuts itself down if anything gets too slow. This means that we can see scaling thresholds coming from miles away - an ability that has proven invaluable over the years.
We also shut it down during server deploys, which are especially expensive thanks to the hundreds of thousands of WebSockets that have to reconnect. Knowing we have the extra capacity to handle the stress of a deploy means that we’re not averse to deploying often - about 4 times each day, on average.
The Hammer hits our infrastructure in a number of ways. The simplest is that it makes a GET request for a board - stressing our web servers and Mongo setup. It also does set/get round-trips in Redis and Mongo, and publish/subscribe round-trips in RabbitMQ.
It is designed to be easily configurable - we occasionally turn up the load generation to prevent it from being drowned out by our ever-increasing real usage.
The realtime aspect of Trello is key to the whole experience.
You may remember from our last post that we were using Socket.io but having some trouble with it. After bleeding all over it for months, we gave up and replaced it with something custom. We (and hopefully you) are happy with where we’ve ended up.
I won’t name names, but since launch we dropped support for the last browser version that doesn’t have WebSockets. This means that all of our clients open WebSockets. At peak, we’ll have nearly 400,000 of them open. The bulk of the work is done by the super-fast ws node module.
Of course, WebSockets are two-way. We have a (small) message protocol built on JSON for talking with our clients - they tell us which messages they want to hear about, and we send ‘em on down. This means that a client can use a single connection to keep a number of boards up-to-date.
As I mentioned, polling is largely deprecated. We try to avoid it as much as possible because the HTTP overhead makes it so much more expensive than the alternative. Our WebSocket implementation is stable and reliable these days, so we’re no longer worried about needing to fall back to polling.
However, this doesn’t mean we’re dropping our polling routes. They’re important even for clients using WebSockets - if a WebSocket gets severed for any reason, we make a single poll request after reconnecting to make sure we didn’t miss anything.
How’s this work? Every time we throw a message into the pubsub system, we also push it into a ring buffer in Redis. We then increment a counter so we know the index of the message we just pushed on. When we get a poll request, we check their last message index against the index stored in redis, and pull off the right number of messages. Only if they’ve missed more messages than the ring buffer holds (or the buffer has been LRU’d out of Redis) will the client have to do a full reload of its data.
THIS COULD HAVE A BETTER TITLE AND I HAVENT DECIDED WHAT TO WRITE HERE, BUT IT SEEMS LIKE AN IMPORTANT SECTION TO HAVE