tl;dr: Don’t use Node 7 until this PR is released.
We’ve been working for quite some time to try and transition the Trello Server codebase to Node v7. Initially this was prevented by our quite old version of Mongoose, which got slower (by about 20%) when upgrading. After working through this and fixing a few more bugs along the way, Mongoose 4 and node v7 had been in production for about a week when our MongoDB monitoring alerted us to a long running query due to a full table scan.
The full table scan happened because Mongoose did not apply the “where-clause” to the final query sent to MongoDB, leading to a find and sort across the whole collection. Looking at the query that caused the scan, we determined that there was no way that it could be produced by our code if it was running properly; putting this together with a few very puzzling Sentry errors from earlier in the week, we realized to our horror that Mongoose was, very infrequently, treating query conditions as though they were null when they in fact were objects.
This assumption led us to immediately back out the Mongoose and Node v7 update to make sure that this would not corrupt any data in a non-revertible way, or expose any data where it should not be exposed.
Isolating The Error
After carefully reviewing the error reports, we found that we only hit the condition under Node 7, and that during the period (several days) when we were running Mongoose 4 before bumping to Node 7, we hadn’t seen it. This gave us the confidence to restore Mongoose 4 in production, and we focused on finding the Node version with the bug. We found only one other reference on the internet to someone else seeing the bug, so we were at least confident that we weren’t seeing something specific to our environment.
The errors we had seen in production were in very high-traffic paths, so we surmised that this was probably a ‘one-in-a-million’ bug that would only manifest occasionally. After failing to reproduce the issue on a local development environment by tightly looping the Mongoose code that we thought was to blame, we threw the entire power of our staging environment at it, doing the minimal query to reproduce the issue while applying some stress via simulated traffic to the servers using our Hammer service. Unfortunately, even then, it would take hours to reproduce it, and it failed to reproduce under ‘–inspect’, so we couldn’t attach a debugger and were confined to logging state when we hit the error. This was consistent with what the StackOverflow post reported, an object that was not null but toString’d to
Finally, a shot-in-the-dark Google search led us to the solution:
"[object Null]" v8 monorail (monorail being the V8 issue tracker), which revealed an issue the v8 team found already with their fuzzer. It’s worth noting that our previous searches of the V8 and Node issue trackers with similar terms had turned up nothing - site-specific Google search FTW. Chances are, had we not found this V8 issue, we’d still be stuck. Until node 8.0.0 of course, where the used V8 version had the whole file with the bug rewritten.
Something Very Strange
When you call a method on Mongoose’s query object (like
db.collection.count(conds)), Mongoose checks if you provide additional conditions (because it has to merge the provided conditions into previously provided conditions). It does this by comparing
Object.prototype.toString.call(conds) to “[object Object]” and does not include the conditions if that check fails.
However, the V8 version used in the whole node v7 line contains the bug which is not yet fixed, leading
toString to return
"[object Null]". It seems like the bug only happens in a 64bit process and is dependent on 32bits of pointer values being equal to one specific value (or 2, if you include undefined), causing it to only happen randomly and very seldom.
Why It’s Dangerous For Your Code, Too
toString of an object against