A NodeJS Perspective on What’s New in MongoDB 2.6, Part II: Aggregation $out

From a performance perspective as well as a developer productivity perspective, MongoDB really shines when you only need to load one document to display a particular page. A traditional hard drive only needs one sequential read to load a single MongoDB document, which limits your performance overhead. In addition, much like how Nas says life is simple because all he needs is one mic, grouping all the data for a single page into one document makes understanding and debugging the page much simpler.

A place where the one-document-per-page heuristic is particularly relevant is pages that display historical data. Loading a single user object is fast and simple, but running an aggregation to compute the average number of times per month a user performed a certain action over the last 6 months is a costly operation that you don’t necessarily want to do on demand. NodeJS devs are spoiled in this regard, because scheduling in NodeJS is extremely simple: you can easily schedule these aggregations to run once per day and avoid the performance overhead of running the aggregation every time a user hits the particular page.

However, before MongoDB 2.6, shipping the results of an aggregation into a separate collection required pulling the results out through the NodeJS driver and inserting them back into MongoDB. Furthermore, aggregation results were limited to 16MB in size, which made it impossible to do aggregations that output one document per user. MongoDB 2.6 introduced an $out aggregation pipeline stage, which writes the output of the aggregation to a separate collection, and removed the 16MB limit on aggregation results.

Getting transformed data $out of aggregation

Let’s take a look at how this can be used in practice in NodeJS. Recall the food journal app from the first part of this series: let’s add a route that displays the user’s average calories per day, broken down on a per-week basis. This involves a slow and complex aggregation, so we’ll schedule it to run once per day and dump its results into a new collection using $out. The data for this route gets recomputed for all users with a single aggregation, and each time a user hits the API endpoint, all the server has to do is read one document. Here’s what the aggregation looks like in NodeJS (you can copy/paste this aggregation pipeline into a mongo shell and get the same result). You can also find this code on GitHub.

mongodb.connection().collection('days').aggregate([
  // Pull out week of the year and day of the week from the date
  {
    $project : {
      week : { $week : "$date" },
      dayOfWeek : { $dayOfWeek : "$date" },
      year : { $year : "$date" },
      user : "$user",
      foods : "$foods"
    }
  },
  // Generate a document for each food item
  {
    $unwind : "$foods"
  },
  // And for each nutrient
  {
    $unwind : "$foods.nutrients"
  },
  // Only care about calories
  {
    $match : {
      'foods.nutrients.tagname' : 'ENERC_KCAL'
    }
  },
  // Add up calories for each week, keeping track of how many days in that
  // week the user recorded eating something. Output one document per
  // user and week.
  {
    $group : {
      _id : {
        week : "$week",
        user : "$user",
        year : "$year"
      },
      days : { $addToSet : '$dayOfWeek' },
      calories : {
        $sum : {
          $multiply : [
            '$foods.nutrients.amountPer100G',
            { $divide : ['$foods.selectedWeight.grams', 100] }
          ]
        }
      }
    }
  },
  // Aggregate all the documents on a per-user basis.
  {
    $group : {
      _id : "$_id.user",
      weeks : { $push : "$_id.week" },
      yearForWeek : { $push : "$_id.year" },
      daysPerWeek : { $push : "$days" },
      caloriesPerWeek : { $push : "$calories" }
    }
  },
  // Output to the 'weekly_calories' collection
  {
    // Hardcode string here so can copy/paste this aggregation into shell
    // for instructional purposes.
    $out : 'weekly_calories'
  }
], callback);

The particular details of the aggregation aren’t that important; what really matters is the $out stage at the end. The $out stage does something very cool: not only do the resulting documents get inserted into a new collection called weekly_calories, but $out also overwrites the existing collection once the aggregation completes. In other words, if this aggregation runs for an hour, the weekly_calories collection remains unchanged until the aggregation is done. Once the aggregation finishes, the weekly_calories collection is atomically replaced by the result of the aggregation. Note that, as of 2.6, $out doesn’t have any way of appending to the output collection; it can only overwrite it. Design your aggregations accordingly.
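To make the one-document-read payoff concrete, here’s a sketch of what a single precomputed document in weekly_calories should look like once the aggregation has run. The field names follow directly from the final $group stage above; the _id value and the numbers are made up for illustration, and this assumes the user field holds the username:

// In the mongo shell: each page load boils down to a single findOne
db.weekly_calories.findOne({ _id : 'alice' });
{
  "_id" : "alice",
  "weeks" : [ 14, 15 ],
  "yearForWeek" : [ 2014, 2014 ],
  "daysPerWeek" : [ [ 2, 3, 5 ], [ 1, 3, 4, 6 ] ],
  "caloriesPerWeek" : [ 6214.5, 8102.75 ]
}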

Taking a look at the results

Using a bit of NodeJS magic, we can wrap this aggregation in a service that uses node-cron to schedule itself to run once per day at 0030 (12:30 am) server time:

[Screenshot: the cron-scheduled aggregation service code]
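A minimal sketch of what such a service might look like follows. It assumes the classic node-cron module (published on npm as cron); the get method and the _id-based lookup are guesses at the app’s wiring rather than its actual code:

// weeklyCalorieAggregator.js -- illustrative sketch, not the app's actual code.
// Assumes the node-cron module (npm package 'cron') and the same mongodb
// connection wrapper used for the aggregation above.
var CronJob = require('cron').CronJob;

module.exports = function(mongodb) {
  // Stand-in for the full aggregation pipeline shown above, ending in $out
  var pipeline = [ /* $project, $unwind, $match, $group, $out stages */ ];

  var runAggregation = function(callback) {
    mongodb.connection().collection('days').aggregate(pipeline, callback);
  };

  // '0 30 0 * * *' = second 0, minute 30, hour 0: every day at 00:30
  var job = new CronJob('0 30 0 * * *', function() {
    runAggregation(function(error) {
      if (error) {
        console.error('Weekly calorie aggregation failed', error);
      }
    });
  });
  job.start();

  return {
    // Each API request only has to read one precomputed document
    get : function(username, callback) {
      mongodb.connection().collection('weekly_calories').
        findOne({ _id : username }, callback);
    }
  };
};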

We can then inject this service into an ExpressJS route and expose the route as a GET /api/weekly JSON API endpoint:

// app.js
app.get('/api/weekly', checkLogin, api.byWeek.inject(di));

// api.js
exports.byWeek = function(weeklyCalorieAggregator) {
  return function(req, res) {
    weeklyCalorieAggregator.get(req.user.username, function(error, doc) {
      res.json(doc);
    });
  }
};

A little extra work (git diff) to put together a UI that displays the data from GET /api/weekly gives a very satisfying result:

[Screenshot: the food journal UI displaying the weekly calorie data]

NodeJS Project Version Compatibility

Good news: this time around, the latest versions of node-mongodb-native (1.4.2), mquery (0.6.0), and mongoose (3.8.8) all support $out in aggregation. I’ve run the above aggregation with versions 1.3 and 1.2 of node-mongodb-native and version 3.6 of mongoose, and those handle $out correctly too.

Conclusion

MongoDB 2.6’s improvements to the aggregation framework are a quantum leap forward and enable you to do some amazing things. While scheduled analytics calculations certainly aren’t the only use case for $out, I hope this post showed you how $out lets you play to MongoDB’s strengths in a new way.

This is Part II of a 3-part series on using new MongoDB 2.6 features in NodeJS. Part III of this series is coming up in 2 weeks, in which I’ll take a look at some of MongoDB 2.6’s query framework improvements, primarily index filters.

A NodeJS Perspective on What’s New in MongoDB 2.6, Part I: Text Search

MongoDB shipped the newest stable version of its server, 2.6.0, this week. This new release is massive: there were about 4000 commits between 2.4 and 2.6. Unsurprisingly, the release notes are a pretty dense read and don’t quite convey how cool some of these new features are. To remedy that, I’ll dedicate a couple of posts to putting on my NodeJS web developer hat and exploring interesting use cases for new features in 2.6. The first feature I’ll dig into is text search, or, in layman’s terms, Google for your MongoDB documents.

Text search was technically available in 2.4, but it was an experimental feature and not part of the query framework. Now, in 2.6, $text is a full-fledged query operator, enabling you to search for documents by text in 15 different languages.

Getting Started With Text Search

Let’s dive right in and use text search on the USDA SR-25 data set described in this post. You can download a mongorestore-friendly version of the data set here. The data set contains 8194 food items with associated nutrition data, and each food item has a human-readable description, e.g. “Kale, raw” or “Bison, ground, grass-fed, cooked”. Ideally, as a client of this data set, we shouldn’t have to remember whether we need to enter “Bison, grass-fed, ground, cooked” or “Bison, ground, grass-fed, cooked” to get the data we’re looking for. We should just be able to put in “grass-fed bison” and get reasonable results.

Thankfully, text search makes this simple. In order to do text search, we first need to create a text index on our copy of the USDA nutrition collection. Let’s create one on the food item’s description:


db.nutrition.ensureIndex({ description : "text" });

Now, we can search the data set for our “raw kale” and “grass-fed bison”, and see what we get:


db.nutrition.find(
  { $text : { $search : "grass-fed bison" } },
  { description : 1 }).
    limit(3);

db.nutrition.find(
  { $text : { $search : "raw kale" } },
  { description : 1 }).
    limit(3);


Unfortunately, the results we got aren’t that useful, because they’re not in order of relevance. Unless we explicitly tell MongoDB to sort by the text score, we probably won’t get the most relevant documents first. Thankfully, with the help of the new $meta keyword (which is currently only useful for getting the text score), we can tell MongoDB to sort by text score as described here:

db.nutrition.find(
  { $text : { $search : "raw kale" } },
  { description : 1, textScore : { $meta : "textScore" } }).
    sort({ textScore : { $meta : "textScore" } }).
    limit(3);

Using Text Search in NodeJS

First, an important note on the compatibility of text search with NodeJS community projects: the MongoDB NodeJS driver is compatible with text search going back to at least 1.3.0. However, only the latest version of mquery, 0.6.0, is compatible with text search. By extension, the popular ODM Mongoose, which relies on mquery, unfortunately doesn’t have a text-search-compatible release at the time of this blog post. I pushed a commit to fix this, and the next version of Mongoose, 3.8.9, should allow you to sort by text score. In summary, to use MongoDB text search, here are the version restrictions:

MongoDB NodeJS driver: >= 1.4.0 is recommended, but it seems to work going back to at least 1.2.0 in my personal experiments.

mquery: >= 0.6.0.

Mongoose: >= 3.8.9 (unfortunately not released yet as of 4/9/14; a sketch of the expected syntax follows below).
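Since 3.8.9 isn’t out yet, here’s a sketch of what text search with score-based sorting should look like through Mongoose once it lands. The FoodItem model and connection details are assumptions made for illustration, not code from the food journal app:

// Sketch only: assumes Mongoose >= 3.8.9
var mongoose = require('mongoose');
mongoose.connect('mongodb://localhost:27017/nutrition');

// Hypothetical model over the SR-25 'nutrition' collection
var FoodItem = mongoose.model('FoodItem',
  new mongoose.Schema({ description : String }), 'nutrition');

FoodItem.
  find(
    { $text : { $search : 'grass-fed bison' } },
    { score : { $meta : 'textScore' } }
  ).
  sort({ score : { $meta : 'textScore' } }).
  limit(10).
  exec(function(error, foodItems) {
    // foodItems should come back ordered by descending text score
    console.log(foodItems);
  });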

Now that you know which versions are supported, let’s demonstrate how to actually do text search with the NodeJS driver. I created a simple food journal app (that is, an app that counts calories for you when you enter how much of a certain food you’ve eaten) that is meant to tie in to the SR-25 data set. This app is available on GitHub here, so feel free to play with it.

The LeanMEAN app exposes an API endpoint, GET /api/food/search/:search, that runs text search on a local copy of the SR-25 data set. The implementation of this endpoint is here; for convenience, it’s reproduced below, where the foodItem variable is a wrapper around the Node driver’s connection to the SR-25 collection.

/* Because MongooseJS doesn't quite support sorting by text search score
 * just yet, just use the NodeJS driver directly */
exports.searchFood = function(foodItem) {
  return function(req, res) {
    var search = req.params.search;
    foodItem.connection().
      find(
        { $text : { $search : search } },
        { score : { $meta : "textScore" } }
      ).
      sort({ score : { $meta : "textScore" } }).
      limit(10).
      toArray(function(error, foodItems) {
        if (error) {
          res.json(500, { error : error });
        } else {
          res.json(foodItems);
        }
      });
  };
};

Unsurprisingly, this code looks pretty similar to the shell version, so it should look familiar to you NodeJS pros 🙂
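For completeness, here’s a guess at how this handler might be mounted in the app’s Express routes; the route path comes from this post, while the inject(di)-style wiring is an assumption modeled on the weekly-calories route shown earlier in this series:

// app.js (illustrative; mirrors the dependency-injection wiring shown earlier)
app.get('/api/food/search/:search', api.searchFood.inject(di));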

Looking Forward

And that’s all on text search for now. In the next post (scheduled for 4/25), we’ll tackle some of the awesome new features in the aggregation framework, including text search in aggregation.