Plugging USDA Nutrition Data into MongoDB

As much as I love geeking out about basketball stats, I want to put a MongoDB data set out there that’s a bit more app-friendly: the USDA SR25 nutrient database. You can download this data set from my S3 bucket here, and plug it into your MongoDB instance using mongorestore. I’m very meticulous about nutrition and have, at times, kept a food journal, but sites like FitDay and DailyBurn have far too much spam and are far too poorly designed to be a viable option. With this data set, I plan on putting together an open source web-based food journal in the near future. However, I encourage you to use this data set to build your own apps.

Data Set Structure

The data set contains one collection, ‘nutrition’. The documents in this collection contain merged data from the SR25 database’s very relational FOOD_DES, NUTR_DEF, NUT_DATA, and WEIGHT files. In more comprehensible terms, the documents contain a description of a food item, a list of nutrients with measurements per 100g, and a list of common serving sizes for that food. Here’s what the top level document for grass-fed ground bison looks like in RoboMongo, a simple MongoDB GUI:

The top level document is fairly simple: the description is a human-readable description of the food, the manufacturer is the company that manufactures the product, and survey is whether or not the data set has values for the 65 nutrients used for some government survey. However, the real magic happens in the nutrients and weights subdocuments. Lets see what happens when we open up nutrients:

You’ll see that there are an incredible amount of nutrients. The nutrients data is in an array, where each subdocument in the array has a tagname, which is a common scientific abbreviation for the nutrient, a human-readable description, and an amountPer100G with corresponding units. In the above example, you’ll see that 100 grams of cooked grass-fed ground bison contains about 25.45 g of protein.

(Note: the original data set includes some more detailed data, including standard deviations and sample sizes for the nutrient measurements, but that’s outside the scope of what I want to do with this data set. If you want that data, feel free to read through the government data set’s documentation and fork my converter on github.)

Finally, the weights subdocument is another array which contains sub-documents that describe common serving sizes for the food item and their mass in grams. In the grass-fed ground bison example, the weights list contains a single serving size, 3 oz, which approximately 85 grams:

Exploring the Data Set

First things first: since the nutrients for each food are in an array, its not immediately obvious what nutrients this data set has. Thankfully, MongoDB’s distinct command makes this very easy:

There are a lot of different nutrients in this data set. In fact, there are 145:

So how are we going to find nutrient data for a food that we’re interested in? Suppose we’re looking to find how many carbs are in raw kale. Pretty easy to do because MongoDB’s shell supports JavaScript regular expressions, so lets just find documents where the description includes ‘kale’:

Of course, this doesn’t include the carbohydrate content, so lets add a $elemMatch to the projection to limit output to the carbohydrates in raw kale:

Running Aggregations to Test Nutritional Claims

My favorite burger joint in Chelsea, brgr, claims that grass-fed beef has as much omega-3 as salmon. Lets see if this advertising claim holds up to scrutiny:

Right now, this is a bit tricky. Since I imported the data from the USDA as-is, total omega-3 fatty acids is not tracked as a single nutrient. The amounts for individual omega-3 fatty acids, such as EPA and DHA, are recorded separately. However, the different types of omega-3 fatty acids all have n-3 in the description, so it should be pretty easy to identify which nutrients we need to sum up to get total omega-3 fatty acids. Of course, when you need to significantly transform your data, its time to bust out the MongoDB aggregation framework.

The first aggregation we’re going to do is find the salmon item that has the least amount of total omega-3 fatty acids per 100 grams. To do that, we first need to transform the documents to include the total amount of omega-3’s, rather than the individual omega-3 fats like EPA and DHA. With the $group pipeline state and the $sum operator, this is pretty simple. Keep in mind that the nutrient descriptions for omega-3 fatty acids are always in grams in this data set, so we don’t have to worry about unit conversions.


You can get a text version of the above aggregation on Github. To verify brgr’s claim, lets run the same aggregation for grass-fed ground beef, but reversing the sort order.


Looks like brgr’s claim doesn’t quite hold up to a cursory glance. I’d be curious to see what the basis for their claim is, specifically if they assume a smaller serving size for salmon than for grass-fed beef.

Conclusion

Phew, that was a lot of information to cram into one post. The data set, as provided by the USDA, is a bit complex and could really benefit from some simplification. Thankfully, MongoDB 2.6 is coming out soon, and, with it, the $out aggregation operator. The $out operator will enable you to pipe output from the aggregation framework to a separate collection, so I’ll hopefully be able to add total omega-3 fatty acids as a nutrient, among other things. Once again, feel free to download the data set here (or check out the converter repo on Github) and use it to build some awesome nutritional apps.

 

 

Advertisements

Why Math is Necessary for CS Majors

While math and computer science have been lumped together for about as long as the latter has existed, there’s a lot of backlash recently toward the idea that a solid math background is integral to being a good developer. The relationship between the two was something that I struggled to grasp as an undergraduate in Computer Science. The relationship between math and CS isn’t as direct as, say, math and physics, or even philosophy and CS. However, taking a rigorous pure math course as an undergraduate will help you significantly, whether you choose to be an ivory tower academic, a developer for the latest hip startup out of Silicon Valley, or an engineer for a big NYC bank.

The reason why has nothing to do with learning what most people would call “practical skills.” Even as an undergraduate specializing in theory and theoretical computer vision, I realized that the limit of my practical use of mathematics was a high school-level understanding of linear algebra, some basic graph theory, and whatever I needed for big-O notation. While some advanced mathematics, like Galois Theory, have CS-related applications, you probably won’t use them in CS outside of the most closeted of ivory towers. I can honestly say that, in the 8 years since I got my first software engineering internship back in high school, I’ve never had to use anything I learned in undergrad Real Analysis or Galois Theory (thankfully, because I honestly deserved to fail Galois Theory). So clearly, when I completed the required courses to graduate as a math major, I was wasting my time, right? Wrong!

The fallacy in the above reasoning is that learning CS isn’t a video game styled tech tree. Just because understanding a proof of Stokes’ Theorem isn’t a strict prerequisite for being an effective developer, doesn’t mean that it doesn’t help. As Irish poet W.B. Yeats once wisely said, “education is not the filling of a pail, but the lighting of a fire.” Similarly, learning to be a developer isn’t about crossing off a checklist of practical skills and making your resume look like a buzzword bingo board. Learning to be a developer is about practice (which is why Allen Iverson wasn’t a software developer), and what a pure math class gives you is a slightly different environment in which to practice your skills. When you have spent a little time looking at both, you’ll realize that going through a pure math textbook and remembering the correct theorems and lemmas to use in a homework proof is pretty damn similar to figuring out which modules you need to effectively add some new functionality to your codebase.

Many engineers bemoan the lack of unit testing instruction in undergrad CS curricula, they forget that unit tests are only useful if you have the rigor to write them in the first place. Not only is the process of figuring out which theorems to use an exercise in dependency management, but the process of proving a theorem is similar to writing unit tests to prove the correctness of your module. A lot of developers nowadays have copped a bad attitude when it comes to writing proper unit tests, saying that their code is trivially correct. I bet these people haven’t sat down to prove that there are no rational numbers satisfying the equation x^2 = 2 either, and I think that a solid grounding in pure math can nip this weakness in the bud. This way, Rudin’s Principles of Mathematical Analysis, the bane of every freshman math major’s existence, is essentially a large codebase for you to practice on.

Similarly, graph theory is absolutely integral to the day-to-day of being a software developer, even if you’re not working with graphs directly. Speaking of modules, to an experienced developer, a well-organized codebase looks a lot like a graph. Pieces of code are bits of logic with dependencies which are references to other bits of logic, which, of course, intuitively maps to nodes and edges. All refactoring work comes down to just thinking about graphs and how to make your code graph comprehensible. Beyond this simple example of refactoring and managing dependencies in code, the applications of reasoning about graphs in software development are endless, from breaking a bulky task down into components, to thinking about points of failure in a sophisticated network topology. While I’ve never had to think about Hadwinger’s conjecture in a professional context, my undergrad Graph Theory course gave me a lot of valuable practice reasoning about graphs in a rigorous way. This practice continues to serve me well to this day, whether I’m trying to organize my dependencies in AngularJS, thinking about the topology of my MongoDB cluster, or just figuring what tasks I need to get done today.

Bottom line, kids out there, if you really want to be a successful software developer, taking proof-based math (and proof-based graph theory in particular) is an excellent step in the right direction. It won’t be easy, but becoming good at something never is.

Pure-mathematics-formulæ-blackboard