Saturday 14 September 2013

MongoDB Aggregation vs Map Reduce

According to the release notes of MongoDB 2.4, the JavaScript engine has been changed from SpiderMonkey to V8. This is expected to let multiple JavaScript operations, Map Reduce jobs included, run concurrently and to improve their performance. Even so, the Aggregation Framework is expected to outperform Map Reduce hands down: the former runs as compiled C++ code, while the latter has to rely on the JavaScript engine and must convert documents between BSON and JavaScript objects to load the dataset and store the results.
In any case, it is always an interesting exercise to answer the same problem with both the Aggregation Framework and Map Reduce. Towards this end I picked up a sample MongoDB dataset named images, which holds 90,017 records, with the objective of counting the number of times each tag occurred across the image records.

> db.images.count();
90017
> db.images.findOne();
{
        "_id" : 1,
        "height" : 480,
        "width" : 640,
        "tags" : [
                "dogs",
                "cats",
                "kittens",
                "vacation",
                "work"
        ]
}

Each document carries a tags array, so the task is to count how many times each tag occurs across the whole data set and to list the most frequent ones.

In the Aggregation Framework, this could be achieved by:

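var c = db.images.aggregate([
  { $unwind: "$tags" },                                           // one document per tag
  { $group: { _id: { Tag: "$tags" }, Tag_Count: { $sum: 1 } } },  // count documents per tag
  { $sort: { Tag_Count: -1 } },                                   // most frequent first
  { $limit: 5 }                                                   // keep the top five
]);
printjson(c);

Results of the Aggregation Query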
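
To make the four stages of the pipeline concrete, here is a rough plain-JavaScript sketch of what each one does to the documents. It is only an illustration that runs outside MongoDB: the sampleImages array and the function names unwindTags, groupByTag and topTags are made up for this example, and the real stages are implemented in C++ inside the server.

```javascript
// Made-up sample data standing in for the images collection.
const sampleImages = [
  { _id: 1, tags: ["dogs", "cats", "work"] },
  { _id: 2, tags: ["cats", "vacation"] },
  { _id: 3, tags: ["cats", "dogs"] },
  { _id: 4 } // no tags field at all
];

// Rough equivalent of { $unwind: "$tags" }: one output document per array element.
function unwindTags(docs) {
  const out = [];
  for (const doc of docs) {
    // Skip documents without a tags array (for a missing field,
    // $unwind likewise emits nothing).
    if (!Array.isArray(doc.tags)) continue;
    for (const tag of doc.tags) {
      out.push({ _id: doc._id, tags: tag });
    }
  }
  return out;
}

// Rough equivalent of the $group stage with { $sum: 1 }: count documents per tag.
function groupByTag(docs) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc.tags, (counts.get(doc.tags) || 0) + 1);
  }
  return Array.from(counts, ([tag, n]) => ({ _id: { Tag: tag }, Tag_Count: n }));
}

// Rough equivalent of $sort on Tag_Count descending followed by $limit.
function topTags(docs, limit) {
  const grouped = groupByTag(unwindTags(docs));
  grouped.sort((a, b) => b.Tag_Count - a.Tag_Count);
  return grouped.slice(0, limit);
}

console.log(JSON.stringify(topTags(sampleImages, 2)));
// prints [{"_id":{"Tag":"cats"},"Tag_Count":3},{"_id":{"Tag":"dogs"},"Tag_Count":2}]
```

The point of the sketch is that $unwind multiplies the documents (seven here, from three tagged images) before $group collapses them again, which is why the order of the stages matters.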
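
The Map Reduce solution was as follows:

// Map: emit one (tag, 1) pair for every tag on the document.
map = function() {
    // Skip documents that have no tags array.
    if (!this.tags) {
        return;
    }
    // Index-based loop; var keeps i out of the global scope.
    for (var i = 0; i < this.tags.length; i++) {
        emit({ Tag: this.tags[i] }, 1);
    }
};

// Reduce: add up the counts emitted for one tag. MongoDB may call this
// more than once per key, feeding earlier results back in as values.
reduce = function(key, values) {
    return Array.sum(values);
};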

result = db.runCommand({
    "mapreduce" : "images",
    "map" : map,
    "reduce" : reduce,
    "out" : "tag_count"
});
printjson(result);

Results of querying the collection produced by Map Reduce
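
For comparison, the map and reduce logic above can be tried out without a server using a small in-memory driver. This is a sketch under several assumptions: runMapReduce, mapTags, reduceCounts, sum and sampleDocs are hypothetical names invented here, emit is passed in as an argument instead of being a global, the document is a parameter rather than this, and keys are plain strings because a JavaScript Map compares object keys by identity.

```javascript
// Made-up sample data standing in for the images collection.
const sampleDocs = [
  { _id: 1, tags: ["dogs", "cats", "work"] },
  { _id: 2, tags: ["cats", "vacation"] },
  { _id: 3, tags: ["cats", "dogs"] },
  { _id: 4 } // no tags field
];

// Stand-in for the mongo shell's Array.sum helper.
function sum(values) {
  return values.reduce((a, b) => a + b, 0);
}

// Same logic as the map function, with the document and emit passed in.
function mapTags(doc, emit) {
  if (!doc.tags) return;
  for (let i = 0; i < doc.tags.length; i++) {
    emit(doc.tags[i], 1);
  }
}

// Same logic as the reduce function.
function reduceCounts(key, values) {
  return sum(values);
}

// Hypothetical driver: collect emitted values per key, then reduce each key.
function runMapReduce(docs, map, reduce) {
  const emitted = new Map();
  for (const doc of docs) {
    map(doc, (key, value) => {
      if (!emitted.has(key)) emitted.set(key, []);
      emitted.get(key).push(value);
    });
  }
  const out = [];
  for (const [key, values] of emitted) {
    // As in MongoDB, reduce is not called for a key with a single value.
    const value = values.length === 1 ? values[0] : reduce(key, values);
    out.push({ _id: { Tag: key }, value: value });
  }
  return out;
}

const counts = runMapReduce(sampleDocs, mapTags, reduceCounts);
counts.sort((a, b) => b.value - a.value);
console.log(JSON.stringify(counts.slice(0, 2)));
// prints [{"_id":{"Tag":"cats"},"value":3},{"_id":{"Tag":"dogs"},"value":2}]

// Re-reducing partial results must give the same answer as reducing everything at once.
console.log(reduceCounts("cats", [reduceCounts("cats", [1, 1]), 1]) === reduceCounts("cats", [1, 1, 1]));
// prints true
```

Unlike the aggregation pipeline, Map Reduce writes a count for every tag to the output collection, so the top five still have to be selected afterwards. In the shell that would be something like db.tag_count.find().sort({ value: -1 }).limit(5), since each output document stores its count in the value field.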