
Saturday, 14 September 2013

MongoDB Aggregation vs Map Reduce

According to the release notes of MongoDB 2.4, the JavaScript engine has been changed from SpiderMonkey to V8. This allows multiple JavaScript operations to run concurrently and is expected to improve the performance of MongoDB's map-reduce. Even so, the Aggregation Framework is expected to outrun Map Reduce hands down, since the former runs as compiled C++ code while the latter has to rely on the JavaScript interpreter and on conversions between JSON and BSON to load the dataset and store results.
In any case, it is always an interesting exercise to attempt the same problem with both the Aggregation Framework and Map Reduce. Towards this end, I picked up a sample MongoDB dataset named images, with 90,017 records, with the objective of counting the number of times a particular tag occurs across the image records.

> db.images.count();
90017
> db.images.findOne();
{
        "_id" : 1,
        "height" : 480,
        "width" : 640,
        "tags" : [
                "dogs",
                "cats",
                "kittens",
                "vacation",
                "work"
        ]
}

The objective is to count the number of times a particular tag occurs in the data set.

In the Aggregation Framework, this could be achieved by:

var c = db.images.aggregate([
    { $unwind: "$tags" },
    { $group: { _id: { Tag: "$tags" }, Tag_Count: { $sum: 1 } } },
    { $sort: { Tag_Count: -1 } },
    { $limit: 5 }
]);
printjson(c);
Results of the Aggregation Query
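For readers without a mongod instance handy, the effect of $unwind, $group, $sort and $limit can be sketched in plain JavaScript. The docs array below is made-up stand-in data, not the real images collection:

```javascript
// A tiny stand-in for the images collection (hypothetical data).
const docs = [
  { _id: 1, tags: ["dogs", "cats"] },
  { _id: 2, tags: ["cats", "work"] },
  { _id: 3, tags: ["cats"] }
];

// $unwind: emit one record per element of the tags array.
const unwound = docs.flatMap(d => d.tags.map(t => ({ _id: d._id, tags: t })));

// $group with { $sum: 1 }: count the occurrences of each tag.
const counts = {};
for (const rec of unwound) {
  counts[rec.tags] = (counts[rec.tags] || 0) + 1;
}

// $sort by count descending, then $limit to the top 5.
const top = Object.entries(counts)
  .map(([tag, n]) => ({ _id: { Tag: tag }, Tag_Count: n }))
  .sort((a, b) => b.Tag_Count - a.Tag_Count)
  .slice(0, 5);

console.log(top);
// First entry is { _id: { Tag: 'cats' }, Tag_Count: 3 }
```

The shape of each result document (an _id holding the group key, plus the computed Tag_Count) mirrors what the pipeline above returns.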
The Map Reduce solution was as follows:
var map = function() {
    // Skip documents that have no tags array.
    if (!this.tags) {
        return;
    }
    // Emit a count of 1 for every tag on this document.
    for (var i = 0; i < this.tags.length; i++) {
        emit({ Tag: this.tags[i] }, 1);
    }
};

var reduce = function(key, values) {
    // values is the array of counts emitted for this key.
    return Array.sum(values);
};

var result = db.runCommand({
    mapreduce: "images",
    map: map,
    reduce: reduce,
    out: "tag_count"
});
printjson(result);

Results of querying the collection produced by Map Reduce
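The map/reduce flow can also be traced in plain JavaScript by simulating emit and the reduce phase. Note one deliberate difference: in the mongo shell, map refers to the current document via this, while here we pass it explicitly; the data is again made up:

```javascript
// Made-up stand-in data.
const docs = [
  { _id: 1, tags: ["dogs", "cats"] },
  { _id: 2, tags: ["cats"] }
];

// Collect emitted (key, value) pairs, grouped by key.
const emitted = {};
function emit(key, value) {
  const k = JSON.stringify(key);
  (emitted[k] = emitted[k] || []).push(value);
}

// The map logic from the post; the document is passed as a parameter
// here instead of being bound to `this` as in the mongo shell.
function map(doc) {
  if (!doc.tags) return;
  for (const t of doc.tags) emit({ Tag: t }, 1);
}

// The reduce phase: called once per key with all its emitted values.
// Array.sum in the shell is just a sum over the array.
function reduce(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

docs.forEach(d => map(d));
const result = Object.entries(emitted)
  .map(([k, vals]) => ({ _id: JSON.parse(k), value: reduce(k, vals) }));

console.log(result);
// "cats" reduces to 2, "dogs" to 1
```

This is why the reduce function must be able to run on partial results: MongoDB may call it several times per key, feeding earlier outputs back in as values.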


Saturday, 17 August 2013

MongoDB - A database for Big Data

The need to manage large, rapidly growing datasets has always existed. With social networking sites generating millions of lines of data every minute, the ability to store, process and analyse large datasets is not only required but also critical for surviving in the race to remain relevant. This scenario has given an impetus to the rise of NoSQL databases. NoSQL, or "Not Only SQL", databases can be divided into 4 main categories :
  1. Key-Value databases : Voldemort
  2. Graph databases : InfiniteGraph, Neo4J
  3. Document databases : CouchDB, MongoDB
  4. Column Family stores : Cassandra, HBase, Hypertable
  • MongoDB is a non-relational JSON document store, or database. It doesn't support the relational algebra that is most often expressed as SQL.
  • Documents are expressed as JSON and are stored within the MongoDB database in the BSON (Binary JSON) format (bsonspec.org).
  • BSON supports all the data types available in JSON and a few more, such as dates and binary data. It can store strings, floating-point numbers, arrays, objects and timestamps.
  • MongoDB supports documents in the same collection that do not have the same schema. This is referred to as supporting dynamic schema, or being schemaless.
  • There is no SQL, no transaction management and no JOINs.
  • The absence of multi-document transaction management and JOINs makes MongoDB better suited for scalability and performance.
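Dynamic schema simply means two inserts into the same collection can carry different fields. The idea can be mimicked with plain JavaScript objects (the users data below is hypothetical):

```javascript
// Two "documents" in the same collection, with different shapes.
const users = [
  { _id: 1, name: "alice", email: "alice@example.com" },     // has an email field
  { _id: 2, name: "bob", phones: ["555-0100", "555-0199"] }  // no email, extra array
];

// Roughly what db.users.find({ email: { $exists: true } }) would match:
const withEmail = users.filter(u => u.email !== undefined);

console.log(withEmail.map(u => u.name)); // [ 'alice' ]
```

Neither document is "wrong": the collection imposes no schema, and queries simply skip documents lacking the queried field.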


Problems with traditional RDBMSs and the need for MongoDB-style NoSQL databases :
Traditional RDBMSs are based on concepts that promote strong referential integrity, data normalization and transaction management. This implies that every data model will have several database tables, and satisfying a query might require a JOIN. JOINs are not your best friend when you are looking for speed and performance.
Transaction management is another key area that provides reliability and data consistency. However, it also leads to a drop in performance.
MongoDB does not use JOINs or provide transaction management. This results in a performance boost, and data access is very fast compared to traditional RDBMSs.
While multi-document transaction management is not supported, MongoDB does guarantee atomic updates: updates affecting one document only are atomic, and that one document may contain other sub-documents.
MongoDB uses binary-encoded JSON (BSON), which stores different data types efficiently. Since no storage is allocated for fields that do not exist in a particular document, documents without a complete set of fields yield significant storage savings compared to RDBMSs, where space must be reserved in every row for every field, whether populated or null.
Large document sets can also be split (or sharded) across multiple servers and automatically redistributed when additional servers are added, providing further scalability.
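Sharding partitions a collection into chunks by ranges of a shard key, and a router sends each document to the shard owning its range. A toy range router in plain JavaScript (the chunk ranges and shard names are invented for illustration):

```javascript
// Hypothetical chunk ranges for a numeric shard key.
// Each chunk covers [min, max) and lives on one shard.
const chunks = [
  { min: -Infinity, max: 100,     shard: "shard0" },
  { min: 100,       max: 200,     shard: "shard1" },
  { min: 200,       max: Infinity, shard: "shard2" }
];

// Route a shard-key value to the shard owning its chunk.
function routeToShard(key) {
  return chunks.find(c => key >= c.min && key < c.max).shard;
}

console.log(routeToShard(42));  // shard0
console.log(routeToShard(150)); // shard1
```

When a new server is added, the real balancer migrates whole chunks between shards; the routing table is simply updated, so clients never need to know where the data physically lives.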
Real world use cases : SAP, Sourceforge, MTV, Twitter (http://www.mongodb.org/about/production-deployments/)

Thus, MongoDB is all about performance, scalability and speed. However, there are still some scenarios where MongoDB might not be the best fit. These are as follows:
1. Not suited for applications requiring multi-document transaction management.
2. Designed to work behind firewalls, so it offers fewer security features relative to RDBMSs.
3. Documents in MongoDB are limited to 16MB. Anything larger, such as big files, has to be broken up, for example via GridFS, which splits a file across multiple documents.

Wednesday, 3 July 2013

M101J : MongoDB for Java Developers by 10Gen

Recently, I undertook the MongoDB for Java Developers course provided by 10gen Education. The course runs for 7 weeks with an estimated effort of 10 hours per week. The course material was delivered online via videos, there were weekly assignments, and at the conclusion of the 7th week there was an exam.

As a part of the course-ware, each week brought several short video lectures discussing topics ranging from the installation of MongoDB to sharding and replication. Each video ran from 2 to 8 minutes, and it took around 2-4 hours to get through all the videos for a given week and complete the associated quiz questions. Every week there was also a set of homework questions to be completed before the end of the week. These homework questions carried a 50% weighting towards the final score, while the final exam made up the other 50% of the course grade.

Overall my experience with the course was very positive. The course has been well designed, and the instructors are closely involved in the development of MongoDB at 10gen, so they have a very clear understanding of its internals. I had no experience with MongoDB prior to undertaking this course, and by the end of the 7th week I was confident in setting up replica sets and sharded servers.
The only irritant was that it became clear 10gen Education were running the course for the first time for Java developers: there were a few cases where they mixed up data sets and code from the Python version of the course, which led to confusion among the students. An unfortunate miss was incorrectly marking a correct answer wrong in the final exam, which raised the ire of the students, but most of these problems were ironed out within a day. Apart from these issues, the course material and teaching were very good, and if one is looking to learn MongoDB, this course would be a good place to start.

Finally, my exam score: I got 90% in the course. The 10% I lost was in the final exam, where I got 2 questions wrong. According to the stats released by 10gen, "Of the 7,105 students enrolled, 1,434 students completed the course successfully, a completion rate of 20%." To achieve a grade of completion, one needed a mark above 65%.