MongoDB Basics & Internals
Big Data is one of today's buzzwords. It is usually characterized by the three Vs: Velocity, Volume and Variety of data. Google processes 24 petabytes of data per day; Facebook users upload 300 million photos and generate 3.2 billion likes/comments per day. These are examples of Velocity and Volume. "Variety" refers to the different types of data these applications handle; log files, for instance, contain unstructured or semi-structured data. This is not what an RDBMS is designed for: relational databases suit structured data and volumes that do not exceed a few terabytes. Since they are disk-based (IO-bound), they are relatively slow. Indexing is not a cure-all either; as data grows, inserts get slow. Joins become the big bottleneck, which pushes you towards de-normalization.
High-volume sites were mostly built on the LAMP (Linux, Apache, MySQL, PHP) stack. Load balancing was done at the web tier and application tier; however, this was difficult at the data tier. One common solution was Memcached, acting as a caching layer in front of the data tier, combined with a cluster of MySQL instances with the data split across them, and these clusters were replicated as well. Further optimizations were built on top of this: MySQL HandlerSocket was used to push around 500K requests per second. Another way to absorb high volumes was to use CDNs (Content Delivery Networks) such as Akamai, and HTTP accelerators like Varnish were used to handle web-tier load.
When Google published its papers on Bigtable and MapReduce, it changed the game. Doug Cutting, who was working with Yahoo at the time, built Hadoop, an open-source implementation of these ideas. Several NoSQL databases appeared around the same time. They can be broadly categorized as Document-Oriented (MongoDB, CouchDB), Graph (Neo4J) and Columnar (HBase, Cassandra) databases. MongoDB is a document-oriented database from 10gen. Foursquare, Shutterfly and Codecademy are a few of the customers who use MongoDB.
MongoDB works with JSON datatypes: each document is a JSON document, stored internally as BSON (a binary representation of JSON). It has no joins and no transactions. It supports geo-spatial queries, which is a USP for Foursquare in choosing MongoDB. Indexing, auto-sharding and replication support are a few more things MongoDB provides. I was able to configure my 3 laptops to store millions of user records: laptop-1 and laptop-2 as shards, with users A-M stored on laptop-1 and N-Z on laptop-2, and laptop-3 as a replica of laptop-1. Replication was immediate (less than a second), and even offline replication was simply awesome (bring down the replica and start it again, and it syncs up automatically). Replica sets support voting to elect a primary. There is also built-in support for Map-Reduce in Mongo, and it integrates with Hadoop as well. MongoDB ships with several tools; the important ones are mongod, mongos, mongo and mongostat.
MongoDB uses memory-mapped files to map its data files into virtual memory, which is why 64-bit machines are preferred. Internally the data is organized as a two-level linked list. When a database (e.g. movieLens) is created, three files are created automatically: movieLens.ns (16 MB), movieLens.0 (64 MB) and movieLens.1 (128 MB). The movieLens.ns file is a giant hash map that stores the metadata of the database and its collections (tables are called collections in MongoDB). The movieLens.0 and movieLens.1 files are data files; the numbers increase sequentially and the sizes grow exponentially until they reach 2 GB. The internal architecture and setup of MongoDB is a very interesting and long story (for the next post).
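To get a feel for the document model, here is a minimal, illustrative sketch using the 2.x Java driver; the collection name, field names and values are made up for this example.
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import java.util.Arrays;

public class DocumentModelDemo {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);   // connect to a running mongod
        DB db = mongo.getDB("demo");                   // databases and collections are created lazily
        DBCollection users = db.getCollection("users");

        // A document is just a JSON-like structure, stored internally as BSON
        DBObject user = new BasicDBObject("name", "john")
                .append("city", "Bangalore")
                .append("likes", Arrays.asList("movies", "cricket"));
        users.insert(user);

        // Query by example: find one document matching {name: "john"}
        System.out.println(users.findOne(new BasicDBObject("name", "john")));
        mongo.close();
    }
}
Note that there is no schema to declare up front; the document's structure is the schema.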
Movie Recommendation
We will build a movie recommendation system with MongoDB to handle this massive volume of data. I used the 10 million movie ratings dataset provided by MovieLens. Download the data (MovieLens 10M) from http://www.grouplens.org/node/73 and extract it to a location like C:\Project\Recommender. It consists of 3 main files: movies.dat, ratings.dat and tags.dat. We will load this data into MongoDB. For this, download MongoDB from http://www.mongodb.org/downloads and install it in a location like c:\mongodb. Create a directory c:\mongodb\data and start mongod with this command:
c:\mongodb\bin> mongod --dbpath c:\mongodb\data
Now we will run the following Java application to import the data into MongoDB. For this we need the MongoDB Java driver on the classpath and the JVM heap size set to 1 GB.
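The original import program is not reproduced here in full; below is a minimal sketch of such an importer, assuming the 2.x Java driver and the ::-delimited MovieLens file layout (MovieID::Title::Genres for movies.dat, UserID::MovieID::Rating::Timestamp for ratings.dat). The field names user_id, item_id, rating and genres are chosen to match the queries used later in this post; the class name and file paths are illustrative. Running it prints a summary similar to the output below.
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import java.io.BufferedReader;
import java.io.FileReader;

public class MovieLensImporter {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("movieLens");              // created on first insert

        // movies.dat: MovieID::Title::Genres
        DBCollection movies = db.getCollection("movies");
        int movieCount = 0;
        BufferedReader in = new BufferedReader(new FileReader("C:/Project/Recommender/movies.dat"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("::");
            movies.insert(new BasicDBObject("item_id", Integer.parseInt(f[0]))
                    .append("title", f[1])
                    .append("genres", f[2]));
            movieCount++;
        }
        in.close();
        System.out.println("Inserted " + movieCount + " Movies");

        // ratings.dat: UserID::MovieID::Rating::Timestamp
        DBCollection ratings = db.getCollection("ratings");
        int ratingCount = 0;
        in = new BufferedReader(new FileReader("C:/Project/Recommender/ratings.dat"));
        while ((line = in.readLine()) != null) {
            String[] f = line.split("::");
            ratings.insert(new BasicDBObject("user_id", Integer.parseInt(f[0]))
                    .append("item_id", Integer.parseInt(f[1]))
                    .append("rating", Double.parseDouble(f[2]))
                    .append("timestamp", Long.parseLong(f[3])));
            ratingCount++;
        }
        in.close();
        System.out.println("Inserted " + ratingCount + " Ratings");

        // tags.dat (UserID::MovieID::Tag::Timestamp) is loaded the same way into a "tags" collection

        mongo.close();
        System.out.println("Completed in " + (System.currentTimeMillis() - start) / 1000 + " Seconds");
    }
}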
*********************
Inserted 10681 Movies
Inserted 95580 Tags
Inserted 10000054 Ratings
Completed in 685 Seconds
*********************
Run the mongo shell using the following command:
c:\mongodb\bin> mongo
Now verify that the movieLens database has been created. Some of the commands you can run are listed below:
show dbs
show collections
db.movies.findOne()
db.ratings.find().count()
Next, create Indexes:
db.ratings.ensureIndex({user_id:1, item_id:1})
db.movies.ensureIndex({genres:1});
Next, run a map-reduce job to create a new collection that stores the rating count per user (user_id/rating count):
map = function() { emit({user_id: this.user_id}, {count:1}) }
reduce = function(k, v) { var count = 0; v.forEach( function(x) { count += x['count']; }); return {count: count}; }
db.ratings.mapReduce(map, reduce, "user_ratings");
{
        "result" : "user_ratings",
        "timeMillis" : 520420,
        "counts" : {
                "input" : 10000054,
                "emit" : 10000054,
                "reduce" : 169150,
                "output" : 69878
        },
        "ok" : 1
}
Now we are ready to create our recommendation engine. The one we are going to implement is a "Collaborative Filtering" application, similar to Amazon or Netflix. The algorithms used by such sites are typically Weighted Slope-One or Singular Value Decomposition (SVD); Twitter, for instance, uses SVD. Apache Mahout (an open-source machine learning library for Big Data) also provides a recommendation engine. I have used a similar strategy here.
The idea used here is this:
Users rate movies on a scale of 1-5. To figure out which movies to recommend to a given user (user_1), we use the ratings he has already provided. First, find his k nearest neighbours (users who are similar to him in taste). From this list of users, collect the movies they have rated and build a weighted list of those movies based on their ratings, ignoring movies that user_1 has already rated. The building blocks for this are Manhattan distance, Euclidean distance, Minkowski distance, Pearson's coefficient and k-Nearest-Neighbours.
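To make that concrete, here is a condensed, simplified sketch of the core of such an engine. This is not the exact program whose output is shown below; it assumes the ratings have already been read from the ratings collection into an in-memory map of user_id -> (item_id -> rating), and it uses Pearson's coefficient for similarity.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Recommender {

    // Pearson correlation between two users, computed over the movies they both rated
    static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumA = 0, sumB = 0, sumA2 = 0, sumB2 = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb == null) continue;                        // only movies rated by both users
            double ra = e.getValue();
            sumA += ra; sumB += rb;
            sumA2 += ra * ra; sumB2 += rb * rb;
            sumAB += ra * rb;
            n++;
        }
        if (n == 0) return 0;
        double num = sumAB - (sumA * sumB) / n;
        double den = Math.sqrt((sumA2 - sumA * sumA / n) * (sumB2 - sumB * sumB / n));
        return den == 0 ? 0 : num / den;
    }

    // Recommend 'top' movies for 'user' from his k nearest neighbours, skipping movies he already rated
    static List<Integer> recommend(int user, Map<Integer, Map<Integer, Double>> ratings, int k, int top) {
        final Map<Integer, Double> mine = ratings.get(user);

        // rank all other users by similarity to 'user'
        List<Integer> others = new ArrayList<Integer>(ratings.keySet());
        others.remove(Integer.valueOf(user));
        final Map<Integer, Double> sim = new HashMap<Integer, Double>();
        for (Integer other : others) {
            sim.put(other, pearson(mine, ratings.get(other)));
        }
        Collections.sort(others, new Comparator<Integer>() {
            public int compare(Integer x, Integer y) {
                return Double.compare(sim.get(y), sim.get(x));   // descending similarity
            }
        });

        // weight each unseen movie by (neighbour similarity * neighbour rating)
        final Map<Integer, Double> scores = new HashMap<Integer, Double>();
        for (Integer neighbour : others.subList(0, Math.min(k, others.size()))) {
            double w = sim.get(neighbour);
            for (Map.Entry<Integer, Double> e : ratings.get(neighbour).entrySet()) {
                if (mine.containsKey(e.getKey())) continue;      // already rated by 'user'
                Double s = scores.get(e.getKey());
                scores.put(e.getKey(), (s == null ? 0 : s) + w * e.getValue());
            }
        }

        // return the 'top' highest-scoring movie ids
        List<Integer> result = new ArrayList<Integer>(scores.keySet());
        Collections.sort(result, new Comparator<Integer>() {
            public int compare(Integer x, Integer y) {
                return Double.compare(scores.get(y), scores.get(x));
            }
        });
        return result.subList(0, Math.min(top, result.size()));
    }
}
Manhattan, Euclidean or Minkowski distance could be swapped in for Pearson's coefficient in the pearson() method without changing the rest of the flow.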
The output of the program is:
*******************
Loaded 10000054 ratings
Loaded 10681 movies
Recommended for 16
Lord of the Rings: The Two Towers, The
Liar Liar
Rudy
Field of Dreams
Christmas Story, A
Gone in 60 Seconds
Remember the Titans
Striptease
Hoosiers
No Country for Old Men
Recommended for 127
Clockwork Orange, A
Fight Club
Reservoir Dogs
12 Monkeys (Twelve Monkeys)
Shawshank Redemption, The
Silence of the Lambs, The
Terminator 2: Judgment Day
One Flew Over the Cuckoo's Nest
Fargo
Shining, The
*******************
In the next post I will discuss in more detail how the recommendation system works and the algorithm behind it.