MongoDB Basics & Internals
Big Data is a buzz word of today. It means Velocity, Volume and Variety of data. Google processes 24 PetaBytes of data per day. FaceBook users upload 300 Million photos and 3.2 Billion Likes/Comments per day. These are some examples of Velocity and Volume. "Variety" refers to different types of data handled by these applications. For instance, Log files have unstructured/semi-structured data. This is not what RDBMS is designed for. They are suitable for structured data and volumes not exceeding TeraBytes. Since RDBMS are disk-based they are relatively slow (IO-bound). Indexing is not a solution, when data grows inserts will get slow. Joins are big bottlenecks and hence you have to go for de-normalization.
High volume sites were mostly built on LAMP (Linux, Apache, MySQL, PHP) stack. Load-Balancing was done at web-tier and application-tier levels. However this was difficult with data-tier. So one solution people used was Memcached, acting as a caching layer in front of data-tier. They utilize a cluster of MySQL instances to split the data among these clusters. And these clusters were replicated as well. Further optimizations were done on this - MySQL Handler Socket was used to tune 500K requests per second. One of the solutions to address high-volumes was to use CDN(Content Delivery Networks) like Akamai. HTTP Accelerators like Varnish were used to handle web-tier load.
When Google published papers on Big Table and Map-Reduce, it changed the game. Doug Cutting, who was working with Yahoo then, built Hadoop (an open-source implementation of big table and map-reduce). Several NoSQL databases appeared then. We can broadly categorize them as Document-Oriented (MongoDB, CouchDB), Graph (Neo4J), Columnar (HBase, Cassandra) Databases. MongoDB is a document-oriented database from 10Gen. FourSquare, ShutterFly, CodeAcademy are a few of the customers who uses MongoDB.
MongoDB works on JSON datatypes and each document is a JSON Document, internally stored as BSON (Binary representation of JSON). Its void of Joins and Transactions. It supports Geo-Spatial queries. This is a USP for FourSquare to use MongoDB. Indexing, Auto-Sharding and Replication support are few more things in MongoDB. I could configure my 3 laptops to store Millions of User data. I configured laptop-1 and laptop-2 as Shards with users A-M stored on laptop-1 and N-Z on laptop-2. And laptop-3 I configured as a replica of laptop-1. Replication was immediate (less than a sec) and even offline-replication was simply awesome (bring down the replica and start again, automatically synced up). Replica Sets support voting for choosing Primary. There is support for Map-Reduce with Mongo (built on top of Hadoop). MongoDB provides several tools - important ones are mongod, mongos, mongo, mongostat. MongoDB uses Memory-mapped files to map the files to virtual memory. Hence 64-Bit machines are preferred with MongoDB. Internally the data is stored as a 2 level Linked-List. When a database (e.g. movieLens) is created it creates 3 files automatically - movieLens.ns (16 MB), movieLens.0 (64 MB) and movieLens.1 (128 MB). The movieLens.ns file is a giant HashMap that stores the metadata of the database and the collections ( tables are called collections in MongoDB). The movieLens.0 and movieLens.1 files are data files and the numbers are increased sequentially, sizes increased exponentially till it achieves 2 GB. The internal architecture and setup of MongoDB is a very interesting and a long story (for next post).
Movie Recommendation
We will build a movie recommendation system with MongoDB to handle massive volume of data. I used the 10 Million movie ratings provided by MovieLens. Download the data (MovieLens 10M) from http://www.grouplens.org/node/73. Extract it to a location like C:\Project\Recommender. It consists of 3 main files - movies.dat, ratings.dat, and tags.dat. We will load this data to MongoDB. For this, download MongoDB from http://www.mongodb.org/downloads. Install MongoDB in some location like c:\mongodb. Create a directory c:\mongodb\data. Start mongod with this command -
c:\mongodb\bin> mongod --dbpath c:\mongodb\data
Now we will run the following Java application to import data onto MongoDB. For this we need MongoDB Java driver and set the Heap Size to 1 GB.
*********************
Inserted 10681 Movies
Inserted 95580 Tags
Inserted 10000054 Ratings
Completed in 685 Seconds
*********************
Run mongo using the following command
c:\mongodb\bin> mongod
Now verify that the movieLens database is created. Some of the commands you can run are as below-
show dbs
show collections
db.movies.findOne()
db.ratings.find().count()
Next, create Indexes:
db.ratings.ensureIndex({user_id:1, item_id:1})
db.movies.ensureIndex({genres:1});
Next run map-reduce task to create a new Collection to store userId/RatingCount
map = function() { emit({user_id: this.user_id}, {count:1}) }
reduce = function(k, v) { var count = 0; v.forEach( function(x) { count += x['count']; }); return {count: count}; }
db.ratings.mapReduce(map, reduce, "user_ratings");
{
"result" : "user_ratings",
"timeMillis" : 520420,
"counts" : {
"input" : 10000054,
"emit" : 10000054,
"reduce" : 169150,
"output" : 69878
},
"ok" : 1,
}
Now we are ready to create our Recommendation Engine. The one which we are going to implement is a "Collabarative Filtering" application, similar to Amazon or Netflix. The algorithms used by these sites are normally Weighted Slope-One or Singular Value Decomposition (SVD). Twitter for instance uses SVD. Apache Mahout (an open source Machine Learning solution for Big Data) provides Recommendation System. I have used a similar strategy here.
The idea used here is-
Users rate movies on a scale of 1-5. To know what movie to recommend for a given user (user_1), we use the ratings that he provided for movies. Find out k-neighbours (users who are similar to him in taste). From this list of users, find out the movies that they have rated. Get a weighted list of movies based on ratings, ignoring movies that he (user_1) has already rated. The basics for this are - Manhattan Distance, Eucledians Distance, Minkowski Distance, Pearson's Co-efficient and K-Nearest-Neighbors.
The output of the program is-
*******************
Loaded 10000054 ratings
Loaded 10681 movies
Recommended for 16
Lord of the Rings: The Two Towers, The
Liar Liar
Rudy
Field of Dreams
Christmas Story, A
Gone in 60 Seconds
Remember the Titans
Striptease
Hoosiers
No Country for Old Men
Recommended for 127
Clockwork Orange, A
Fight Club
Reservoir Dogs
12 Monkeys (Twelve Monkeys)
Shawshank Redemption, The
Silence of the Lambs, The
Terminator 2: Judgment Day
One Flew Over the Cuckoo's Nest
Fargo
Shining, The
*******************
Loaded 10000054 ratings
Loaded 10681 movies
Recommended for 16
Lord of the Rings: The Two Towers, The
Liar Liar
Rudy
Field of Dreams
Christmas Story, A
Gone in 60 Seconds
Remember the Titans
Striptease
Hoosiers
No Country for Old Men
Recommended for 127
Clockwork Orange, A
Fight Club
Reservoir Dogs
12 Monkeys (Twelve Monkeys)
Shawshank Redemption, The
Silence of the Lambs, The
Terminator 2: Judgment Day
One Flew Over the Cuckoo's Nest
Fargo
Shining, The
In the next post I will discuss more about how the Recommendation System works and the algorithm
Are you going to post the follow up?
ReplyDeleteAny chance to get the second part of this post? we missed the most important part of the recommendation system.
ReplyDeleteThank you
Please can anyone explain why this implementation of recommendation system is better or advantageous than many other implementations that are already available online on different sites??
ReplyDeleteany one can post the second part of the recommendation system. Please.
ReplyDeleteThe DataSet link updated - http://grouplens.org/datasets/movielens/1m/
ReplyDelete