Saturday, December 1, 2012

FaceBook - Write a function that can show the number of users online at any given time

You're given a period of time where users log in and log out and a set of login and log out times for those users. Facebook is asking you to create an efficient algorithm that will calculate a simple, but extremely important, metric.

Facebook currently has more than 1 Billion users and assume that each user is given a unique identifier like user_192. Hundreds of people login (and logout) to FB every second. There is one audit log maintained explicitly to store the time of login/logout and the userid. A sample log is given below-

[10/11/2012 10:25:06] Login user_4
[10/11/2012 10:28:55] Login user_6
[10/11/2012 10:29:19] Logout user_4
[10/11/2012 10:36:33] Login user_8

Using the above information, we need to come up with the most efficient algorithm (fast and low memory utilization) to calculate the users logged in at any time. For instance in the above example, if I query for 10/11/2012 10:29:00, there were 2 users logged in user_4 and user_6. Assume that there can be users who are logged into FaceBook the entire day (and never logout). Also assume that the entire 1 Billion users login every day. Most importantly, we need to consider boundary conditions as well. As in somebody logged in at 23:55:00 and logged out the next day at 01:17:12. 

There were a few solutions available, which had time complexity O(n log n). However, I was thinking if I could do i in 1 pass, which is O(n). And also, I didn't want to use the BitSet approach as I would have to create a BitSet of size 1 Billion every day. So, I came up with this solution.

Maintain a map, where the key is the date (10/11/2012). The value is an int[] of size 86400 (total number of secs in a day). The values of the array are pre-initialized to -1. Maintain a counter initialized to 0. Now go through the log file, for each log entry convert the time portion into secs. If the activity is "Login" then increment the counter, if "Logout" then decrement the counter. Goto the array position and set the value to the value of the counter. In the figure above, the blue window represents the time user_7 is logged in. The green window represents the time user_4 is logged in. The red window represents users who are logged in fo 2 overlapping days. So what we have achieved is, in linear time we have created our data structure.

Now lets see how we can do querying. Convert the time to seconds, get the corresponding value from the int array. If the value is not -1, then the value is the result. 
If the value is -1, from the current position of the array navigate back till you encounter a value which is not -1 (case a). While doing so, you might end up at the beginning of the array and still not found any value which is not -1 (case b). In case a, the value at the position is the result. In case b, we need to go back and refer to the previous day's map. In that map's int array, start from the end of the array till you find the value not -1. 
Will see with an example. In the figure above, at time 37411 we had 1 user. At time 37506 we had 2 users. So if I query for time 37506 we can directly say 2. If we query for 37500 we have 1 user. How did we arrive at this? At 37500 the value was -1. So we navigate left and at 37411 we get  value 1. That is the result.

Note:  In a real-world scenario the user log would not be a single text file, but a distributed file. Facebook have their own logging system PTail and Puma. And the back-end uses HBase over HDFS. So the log would be broken into multiple 64 KB pieces. Map-reduce job runs on each of these pieces in parallel building the above map.


LOG File

1 comment:

  1. Hi Raju , Nice illustration , Any reason you've used

    Below snippet instead of if (null==logCounter.dailyStats.get(date2)) .. do some processing

    if(!date2.equals(date1)) {
    counter = new int[86400];
    Arrays.fill(counter, -1);
    date1 = date2;
    logCounter.dailyStats.put(date2, counter);

    Also Map-Based approach something strike in my mind , Can you tell me that O(nlogn) algorithm you said in beginning ?