Friday, December 14, 2012

Sort a very large text file on limited memory


We are given a very large text file of size say 10 GB and we need to sort the file and store it in an output location. But the memory (RAM) is limited, say 1 GB.


1. Brute-Force Approach
Read the input file line by line using BufferedReader and store each line in a TreeSet (assume there are no duplicate lines). Once the file is completely read, store the contents of the TreeSet into the output file.
For my analysis, I used a 400 MB file containing line items. I extracted this from TPC-H dataset. You can use any large file. When I ran my program with 512 MB memory (VM Argument set to -Xmx512M), I got OutOfMemory. So I set it to 1GB, it ran fine. The time taken was 7 seconds.
So as we can observe, if we need to sort 10GB then we would need around 15-20 GB RAM. The memory utilization is by the TreeSet to hold the data.

2. External Sort
In this approach, we will only read the input file line by line and store it into the TreeSet. This step is similar to above, but we would not store the entire file in the TreeSet. If the file is 10 GB and our RAM capacity is 1 GB, we will store 512MB data in the TreeSet. Once it becomes 512 MB we flush it into disk, on a temporary file (call it temp1.txt). Repeat this procedure of writing into temporary files, till you read 10 GB completely. So we will have 20 sorted temporary files. Next step is to merge these 20 files into a single sorted file and delete the temporary files. We call it a K-Way merge algorithm. Consider the example below

temp1:  { dog, fan, jug }
temp2:  { egg, kit, rat, sun }
temp3:  { ant, gun, hat}

read the first items - dog, egg, ant. The smallest of it, write to output.txt and remove the element from temp file. To the list, add the next item from the removed file. So we will remove ant from temp3 and put it to output.txt. So in the list we will have - dog, egg. Add gun to the list (since we removed ant from temp3). Now next smallest is dog add to output.txt and add fan to the list. Repeat till all the elements are added.

I first implemented this method to sort a 400 MB file on a 40 MB RAM. The time taken was 22 seconds.
This is definitely a gain in the space-complexity at the cost of increased time. Can we do it better?

Next I utilized Memory-Mapped-Files (MMAP). For this we need to consider how Java does file IO. Data in the file system (secondary storage or hard-disk) has to be brought to memory (RAM). To do this, we can either read character by character, or read a chunk of data at a time. The time taken by file IO is huge, compared to the CPU processing time. So its better to reduce the frequency of file reads, hence we use BufferedReader.
But if we use MMAP, then a file or portion of a file can be mapped directly to memory. To achieve this, it uses SWAP space (or virtual memory). Every system has this virtual memory and is different from the RAM. In Java, objects are created on the Heap or RAM or memory as we call it. But if we bring data to swap space then the data can be read directly without an operating system call (which is magnitude of times slower). You cannot Memory Map a full 10 GB file, so we can MMAP a portion of the file. The other advantage of MMAP is, multiple processes can share the same file (thread-safe). In Java, we have the classes for this in NIO package. FileChannel is the class to hold the data.

With this approach, I memory mapped the 400 MB file for 8KB size. Read this 8KB data and store it in a TreeSet. Repeat 5000 times, and store the 40 MB data in a temporary file (we will have 10 temporary sorted files). Then apply k-way merge.
Using this approach, the program ran in 17 secs ( with a 5 secs gain ). Its generally, atleast 20% faster than the BufferedReader.





  1. You don't seem to have heard of replacement-selection or polyphase merging. It can be done far more efficiently than this.

  2. For fast sorting of a large file using Java NIO with lower memory profile please follow the following link :

    This describes exploiting java NIO file operations along with java's non blocking CAS data structures to sort the data with less memory.Uses k-way merge sort algorithm.The blog uses Spring Bath for illustration, but the concept can be reused without using Spring batch as well.

  3. The third code example doesnot sort it. What is the solution?

  4. Read to use bash script for sorting TB scale data on a regular machine with couple of GB ram: