Dealing with Terabytes of Data in F#

In one of our current projects, our algorithms have to process close to 1 TB (terabyte) of raw ASCII logs. Fortunately, the only analysis we need is a single pass through the data that collects a small number of statistics per log line (for example, counting the number of log lines that pass a certain criterion).

With a dataset of this size, reading everything into memory is out of the question; the data has to be processed line by line. The central .NET/F# data structure we are using is IEnumerable, a memory-efficient and lazy way of enumerating collections of any type. Here is a short piece of F# code that provides an IEnumerable over all log lines (using the new generate_using function that Don put into the standard library after my posting):

#light

open System.IO

open System.Collections.Generic

/// Creates an IEnumerable over the lines of any text file.
/// The function does not check whether the file exists!

let CreateDataStream (fileName:string) =
    IEnumerable.generate_using
        ( fun () -> new StreamReader (fileName) )
        ( fun reader -> if (reader.EndOfStream) then None else Some (reader.ReadLine()) )
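For readers on a current F# release, where IEnumerable.generate_using is no longer part of the core library, the same lazy stream can be sketched with a sequence expression. This is only a minimal sketch under that assumption, not the code used in the project; the names createDataStream and countMatching are hypothetical and chosen for illustration.

```fsharp
open System.IO

/// Lazily yields the lines of a text file, one at a time.
/// The reader is opened when enumeration starts and disposed when it ends.
let createDataStream (fileName: string) : seq<string> =
    seq {
        use reader = new StreamReader(fileName)
        while not reader.EndOfStream do
            yield reader.ReadLine()
    }

/// Counts the lines that satisfy a predicate in a single pass.
/// An int64 counter is used because a terabyte of logs may hold
/// more lines than fit into a 32-bit integer.
let countMatching (predicate: string -> bool) (lines: seq<string>) : int64 =
    lines
    |> Seq.fold (fun count line -> if predicate line then count + 1L else count) 0L
```

A call might look like createDataStream "server.log" |> countMatching (fun (line: string) -> line.Contains("ERROR")), where both the file name and the criterion are made up for the example. Only one line is held in memory at a time, apart from the reader's internal buffer.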

However, during development one often wants to run and test the code without waiting hours for the full terabyte to be processed, only to find an off-by-one error in the counting. Of course, one could write a little helper tool that copies only the first, say, 10 megabytes of the full data file, and then process this much smaller file during development. However, this seems inelegant and leads to a lot of replication of the same data on the file system. A much better way is to use this short function, truncate:

module IEnumerable = begin

    /// Truncates a given IEnumerable to at most its first n elements.

    let truncate n (x: #IEnumerable<'a>) =

      IEnumerable.generate

          ( fun () -> ref 0,x.GetEnumerator() )

          ( fun (i,ie) -> if !i >= n or not (ie.MoveNext()) then None else (incr i; Some(ie.Current)) )

          ( fun (_,ie) -> ie.Dispose () )

end

The nice thing about this truncation is that it has practically no computational overhead (other than testing and incrementing an integer) and needs no temporary memory. Here is a short piece of test code for this function:

/// Test the truncate.

do [| 0;1;2;3;4;5;6;7;8;9 |] |> IEnumerable.truncate 4 |> IEnumerable.iter (printf "i = %d\n")

do read_line () |> ignore
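On current F# releases the core library offers Seq.truncate, which, like the function above, yields at most n elements of a sequence. The following sketch shows how a development run might cap the number of log lines without copying any data. It is an illustration under stated assumptions rather than our project code: File.ReadLines stands in for the data stream, and countErrors, maxLines, and the "ERROR" criterion are hypothetical.

```fsharp
open System.IO

/// Counts the lines containing "ERROR" in a log file.
/// Pass Some n to look at no more than the first n lines (development runs),
/// or None to process the whole file.
let countErrors (maxLines: int option) (fileName: string) : int64 =
    // File.ReadLines enumerates the file lazily, line by line.
    let lines = File.ReadLines fileName
    let limited =
        match maxLines with
        | Some n -> Seq.truncate n lines   // yields at most n lines
        | None -> lines
    limited
    |> Seq.sumBy (fun line -> if line.Contains("ERROR") then 1L else 0L)
```

Switching between a quick test and the full run is then a matter of passing Some 100000 or None, with the same file on disk in both cases.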

Ralf Herbrich

P.S.: Thanks to Don Syme and James Margetson for helping us with the truncate function!!!