In one of our current projects, our algorithms have to process close to 1 TB (terabyte) of raw (ASCII) logs. Fortunately, the only analysis we need is a single pass through the data that collects a small number of statistics per log line (for example, counting the number of log lines that pass a certain criterion).
With a dataset of this size, reading everything into memory is out of the question; we have to process it line by line. The central .NET/F# data structure we are using is IEnumerable, a memory-efficient and lazy way of enumerating through collections of any type. Here is a short piece of F# code that provides an IEnumerable over all log lines (using the new generate_using function that Don put into the standard library after my posting):
open System.IO

/// Creates an IEnumerable over the lines of any text file.
/// The function does not check whether the file exists!
let CreateDataStream (fileName:string) =
    IEnumerable.generate_using
        ( fun () -> new StreamReader (fileName) )
        ( fun reader -> if (reader.EndOfStream) then None else Some (reader.ReadLine()) )
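For readers on a current F# release, the same idea can be sketched with a sequence expression. This is only an illustration under assumptions: `readLines`, `countMatching`, the sample file and the "ERROR" criterion are hypothetical names and data, not part of our project, and `seq` stands in for the IEnumerable module used in this post.

```fsharp
open System.IO

/// Lazily yields the lines of a text file; nothing is read until enumeration starts.
/// (Hypothetical helper; like CreateDataStream, it does not check that the file exists.)
let readLines (fileName: string) : seq<string> =
    seq {
        use reader = new StreamReader(fileName)
        while not reader.EndOfStream do
            yield reader.ReadLine()
    }

/// Counts the lines that satisfy a predicate in a single pass over the stream.
let countMatching (predicate: string -> bool) (lines: seq<string>) : int64 =
    lines |> Seq.fold (fun n line -> if predicate line then n + 1L else n) 0L

// Made-up sample data: a tiny log written to a temporary file.
let samplePath = Path.GetTempFileName()
File.WriteAllLines(samplePath, [| "INFO start"; "ERROR disk"; "INFO ok"; "ERROR net" |])

// Example criterion (an assumption): the line contains "ERROR".
let errorCount =
    readLines samplePath |> countMatching (fun l -> l.Contains("ERROR"))

printfn "errors = %d" errorCount
File.Delete samplePath
```

Only one line is held in memory at a time, so the memory footprint should stay flat regardless of the file size.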
However, during development one often wants to run and test the code without waiting for hours until the full terabyte is processed, only to find that there is an off-by-one error in the counting. Of course, one could write a little helper tool that takes only the first, let's say, 10 megabytes of the full data file, and then process this much smaller file during development. However, this seems very inelegant and leads to a lot of replication of the same data on the file system. A much better way is to use this short function, truncate:
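module IEnumerable = begin
    /// Truncates a given IEnumerable after at most n elements.
    let truncate n (x: #IEnumerable<'a>) =
        IEnumerable.generate
            // Open: a fresh counter plus the enumerator of the source.
            ( fun () -> ref 0,x.GetEnumerator() )
            // Step: stop after n elements or when the source is exhausted.
            ( fun (i,ie) -> if !i >= n or not (ie.MoveNext()) then None else (incr i; Some(ie.Current)) )
            // Close: release the enumerator of the source.
            ( fun (_,ie) -> ie.Dispose () )
end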
The nice thing about this truncation is that it has practically no computational overhead (other than testing and incrementing an integer) and needs no temporary memory. Here is a short piece of test code for this function:
/// Test the truncate.
do [| 0;1;2;3;4;5;6;7;8;9 |] |> IEnumerable.truncate 4 |> IEnumerable.iter (printf "i = %d\n")
do read_line () |> ignore
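To make the "no overhead" claim a bit more tangible, here is a small sketch for a current F# release. It assumes that the built-in `Seq.truncate` plays the role of the truncate function above; `produced` and `source` are hypothetical names, and the unbounded sequence merely stands in for a huge log file.

```fsharp
// Counts how many elements the source was actually asked to produce.
let produced = ref 0

/// An unbounded source standing in for a very large log file.
let source : seq<int> =
    seq {
        let mutable i = 0
        while true do
            produced.Value <- produced.Value + 1
            yield i
            i <- i + 1
    }

// Take only the first four elements; the rest of the source is never touched.
let firstFour = source |> Seq.truncate 4 |> Seq.toList

printfn "taken = %A" firstFour
printfn "elements produced by the source = %d" produced.Value
```

Because the truncated sequence stops asking for elements once the limit is reached, the same pipeline can be pointed at the full file during development without reading more than the first n lines.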
P.S.: Thanks to Don Syme and James Margetson for helping us with the truncate function!!!