Dealing with Terabytes of Data in F#

In one of our current projects, our algorithms have to process close to 1 TB (terabyte) of raw (ASCII) logs. Fortunately, the only analysis we need is a single pass over the data that collects a small number of statistics per log line (for example, counting the number of log lines that pass a certain criterion).

With a dataset of this size, reading it all into memory is out of the question; it has to be processed line by line. The central .NET/F# data structure we are using is IEnumerable, a memory-efficient and lazy way of enumerating through collections of any type. Here is a short piece of F# code that provides an IEnumerable over all log lines (using the new generate_using function that Don put into the standard library after my posting):



open System.IO
open System.Collections.Generic

/// Creates an IEnumerable through the lines of any text file.
/// The function does not check if the file exists already!
let CreateDataStream (fileName:string) =
    IEnumerable.generate_using
        ( fun () -> new StreamReader (fileName) )
        ( fun reader -> if (reader.EndOfStream) then None else Some (reader.ReadLine()) )
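On a current F# release, where generate_using is not part of the core library, roughly the same lazy stream can be sketched with a sequence expression. This is only an illustration under that assumption: the names readLines and countMatching and the "ERROR" criterion are hypothetical and do not come from the original code.

```fsharp
open System.IO

/// Lazily yields the lines of a text file, one at a time.
/// The reader is opened when enumeration starts and disposed when it ends.
let readLines (fileName: string) : seq<string> =
    seq {
        use reader = new StreamReader(fileName)
        while not reader.EndOfStream do
            yield reader.ReadLine()
    }

/// Counts the lines that satisfy a predicate in a single pass.
/// Only the current line and the running count are kept in memory.
let countMatching (predicate: string -> bool) (lines: seq<string>) : int64 =
    lines
    |> Seq.fold (fun count line -> if predicate line then count + 1L else count) 0L

// Hypothetical usage: count the lines that contain "ERROR".
// let errors = readLines "server.log" |> countMatching (fun l -> l.Contains("ERROR"))
```

The count is an int64 because a terabyte of short log lines can exceed the range of a 32-bit integer.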

However, during development one often wants to run and test the code without waiting for hours until the full terabyte is processed, only to find that there is an off-by-one error in the counting. Of course, one could write a little helper tool that copies the first, say, 10 megabytes of the full data file, and then process this much smaller file during development. However, this is inelegant and leads to a lot of replicated data on the file system. A much better way is to use this short function, truncate:

module IEnumerable = begin

    /// Truncates a given IEnumerable to at most n elements.
    let truncate n (x: #IEnumerable<'a>) =
        IEnumerable.generate
          ( fun () -> ref 0,x.GetEnumerator() )
          ( fun (i,ie) -> if !i >= n || not (ie.MoveNext()) then None else (incr i; Some(ie.Current)) )
          ( fun (_,ie) -> ie.Dispose () )

end
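The three functions passed above play the roles of open, step, and close: create the counter and the enumerator, produce the next element or stop, and dispose of the enumerator. As a rough sketch of the same logic for a current F# release (an assumption, since this post targets an earlier library), it can be written as a sequence expression. The name truncateSketch is hypothetical.

```fsharp
/// Yields at most n elements of the source, then stops enumerating it.
let truncateSketch (n: int) (source: seq<'a>) : seq<'a> =
    seq {
        // "open": acquire the enumerator; `use` disposes it at the end ("close").
        use ie = source.GetEnumerator()
        let i = ref 0
        // "step": stop once n elements were produced or the source is exhausted.
        while i.Value < n && ie.MoveNext() do
            i.Value <- i.Value + 1
            yield ie.Current
    }
```

Because the counter is checked before MoveNext is called, the source is never asked for an element beyond the n-th.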


The nice thing about this truncation is that it has practically no computational overhead (other than testing and incrementing an integer) and needs no temporary memory. Here is a short piece of test code for this function:

/// Test the truncate.

do [| 0;1;2;3;4;5;6;7;8;9 |] |> IEnumerable.truncate 4 |> IEnumerable.iter (printf "i = %d\n")

do read_line () |> ignore
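Current F# releases ship Seq.truncate in the core library, which returns a sequence that yields at most the first n elements of its input. The sketch below assumes such a release rather than the library version used in this post. It checks the laziness claim by counting how many elements are actually pulled from a large source; the names pulled and source are illustrative.

```fsharp
// Records how many elements the source sequence has produced so far.
let pulled = ref 0

/// A lazy source standing in for a huge log file (one million items).
let source =
    seq {
        for i in 0 .. 999999 do
            pulled.Value <- pulled.Value + 1
            yield i
    }

// Only the truncated prefix is ever enumerated.
let firstFour = source |> Seq.truncate 4 |> Seq.toList

firstFour |> List.iter (printfn "i = %d")
printfn "elements pulled from the source: %d" pulled.Value
```

The reported count stays tiny compared with the million elements the source could produce, which is what makes truncation cheap enough to leave in the pipeline during development.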

Ralf Herbrich

P.S.: Thanks to Don Syme and James Margetson for helping us with the truncate function!!!

Comments (46)

  1. Anonymous says:

    Ralf, Phil and Thore in the MSR Cambridge Applied Games Group have been continuing their work using F#
