Wiki Life: Collecting Stats

This week I'd like to give you a sneak peek at one of the tools that I use in my Wiki Life.

I use it to generate the Top Contributors of the Week Awards.

It's a very simple web crawler, written in C# and WPF.

This won't win any style awards, you understand; it's a quick and dirty tool to get the stats I need, written mostly in one evening for the task I had been given.

Below is a short (six-minute) video that shows the tool in action, with a small (24-hour) date range.

 

[View:https://www.youtube.com/watch?v=ufaReEg36FE]

 

Please forgive the lack of audio track commentary.

Here is an outline of what you are seeing:

  • First I scan through all the pages from the "Updated Pages" section of the Wiki
  • Then (about 50 seconds into the video) I scan the revision history page for each article that was found
  • Next (around 2 minutes in) I check each article's revisions again, discarding old revisions and examining the "revision compare" page for each revision within our date range
  • From this information, I construct a thumbnail image of each document's changes
  • Finally (at 5 minutes, 30 seconds) the tool does one last pass through all the revisions for all the articles and checks which article was quickest to be updated by another user (one of the awards)
  • At 6 minutes I show the resulting collection of image files generated from the crawl
  • I finish with a quick glance through the columns and sort options that help me quickly generate the Saturday charts
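To give a feel for that final pass, here is a minimal sketch of how a "quickest follow-up edit by another user" check could work. This is not the tool's actual code; the `Revision` record and `Awards` class are invented for illustration, assuming the crawl has already collected an article name, author, and timestamp per revision:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified model of what the crawl collects per revision.
public record Revision(string Article, string Author, DateTime When);

public static class Awards
{
    // For each article, find the shortest gap between a revision and the
    // next revision made by a *different* user, then return the article
    // with the smallest such gap overall.
    public static string QuickestFollowUp(IEnumerable<Revision> revisions)
    {
        string winner = null;
        TimeSpan best = TimeSpan.MaxValue;

        foreach (var group in revisions.GroupBy(r => r.Article))
        {
            var ordered = group.OrderBy(r => r.When).ToList();
            for (int i = 1; i < ordered.Count; i++)
            {
                // Consecutive edits by the same user don't count as a follow-up.
                if (ordered[i].Author == ordered[i - 1].Author) continue;

                var gap = ordered[i].When - ordered[i - 1].When;
                if (gap < best) { best = gap; winner = group.Key; }
            }
        }
        return winner;
    }
}
```

The real tool works over the revision data it gathered in the earlier passes; the point here is just that once the revisions are in memory, the award itself is a simple group-sort-and-compare.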

 

Any fellow developers out there may ask why I physically load each page, instead of just processing the raw HTML responses from the server.

The answer is that many of these pages, such as the revision compare pages, generate their content with JavaScript that runs after the page loads, so that content is not available in the raw HTML response.

For this reason, I have to physically load the page, wait for the JavaScript to pull in the page content, THEN read the page.
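In WPF, that load-then-wait dance could be sketched roughly like the following. This is a guess at the general approach rather than the tool's actual code: the URL is a placeholder, and the "revision-diff" marker I poll for is an invented example of some element the page's JavaScript injects when it finishes:

```csharp
using System;
using System.Windows;
using System.Windows.Controls;
using System.Windows.Threading;

public class CrawlerWindow : Window
{
    private readonly WebBrowser browser = new WebBrowser();

    public CrawlerWindow()
    {
        Content = browser;
        browser.LoadCompleted += OnLoadCompleted;
        browser.Navigate(new Uri("https://example-wiki/revision-compare?id=123")); // placeholder URL
    }

    private void OnLoadCompleted(object sender, System.Windows.Navigation.NavigationEventArgs e)
    {
        // LoadCompleted fires when the raw page has arrived, but the page's
        // JavaScript may not have injected the compare content yet, so poll
        // the rendered DOM on a timer instead of reading it immediately.
        var timer = new DispatcherTimer { Interval = TimeSpan.FromMilliseconds(500) };
        timer.Tick += (s, args) =>
        {
            dynamic doc = browser.Document;           // COM document object
            string html = doc?.body?.innerHTML ?? ""; // rendered DOM, not the raw response
            if (html.Contains("revision-diff"))       // invented marker element
            {
                timer.Stop();
                ProcessComparePage(html);             // parse the now-complete DOM
            }
        };
        timer.Start();
    }

    private void ProcessComparePage(string html)
    {
        // ...extract the change stats from the rendered HTML...
    }
}
```

Polling like this is crude but simple; it is also why each page costs a few seconds rather than a few milliseconds.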

This makes for a slow crawl (around two hours for a whole week's worth of data), but it works fine as a background job.

 

There are still plenty of stats I plan to collect and present over the coming months.

If you have any ideas for other awards we could present from this data, please let us know and I will try to include them in future crawls.

 

Regards,

Peter Laker