Experimenting with Visual Studio 2010 and backing up the entries on my blog

I spent some time today experimenting with Visual Studio 2010. As I learned a long time ago, the best way to experience a development environment or programming language is to try to implement a solution with it. So I set out to create a small tool to download all my blog posts at https://blogs.technet.com/josebda. There are 301 posts (not counting this one) and the service I use only shows them a page at a time. My goal was to get the entire text of the 301 posts into a single HTML file.

I started by looking for some sort of API to do this. I was hoping for a web service or some sort of RESTful interface. I did not find any, but that’s OK. I was experimenting with the environment, so the fact that I had to write a little more parsing code was fine. My basic tools were C#, the WebBrowser control and the components for file I/O. None of this is new in C# or .NET 4.0, but that was fine by me, since the goal was to experiment with the environment.

I just finished the tool, which ended up as a Windows Forms app with a few controls to select the type of blog (MSDN or TechNet), the name of the blog (josebda, in my case) and the destination file (of type HTM). When you click the GO button, it loads the pages one by one, using a URL like https://blogs.technet.com/josebda/default.aspx?p=1. It then looks for the DIV tags within each page, specifically the ones with a “class=post” attribute, and uses a TextWriter to write them to the output file. The resulting 2MB file is available from https://www.barreto.us/josebda.htm.

Blog Backup 

One of the main challenges was handling the asynchronous nature of the webBrowser.Navigate method, which returns before the page has finished loading. I had to set up an event handler, webBrowser_DocumentCompleted, that sets a flag when the page is loaded. In the main code, I wait in a loop, calling DoEvents so that events keep being handled. Here’s a code snippet:

...

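for (intPage = 1; !boolNoMorePosts; intPage++)
{
    // Build the URL of the next page, e.g. https://blogs.technet.com/josebda/default.aspx?p=1
    strPageURL = "https://blogs." + comboBlogType.Text + ".com/" + textBlogName.Text + "/default.aspx?p=" + intPage.ToString();
    // Clear the flag before navigating (assumed here; the DocumentCompleted handler sets it to true)
    boolLoaded = false;
    webBrowser.Navigate(strPageURL);
    // Navigate returns immediately, so keep processing events until the page is loaded
    while (!boolLoaded) { System.Windows.Forms.Application.DoEvents(); }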

...
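The other half of that loop is the handler that sets the flag. The following is only a minimal sketch of how it could look, not the code from the tool: the BackupForm class, the NavigateAndWait helper and the use of about:blank as a test page are assumptions made for illustration, while webBrowser, boolLoaded and webBrowser_DocumentCompleted are the names used above. Because DocumentCompleted can fire once per frame, the sketch checks ReadyState before setting the flag.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical minimal form; only webBrowser, boolLoaded and the handler name come from the post.
public class BackupForm : Form
{
    private readonly WebBrowser webBrowser = new WebBrowser();
    private bool boolLoaded;

    public BackupForm()
    {
        webBrowser.Dock = DockStyle.Fill;
        Controls.Add(webBrowser);
        webBrowser.DocumentCompleted += webBrowser_DocumentCompleted;
    }

    // DocumentCompleted may be raised for each frame, so the flag is set
    // only once the whole document reports that it is complete.
    private void webBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        if (webBrowser.ReadyState == WebBrowserReadyState.Complete)
        {
            boolLoaded = true;
        }
    }

    // Hypothetical helper: start a navigation and pump events until the handler sets the flag.
    private void NavigateAndWait(string url)
    {
        boolLoaded = false;
        webBrowser.Navigate(url);
        while (!boolLoaded) { Application.DoEvents(); }
    }

    protected override void OnShown(EventArgs e)
    {
        base.OnShown(e);
        NavigateAndWait("about:blank");   // assumption: a blank page as a harmless test target
        Text = "Page loaded";
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new BackupForm());
    }
}
```

A loop around DoEvents is simple but keeps the CPU busy while waiting; it is fine for a small one-off tool like this one.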

The second interesting part was handling the HTML output in order to get just the posts out of it. It was actually not hard: I used GetElementsByTagName to get all the “<DIV>” tags and then checked whether each item in the returned list starts with “<DIV class=post>”. I also added some code to show the titles of the posts (using a listBox control) as they were processed. Here's another code snippet:

...

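colPageDivs = webBrowser.Document.GetElementsByTagName("div");
// Assume this page has no posts until one is found; this is what ends the page loop
boolNoMorePosts = true;
for (intPost = 0; intPost < colPageDivs.Count; intPost++)
{
    // Posts are the DIVs whose markup starts with the "post" class attribute
    if (colPageDivs[intPost].OuterHtml.StartsWith("\r\n<DIV class=post>"))
    {
        boolNoMorePosts = false;
        // Write the post to the output file, separated from the previous one by a rule
        twOutput.WriteLine("<HR>" + colPageDivs[intPost].OuterHtml);
        // Show a running count and the post title (the text of the first link in the post)
        listPages.Items.Add((listPages.Items.Count + 1) + ":" + colPageDivs[intPost].GetElementsByTagName("A")[0].InnerHtml);
    }
}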

...

I'm sure there are more efficient ways to do this, but it worked fine for my requirement and it also performed well enough. I also suspect that the site does have an API and I just don't know where to find the documentation for it. However, it was a nice way to play with VS2010 and its new IDE.

Overall, there are a number of improvements in navigation and editing. IntelliSense worked great as my primary method for learning about the different properties and methods of the objects in the solution. I hardly ever had to refer to the help pages. One of many small things that I found particularly interesting was how the IDE now highlights a word everywhere in the code when you select that word. It's very useful for reading the code. The entire IDE felt very responsive and I did not hit any bugs myself.

There are a few things I would still like to do: fix the title links (they are relative links, not absolute ones), address the fact that the pictures are not captured (ideally I would back those up as well) and add some error handling code. Next, I am planning to experiment with some of the new technologies like Silverlight or Windows Azure.
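For the relative title links, one possible approach is to resolve each link against the URL of the page it came from using the Uri class. This is only a sketch of the idea, not code from the tool: the LinkFixer class, the ToAbsolute helper and the example link path are hypothetical.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class LinkFixer
{
    // Resolves href against the URL of the page it was found on.
    // An href that is already absolute comes back unchanged;
    // if the combination cannot be parsed, the original href is returned.
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri absolute;
        return Uri.TryCreate(new Uri(pageUrl), href, out absolute)
            ? absolute.AbsoluteUri
            : href;
    }

    public static void Main()
    {
        // The page URL is the one from the post; the link path is a made-up example.
        string page = "https://blogs.technet.com/josebda/default.aspx?p=1";
        Console.WriteLine(ToAbsolute(page, "/josebda/archive/example.aspx"));
        // prints https://blogs.technet.com/josebda/archive/example.aspx
    }
}
```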