MVP Series: This is the Data You’re Looking For

For the month of October and November we are thrilled to have special guest authors from the Canadian MVP Award Program contributing posts around their favourite tips, tricks & features of SQL 2014. For the next few weeks, we will be posting a different article from one of our Canadian SQL Server MVPs each week. We hope you enjoy them, please feel free to leave a comment

Todd McDermidTodd McDermid is a Data Warehouse Lead with Metrie, a leading manufacturer and distributor of mouldings, doors, and interior finishing products in North America. He’s passionate about building up tools and breaking down barriers between the business and valuable insights derived from their own data.

As a SQL Server MVP who works with data warehouses all day, I’m a big fan of building resilient automation into every process. That includes where your data comes alive most often – in spreadsheets. It would be great if your users took advantage of new features, like those in Power BI. But honestly, most spreadsheets don’t – because enterprises are typically years behind the curve of adopting new technology, and analysts stick to tried-and-true techniques. So push for better use of what you have now as well – even if it’s Excel 2007 against a plain relational SQL Server database.

Spreadsheet users love to use VLOOKUP to help communicate their analyses to readers. Sometimes it’s used directly in the presentation layer to display an end result. Quite frequently, it’s used internally in the workbook to massage vast tables of data. But using it does have its weak points:

  • The “key” value you’re looking up has to be the first column in the range you specify to VLOOKUP.  This means you sometimes have to rearrange your data in a way to fit the function of the spreadsheet, rather than the function of the business purpose it’s used for – or duplicate and hide a bunch of columns.

  • The value you want returned specifies the column as an offset from the start of the range.  This means that if you ever insert or delete columns inside that range… your VLOOKUP (at best) breaks.  At worst, it still returns values… but completely wrong ones!  This happens more often than you might think...

  • You have to type out the formula.  Sure, you can use selection by mouse to fill in the ranges, but everything else is typed and prone to error.  Especially the return column index number.

Wouldn’t it be great if we could give those analysts a way to get what they wanted while adding in a little “enterprise-grade” resiliency into their automation? Specifically, I’m looking to provide methods that will fill similar use cases as VLOOKUP, but remove the weaknesses above.

Our Test Case

Who doesn’t want a Surface Pro 3? I’d take one… but which one? Let’s do some research to help us demonstrate the alternatives to VLOOKUP. First, I’ve made a dataset of Surface Pro 3 specs and prices by using Power Query pointed at a Wikipedia article (link: https://en.wikipedia.org/wiki/Microsoft\_Surface\_Pro\_3). After some cosmetic data manipulation, I’ve got a table in my spreadsheet. Download it from here (https://onedrive.live.com/redir?resid=C76C3CBDB305921!33316&authkey=!AMfU-vY7Tksnyms&ithint=file%2cxlsx) to follow along. The “Surface Pro 3 Dataset” sheet has the results of the Power Query, and “Traditional VLOOKUP” has a simple lookup defined on it.

Keep in mind this is for demonstration purposes only – if this were the real use case, there would be no point applying my suggestions – this data is too simple! But when you scale up to larger and more complex datasets, what comes next isn’t overkill – it’s necessary.

Improving VLOOKUP a Step at a Time

Step 1 - MATCH

One of the problems we mentioned with VLOOKUP is the column index parameter. If you change the Power Query to alter the column order, your VLOOKUP fails. Even worse… it may not be obvious how to fix it, because your VLOOKUP referred to the “old column” by a number, which may not give you enough information to figure out which “new column” you should be referring to.

This is easy to fix by using the MATCH function. Instead of typing a number in to the parameter, we’re going to use MATCH to find the column name we want, return the index of it, and have VLOOKUP use that.

Using MATCH changes this

To this

=VLOOKUP(CONCATENATE([CPU model],"-",[Internal Storage]),Models,7,FALSE)

=VLOOKUP(CONCATENATE([CPU model],"-",[Internal Storage]),Models,MATCH("Price Tier (USD)",Models[#Headers],0) ,FALSE)

That solves the VLOOKUP dependence on column order. The eagle eyed will now correctly point out that we have introduced a dependence on column name. This is still an improvement, because the function self-documents that you’re looking for price. If we happen to change the column name, we can use search and replace to help fix the dependent formulas. You can’t do that with a vanilla VLOOKUP.

Step 2 – INDEX and MATCH

Another problem we talked about with VLOOKUP is the range given in the second parameter. The limitation is that the “key” column must always be to the left of the value you’re looking for. Not a large constraint, but it can negatively affect the design of your tables, and the size of your workbook.

We can address that by using the INDEX and MATCH functions together rather than the VLOOKUP alone. MATCH is doing half the work – finding the row with the model; INDEX is doing the other half – returning the price on the row.

Using INDEX and MATCH changes this

To this

=VLOOKUP(CONCATENATE([CPU model],"-",[Internal Storage]),Models,7,FALSE)

=INDEX(Models[Price Tier (USD)],MATCH(CONCATENATE([CPU model],"-",[Internal Storage]),Models[Model Key],0))

We’ve solved several problems here. The first about column order is solved, because INDEX only needs to know the column of return values, and MATCH only needs to know the column of lookup values. We’ve solved the dependence on column order as in Step 1, just differently. Even better, the new function isn’t dependent on column name either. If we rename a column in our source table, the function fixes itself. We do still need that “model key” column in the source data however…

Use PivotTables instead of Tables, with GETPIVOTDATA

Now it’s time for something completely different. This could cause anxiety in your analysts, because it’s not VLOOKUP, or something very similar like INDEX. But it can be a LOT easier to use, and is very resilient to change. It does only work for values that you’re aggregating – it won’t do a plain lookup for a text value.

Your analysts probably use Pivot Tables from time to time already. (Sometimes it’s just because they still want to use VLOOKUP instead of a SUMIF.) I’ve seen analysts build a Pivot Table to summarize numbers, then use a VLOOKUP on it to rearrange the values into their presentation format, or as a basis for charting. Because when all you have is a hammer, everything looks like a nail.

This does involve creating a pivot table where we never needed one before – but as you know, in complex data situations, if your data resides in Excel, it is likely going to be summarized in a pivot table. Or, you’re querying a PowerPivot model, SSAS cube, or other source that can be presented in pivot table form.

Either way – looking up a value in a pivot table is going to be the easiest thing you’ve ever started – no documentation required. To get started, just press “=” in a cell, and go find the value you want. You’ll end up with something like this:

=GETPIVOTDATA("Price Tier (USD)",$F$1,"CPU model","i5","Internal Storage","128 GB")

You can’t build a VLOOKUP, INDEX, or MATCH that easily. From here, linking in the dynamics to match the other queries is simple – and very resilient to sheet changes.

Using GETPIVOTDATA changes this

To this

=VLOOKUP(CONCATENATE([CPU model],"-",[Internal Storage]),Models,7,FALSE)

=GETPIVOTDATA("Price Tier (USD)",$F$1,"CPU model",[CPU model],"Internal Storage",[Internal Storage])

A quick breakdown of the parameters to GETPIVOTDATA (see the full MSDN page here) is straightforward: The measurement we want (Price Tier), the location of the Pivot Table, then a string of filter name/value pairs. If you play with the function, you’ll see you can add or remove filters as much as you like (as long as you use legal names!) and it just works. The key point is that whichever filters you use – make sure that value is actually visible in the pivot table. If it’s not, the function won’t find the value and will return #REF. Any other manipulation of the pivot table itself – moving elements from rows to filters to columns makes no difference.

Tomatoe, Tomahto

There are many ways to get something done in Excel – we’ve just shown three ways to look up a price based on attributes. Download the sample workbook and try it out yourself – all of the alternatives work just fine. But keep resiliency to change in mind when you’re building your spreadsheets, and you’ll be happy when someone tells you the workbook you made years ago still just works.