Simply Fed! Develop a simple data feed from a third party database. By Tony Rogerson

Tony Rogerson, SQL Server MVP
Chairman, UK SQL Server Community

tonyrogerson@torver.net

Torver Computer Consultants Ltd

http://www.sqlserverfaq.com

Introduction

There have been many occasions where clients have asked me to do some sort of reporting or analysis of data that physically resides on a third party system, often managed by the third party themselves. The biggest difficulty this poses is the effect that reporting directly against the third party database may have on the online production system: it might cause performance problems, or even locking contention on the production box.

 

My main experience is writing financial feeds from various internal and external systems across a range of databases: Oracle, Sybase, SQL Server and even Microsoft Access. The world of finance is unforgiving; if a feed breaks you severely get your bottom kicked, and a problem may cost many thousands of pounds.

 

This article will show a simple approach I use to create an effective platform for providing reporting and analysis without too much effect on the production database.

Review what you can do

See what your third party supplier will allow you to do. Some are switched on and will give you limited access to the production database, although they will probably ask to review what you are doing to make sure you aren't trying to drag across a 30 million row table every 5 minutes. Others are far less forthcoming and will only offer you a file export - and probably an export of everything rather than just the changes!

The key here is to be diplomatic; hopefully you will have a good working relationship with your supplier, be they an internal department or an external company.

Try to get past the salesman and talk to somebody technical; the biggest cause of problems is poor communication of what you are trying to do. Before approaching your data vendor, have a simple requirements document showing at a high level what you are trying to achieve, both in terms of the end result and what you expect of the data vendor.

Technique

The optimal solution is to bring across as little data as possible - only the changes that have happened since the feed last ran. This is quite difficult: consider how you handle updates and, especially, deletes. If the source system uses an incrementing ID as a surrogate key then you are fine for inserts; otherwise more thought is required.

Trickle Feed Data (Source Database Techniques)

There are a number of techniques for handling this; we will discuss two that I’ve used myself. This section shows techniques you can use on the source database; we discuss the destination database later.

Timestamp Column – Source Database

This is probably the best solution, but it only copes with inserts and updates; deletes need some more thought.

This involves an additional column on the source system, if one is not already there, holding the date and time the row was last updated.

You store, per table, the highest date/time brought across. You don't use the current date/time because you cannot be sure that the source and destination server clocks are in sync. When pulling data, the SQL would then be something like SELECT … FROM … WHERE last_updated >= {last pull highest date/time}. Use >= because you may have missed some rows with the same date/time - unlikely, but it could happen!
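To make this concrete, here is a minimal sketch of the idea. The FeedWatermark and Staging_Trades tables, the linked server name SourceServer and the column names are all inventions of mine for illustration; it is the pattern that matters.

CREATE TABLE FeedWatermark (
    TableName   varchar(128) NOT NULL PRIMARY KEY,
    LastPulled  datetime     NOT NULL
)

DECLARE @last_pulled datetime

-- Pick up the high-water mark stored by the previous run.
SELECT @last_pulled = LastPulled
FROM FeedWatermark
WHERE TableName = 'Trades'

-- Pull only the rows changed since then (>= to be safe).
INSERT Staging_Trades ( TradeID, Amount, last_updated )
SELECT TradeID, Amount, last_updated
FROM SourceServer.SourceDB.dbo.Trades
WHERE last_updated >= @last_pulled

-- Advance the mark to the highest source date/time actually pulled,
-- never the local clock on the destination server.
UPDATE FeedWatermark
SET LastPulled = ( SELECT MAX( last_updated ) FROM Staging_Trades )
WHERE TableName = 'Trades'
  AND EXISTS ( SELECT * FROM Staging_Trades )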

 

For deletes you will need a 'Trigger Table' to store the deleted rows' primary keys and the date/time they were deleted - see the next section.

Trigger Table – Source Database

This involves a central table with a structure something like the one below:-

 

CREATE TABLE TriggerTable (
    TableName   varchar(128)    NOT NULL,
    Operation   char(1)         NOT NULL CHECK ( Operation IN ( 'U', 'I', 'D' ) ),
    PK_1        varchar(100)    NOT NULL,
    PK_2        varchar(100)    NULL,
    PK_3        varchar(100)    NULL,
    EntryDate   datetime        NOT NULL DEFAULT ( getdate() )
)

 

A trigger is required on each table in the feed on the source database; the trigger writes a row into the trigger table whenever an Insert, Update or Delete occurs. PK_1 to PK_3 store the affected row's key values: for single column primary keys you use just PK_1, for composite keys you use PK_2, PK_3 and so on. I've used varchar(100) rather than something like int because character primary keys won't convert to int, but integers do convert to varchar. If you only have integer keys then use int instead. Remember you will need to CAST the primary key value to varchar if it isn't already - and watch the size, although a primary key longer than 100 bytes isn't common in my experience. Better still, use surrogate keys: put an auto-incrementing number on the table and put a unique constraint on it.
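As a rough illustration, a delete trigger on a hypothetical source table called Trades (with an integer key TradeID) might look like the sketch below; the insert and update triggers follow the same pattern with 'I' and 'U'.

CREATE TRIGGER trg_Trades_Delete ON Trades
FOR DELETE
AS
    -- Record the key of every deleted row; EntryDate is left to its default of getdate().
    INSERT TriggerTable ( TableName, Operation, PK_1 )
    SELECT 'Trades', 'D', CAST( d.TradeID AS varchar(100) )
    FROM deleted AS d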

EntryDate is used by the feed to know what it has already brought across; it works in the same way as the timestamp column - don't use the date/time the feed runs, use the date/time of the last pulled row instead, i.e. MAX( EntryDate ). The SQL for the feed would then be something like this…

 

SELECT tt.EntryDate, bt.Col1, bt.Col2, …
FROM TriggerTable AS tt
    INNER JOIN BaseTable AS bt ON bt.yourprimarykey = tt.PK_1
WHERE tt.EntryDate >= { MAX(EntryDate) from previous feed pull }

 

Name the columns on the SELECT; never use SELECT * - that is dangerous because you have no idea what order the columns will come over in!

Trickle Feed Data (Destination Database Techniques)

There are lots of potential problems to overcome on the destination database: source database schema changes, being blamed for performance problems, causing performance problems, and minimising the amount of data brought across the link.

Schema (Table Structures) – Source and Destination

How do you cope when your provider changes column sizes or types, or removes or adds columns?

My favourite method is to use VIEWS over the source system so you can define your own interface rather than being directly exposed to the source system schema.

You can use NULLable columns (except for the primary key) for your destination tables; this helps you deal with situations where a column has been dropped from the source system. If a column is dropped, you simply change the view to return NULL for that column.
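As a sketch, assume a hypothetical source table Trades from which a Region column has since been dropped; the feed selects from the view, so the feed itself does not need to change.

CREATE VIEW feed_Trades
AS
SELECT  TradeID,
        Amount,
        CAST( NULL AS varchar(10) ) AS Region   -- column dropped at source; the interface stays the same
FROM    dbo.Trades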

Keeping check on Performance

Use an audit log to keep track of execution statistics: when the feed starts to pull data, when it finishes and how many rows were brought across. If the source database ever has performance problems this will help diagnose the fault. It's quite easy for a third party to blame a feed when the problem actually lies elsewhere!

You can also build this into a feed monitor: for instance, if data hasn't been pulled across for a particular table within a set duration then there is almost certainly a problem and you can raise an alert.
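A minimal sketch of both ideas is below; the FeedAudit table and the one-hour threshold are illustrative only. The feed writes one row per table per run, and the monitor query flags any table that has gone quiet.

CREATE TABLE FeedAudit (
    TableName   varchar(128) NOT NULL,
    StartTime   datetime     NOT NULL,
    EndTime     datetime     NULL,          -- set when the pull completes
    RowsPulled  int          NULL
)

-- Feed monitor: list tables whose last completed pull finished more than an hour ago.
SELECT TableName, MAX( EndTime ) AS LastCompletedPull
FROM FeedAudit
WHERE EndTime IS NOT NULL
GROUP BY TableName
HAVING MAX( EndTime ) < DATEADD( hour, -1, getdate() )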

Minimising Data Transfer

I’ve already mentioned ways on the source database to capture when rows are updated. We now need to make use of these on the destination database.

My favourite method for implementing a feed is to have a number of database tables that hold the metadata for the tables, columns and keys I'm pulling across. I then use dynamically built SQL to execute the feed logic. The benefit of this approach is that you don't have lots and lots of stored procedures that are logically the same and differ only in the table name and columns they bring across. The other benefit is that your testing can be concentrated on a smaller amount of code.
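A much-simplified sketch of the idea is below. FeedTable, the FeedWatermark table from the earlier sketch and the Staging_<table> naming convention are all my own inventions; a real implementation would also carry column types and key definitions, and would loop over the metadata rather than hard-code one table.

CREATE TABLE FeedTable (
    TableName   varchar(128)  NOT NULL PRIMARY KEY,
    ColumnList  varchar(1000) NOT NULL,     -- comma separated list of columns to pull
    KeyColumn   varchar(128)  NOT NULL
)

DECLARE @sql nvarchar(4000), @table varchar(128), @cols varchar(1000), @last_pulled datetime

SELECT @table = TableName, @cols = ColumnList
FROM FeedTable
WHERE TableName = 'Trades'

SELECT @last_pulled = LastPulled
FROM FeedWatermark
WHERE TableName = @table

-- Build the pull statement from the metadata rather than hand-writing a procedure per table.
SET @sql = 'INSERT Staging_' + @table + ' ( ' + @cols + ' ) '
         + 'SELECT ' + @cols
         + ' FROM SourceServer.SourceDB.dbo.' + @table
         + ' WHERE last_updated >= @last_pulled'

EXEC sp_executesql @sql, N'@last_pulled datetime', @last_pulled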

The frequency at which the feed is executed needs to be determined bearing in mind how long the feed takes to run, which depends on how much data is brought across. I tend to go for 30 or 15 minute intervals.

 

Destination Processing

Don't directly populate your destination base tables; use staging tables instead. You'll find it easier to test and debug problems, and it also reduces locking contention. If you populate directly, your transaction will be open a lot longer - for the whole time it takes to bring the data across, which is dependent on the source database performance and network performance.
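Carrying on with the illustrative names used earlier, the overall shape of one feed cycle is therefore something like this:

-- Slow part: pull across the link into the staging table only; no locks are held on the base table.
DECLARE @last_pulled datetime
SELECT @last_pulled = LastPulled FROM FeedWatermark WHERE TableName = 'Trades'

TRUNCATE TABLE Staging_Trades

INSERT Staging_Trades ( TradeID, Amount, last_updated )
SELECT TradeID, Amount, last_updated
FROM SourceServer.SourceDB.dbo.Trades
WHERE last_updated >= @last_pulled

-- Fast part: apply the changes to the base table inside one short local transaction,
-- using the DELETE / UPDATE / INSERT statements shown below.
BEGIN TRAN
    -- deletes, updates and inserts against the base table go here
COMMIT TRAN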

The order of applying modifications to your destination base tables is DELETEs, then UPDATEs, then INSERTs. If you did UPDATEs after INSERTs you would probably end up updating rows that you've just inserted, thereby increasing IO and load. The SQL will be something like below:-

 

DELETE basetable
WHERE primarykey IN ( SELECT primarykey FROM stagingtable )

UPDATE basetable
    SET col1 = st.col1, ….
FROM stagingtable AS st
WHERE basetable.primarykey = st.primarykey

INSERT basetable ( col1, …. )
SELECT col1, …
FROM stagingtable AS st
WHERE NOT EXISTS (
    SELECT *
    FROM basetable AS bt
    WHERE bt.primarykey = st.primarykey
    )

Using Microsoft SQL Server

I prefer using SQL Server because it offers a very flexible environment for getting data out of many different sources, be they Oracle, Sybase, text files, XML or an AS/400.

You can use ‘Linked Servers’ to gain access into external systems and query them using SQL.
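For example, once a linked server has been defined (here called SourceServer; the server, database, table and column names are illustrative), the source can be queried with a four-part name or via OPENQUERY:

-- Four-part name: SQL Server manages the remote connection for you.
SELECT TradeID, Amount
FROM SourceServer.SourceDB.dbo.Trades
WHERE last_updated >= '20040101'

-- OPENQUERY: the quoted statement runs on the remote server as written,
-- so the filtering is done at the source rather than after everything has been pulled across.
SELECT TradeID, Amount
FROM OPENQUERY( SourceServer,
    'SELECT TradeID, Amount FROM SourceDB.dbo.Trades WHERE last_updated >= ''20040101''' )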

For more complex feeds or data requirements you can use Data Transformation Services (DTS), which allows you to add custom logic using VBScript.

Another beauty of SQL Server is that you get very powerful reporting and business intelligence tools out of the box. SQL Server Reporting Services can be used to provide very powerful reports and web interaction. SQL Server Analysis Services offers a comprehensive business intelligence solution with data mining and pivot tables amongst a host of other features.