Here are some notes on “SQL Server 2008 Integration Services (SSIS)” I took while attending an advanced class on SQL Server taught by Greg Low (from http://sqlblog.com/blogs/greg_low/ and http://www.sqldownunder.com/).
Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without some pre-requisite knowledge and the right context. Reading the post from top to bottom will help.
SQL Server Integration Services (SSIS)
- See http://www.microsoft.com/sqlserver/2008/en/us/integration.aspx
- Tooling moved to Visual Studio
- When installing, select Business Intelligence Development Studio
- Templates, including Integration Services Connection
- See http://msdn.microsoft.com/en-us/library/ms141026.aspx
- See http://msdn.microsoft.com/en-us/library/ms141711.aspx
- Upgrade from DTS addressing previous issues: performance, language, all-in-one design surface
- Separating: Control Flow, Data Flow, Event Handlers, Package Explorer
- See http://msdn.microsoft.com/en-us/library/ms137973.aspx
- Open Visual Studio
- Create new project
- In Control, drag a two items from the toolbox (execute process task)
- Write notes to the control surface, text for documentation
- Noting the red (X) indicating something is missing
- Adding properties for each task (run calc.exe)
- Two calculators running at the same time, boxes turn yellow (running)
- Execution Results – your best friend, good place to look at what happened
- Package Explorer – Looking at the two tasks created
- Creating a sequence – green line means execute on success
- Check properties of the the green line: success/failure/completion
- Adding a third task. One runs on completion, one on success.
- Adding a fourth task. Depends on another. Two starting points now.
- Options to select two lines, configure properties, change from AND to OR…
- Unlike DTS, looping containers. Processing file by file in a batch, for instance.
- Looking at the Foreach.
- Set the collection property. For each file, For each time, For each SMO (database).
- Drop a couple task inside.
- Much simpler than DTS, no need for specific variables, etc.
- Two tasks, data flow in the middle, links
- Some data flows from task to task.
- Data flow has its own tab, since the data flowing can be harder to describe.
- In the Toolbox: Data Source (like ADO.NET), Data Transformation.
- Drop an ADO.NET source
- Connection – look at objects, switch to SQL command, write a query
- Look at preview – very helpful
- Look at columns – can change the name in the “Output Column”
- Error output – What do you want to do when error occurs (you can fail, redirect, send to error output)
- There’s also a connection object
- Drop a Flat File Destination, connect the two.
- Flat File – Has a connection also (?!). File name, format, columns, preview
- Mapping – What input column goes to what output file columns
- Run the package. Data flow is red… Not good.
- Look at Progress… Can’t find column at the data source.
- Look at data source and mappings again.
- Run again, it’s happy…
- Look at transformations: Derived column.
- Drop a derived column transformation.
- Properties, add as new columns, create an expression.
- Type an expression to add two columns, typecast for integer, return as Unicode.
- Expressions, typecasts (WSTR=Unicode)
- Add new flat file destination, now shows the derived column
- Run again. Note – run in data flow tab also runs everything, not just data flow.
- Looking at the resulting text file
- Go to SSMS, create a new database
- Create TableA (ID int identity, description varchar(30))
- Create TableB (ID int , description varchar(30))
- Insert a few rows in table A.
- Now back Visual Studio, to move data from TableA to TableB
- Set up the connection to the database
- Add ADO.NET data source, use a SQL command to
- Different between ADO.NET destination and SQL database?
- Let’s try ADO.NET destination… Not really what we want.
- Trying SQL Server destination… Create new connection. Select database, TableB.
- Mapping are OK. Validation error: cannot convert between Unicode and non-Unicode (?!)
- DTS didn’t have any type safety … Now we do!
- Note – Everything it does internally is Unicode… Description is coming up as Unicode.
- How to fix? Add a data conversion. Input, Output, Data Type… Fix the mapping. Run. OK!
- You can create your down data source with a script
- Source, Destinatio or Transformation.
- Looking at the properties of the script (script, input columns, inputs and ouput, connection)
- Script language options: Visual Basic 2008, Visual C# 2008
- You can then drop to code and write whatever you want (you get a starting template).
- You can also have a script task.
- Note – This ends up as compile code, so it has good performance
- Note – You can also write your own components in .NET
Also in the toolbox
- Sources: XML, Excel
- Destination: Data Mining model, SQL Server CE
- Transformations: Most are straightforward
- Using the fuzzy look transformation
- Only enterprise edition
- Looking for similar columns.
- You set the “similarity threshold” under “advanced”
- Useful for data de-duplication.
- Lookup transformation
- Important for looking up data in data warehouse
- Looking for the key that’s different in the DW (surrogate keys in DW environment)
- Inferred members – I have a fact about the member, but I don’t know the member
- Inferred members in the error output in the lookup
- Slowly changing dimensions – Customer region as it is now or as it was then?
- Type two slowly changing dimension – Surrogate keys for column, associated with a start/end date
- Lookup performance is important, since you might have several lookups as you bring data in
- Optimizations - Cache mode: full cache (connection type), partial cache, no cache
Other transformations in data flow
- Merge Join – sort order
- FTP – Basic, considering writing your own with .NET
- Web Service task – Also basic, remember you have full .NET support here…
- Whole set of system variables
- Add your own variables
- Can be scoped to package or component
- Scripts: ReadOnly or ReadWrite variables
- Project files, can add different types of files.
- Looking at project properties. Can create a deployment utility.
- Use the build option. Create a package.
- Looking at the files that are part of it (.sln, .dtsx).
- In bindeployment – bits to deploy
- DTS package is just an XML file. You can open it with notepad.
- Config and SSISDeploymentManifest are also XML.
- Copy the files, right-click the manifest and click deploy, create desired agent jobs.
- You can hard code connections in the package. Could be trouble down the line.
- Consider abstracting those.
- Check enable package configuration.
- Where do you want to store the configuration? Default is an XML file, good choice
- Other options: SQL Server, parent package variable, Registry, Environment variable
- Configuration properties: Long list of items, including database connections, working directory
- In connection, for instance, you can check the box for “Server Name”
- Look at XML configuration file. It still says <DTSConfiguration> 🙂
- You can change it and run again with a custom value as you run in a different environment
- Design for that upfront. Don’t put that on later…
- You could build the entire package in code.
- SSIS Developer Center in MSDN
- See http://msdn.microsoft.com/en-us/sqlserver/cc511477.aspx
- Webcasts on SSIS on MSDN site
- See http://www.sqlis.com/
- Codeplex examples
- See http://ssisctc.codeplex.com/
- Consider your source code control options.
- Improved Package Upgrade
- New ADO.NET Component
- New Data Profiling Task and Data Profile View – understand the data
- Connections Project Wizard – for connection-only projects
- Script Environment
- Improved Data type handing in Import/Export wizard
- Faster bulk loading
- Support for date/time data types, higher precision options
- Support for MERGE statement – option working row by row
- Support for Composable SQL – OUTPUT clause to track inserted, deleted rows, $action
- Upgrade advisor: Run before any upgrade (from DTS or SSIS 2005), reports on issues
- Package Migration Wizard: Creates new SSIS package, leaves the old one alone
- Package Migration Wizard – Careful – Can’t really deal with certains things, like transactions
- From SSIS 2005: Careful – Package upgrade is one-way only
- From DTS 2000: Consider using the DTS 2000 package task, dtsrun.exe still available
- From DTS 2000: Careful – ActiveX script tasks
- From DTS 2000: Careful – Connection related issues
- From DTS 2000: You really should consider re-writing … There are much simpler ways now!
- Consider works on SSIS 2008 ahead of time, before you install the SQL Server 2008 database
- Design-time validation – Tries to connect to data sources to validated, can work offline
- Set breakpoints, locals, call stack, immediate window, row count transform
- For the data flow, consider using only a subset of data
- Data Viewer – Helps you see the data as it is flowing through
- Data Viewers - Commonly used with a Grid, also available as data charts
- Logging, log providers (SQL, text, etc..)
- Consider adding your own logging using the logging infrastructure
- Consider using package-level event-handler for OnError event (send e-mail, perform task)
- In SSMS – Connect to Integration Services (not associated with a specific SQL instance)
Change Data Capture
- Ideal for populating data warehousing
- Allows incremental loads
- Debug dump files (.mdmp and .tmp) provide info about what happened
- Consider using LSN to track instead not time (there are functions to convert)
- Return as BINARY(10), need to deal with conversions
- Books online example
- See http://msdn.microsoft.com/en-us/library/bb895315.aspx
- Layered security. Both SQL and File System-based scenarios.
- Packages can be encrypted (levels), digitally signed, stored in SQL, secure with file system ACLs
- Can use SQL Server Agent to run SSIS Packages
- See http://support.microsoft.com/kb/918760/
- Where to store credentials? Consider storing the sensitive credential info in a database
- Keep packages simple, readable. Use comments, naming conventions.
- Consider dividing things into multiple packages, multiple containers.
- Understand the business: Kimball’s ETL and SSIS books
- Start with some blueprint design patters or ETL frameworks
- See Erik veerman, rushabh Mehta, pass summit 2008
- Careful – Have enough threads but understand asynchronous tasks, parallelism
- Consider using more T-SQL statements, MERGE statement comes to mind
- Careful – Avoid SELECT * and defining just “table”, which does a SELECT *
- Careful – Don’t bring data you don’t need: filter data in data source, use T-SQL there as required
- Careful – Consider optimizing the queries you use, covering indexes with INCLUDE
- Careful – Use IsSorted if input is ordered. SSIS won’t know otherwise…
- Consider using FastParse for text data that are clean
- Understand the data types returned by the databases, queuing systems
- Design your deployment strategy from the beginning, use configurations.
- Buffers: SSIS uses large buffers to optimize the process
- Careful - Width of the row used to define the size of buffer
- Consider removing output components to optimize buffer use
- Partially blocking components: Merge, Merge Join, Union All
- Blocking components: Sort, Aggregate
- CPU: Pipeline with multiple threads at once. Control with EngineThreads, MaxConcurrentExecutables
- Transactions: Default isolation is Serializable. Do you really need that?
- Discussion – Where to run SSIS – On same database server or another server? It depends 🙂
Data Warehouse Loading
ETL – Extract, Transform, Load
- ELT sometimes makes more sense…
- Can do some work at extract time, use advanced queries, typecasts
- Dimensions – Loaded first typically. Consider using multiple packages, run in parallel.
- Slow changing dimensions. Some built-in support. Might need to customize…
- Discussion – What to do with inferred members?
- Careful – Lookup transform is case sensitive
- Snowflakes – Dimensions that are subsets of other dimensions. Loading order is important
- Fact tables – Load dimensions first. Consider some advanced Lookup strategies, like MRU caching
- Data mining – SSIS can help populate a data mining model. Data mining query component.
Top 10 SQL Server Integration Services Best Practices
SSIS Team blog