Lookup is one of many transformations available in SQL Server Integration Services (SSIS). Its purpose is to perform a JOIN of two datasets: input columns from a data source and columns in a reference dataset – a table, view or a result of a query. Lookup can be used in various scenarios but this post focuses on how to use it to implement a typical ETL package that performs incremental loads into a Data Warehouse.
Here’s a simplified description of how Lookup works: it tries to match each incoming record with one or more records in the reference table. If there is a match, the incoming record is sent to the match output. If there is no match, Lookup generates an error by default but it can also be configured to send the data to the no match output. There are some more considerations that will be described later on.
ETL PACKAGE WITH LOOKUP
Here’s a complete package that implements incremental updates to the Address table in a Data Warehouse. It connects to a source database, takes all Address records that were either added or modified in last 3 days and then, for each input record, it performs either INSERT or UPDATE in the target Address table:
Let’s take a closer look at each element of this package:
The ODBC data source is a MySQL database from which we’re importing recently added or updated Address records:
Configuring the Lookup Transformation starts by specifying that rows with no matching entries should be sent to the no match output:
That was all we need to do with the Lookup. The next step is to properly handle the two outputs to INSERT non-matched and UPDATE matched records.
After selecting the table, we must go to the Mappings tab and confirm that all columns are properly mapped. In our case where the source and target tables have identical structures, this mapping will happen automatically. When column names are different, however, we need to do manual mapping by dragging and dropping source columns over respective target columns:
The UPDATE part of this package requires some more work and even some coding. We will first write a stored procedure that updates a single record in the target table:
CREATE PROCEDURE [dbo].[Address_update] @address_id [INT] , @contact_fname [NVARCHAR](64) , @contact_lname [NVARCHAR](64) , @contact_phone [NVARCHAR](32) , @contact_phone_ext [NVARCHAR](16) , @address1 [NVARCHAR](128) , @address2 [NVARCHAR](64) , @city [NVARCHAR](64) , @state [NVARCHAR](48) , @postal_code [NVARCHAR](20) , @country_code [NVARCHAR](5) , @address_type [NVARCHAR](5) , @record_status [NVARCHAR](2) , @creation_time [DATETIME] , @last_update_time [DATETIME] , @shipping_address_type_code [NVARCHAR](6) , @address3 [NVARCHAR](64) , @address4 [NVARCHAR](64) , @last_update_timestamp [DATETIME] , @contact_phone2 [NVARCHAR](32) , @contact_fax [NVARCHAR](32) , @email [NVARCHAR](128) , @company [NVARCHAR](128) , @static_flag [SMALLINT] , @ups_daily_pickup_flag [TINYINT] , @default_shipping_location_flag [TINYINT] , @address_name [NVARCHAR](100) , @default_billing_location_flag [TINYINT] AS UPDATE [dbo].[Address] SET address_id=@address_id , contact_fname=@contact_fname, contact_lname=@contact_lname, contact_phone=@contact_phone, contact_phone_ext=@contact_phone_ext, address1=@address1, address2=@address2, city=@city, state=@state, postal_code=@postal_code, country_code=@country_code, address_type=@address_type, record_status=@record_status, creation_time=@creation_time, last_update_time=@last_update_time, shipping_address_type_code=@shipping_address_type_code, address3=@address3, address4=@address4, last_update_timestamp=@last_update_timestamp, contact_phone2=@contact_phone2, contact_fax=@contact_fax, email=@email, company=@company, static_flag=@static_flag, ups_daily_pickup_flag=@ups_daily_pickup_flag, default_shipping_location_flag=@default_shipping_location_flag, address_name=@address_name, default_billing_location_flag=@default_billing_location_flag WHERE address_id = @address_id
Back to our SSIS package, we will use the OLE DB Command transformation that runs a SQL statement for each row in the incoming data flow and, as you can guess, we’ll be calling the stored procedure that we just created. The first tab specifies the connection manager:
Then we go to the Column Mapping and link available input columns with available SP parameters. Unlike with INSERT, there will be no automatic mapping so we will have to manually draw lines from left to right:
There’s a couple of things to keep in mind when using Lookups:
- If there are multiple matches in the reference table, the Lookup transformation returns only the first match returned by the lookup query.
- If there is more than one input record with the same Id and those are not matched, the package will try to INSERT all of them and will fail due to the primary key violation. If we can’t change the data source and we want to tolerate this then we need to configure the Destination to either ignore errors or handle them as appropriate.
- The lookups performed by the Lookup transformation are case sensitive! If we JOIN on character data, we must use the Character Map transformation to convert the data to uppercase or lowercase.
- The reference datasets are cached in memory by default until the execution of the package is completed. Depending on the dataset size and available memory, we might need to disable caching to avoid out-of-memory errors.
Did you use Lookup in your ETL or other scenarios and are there any stories you’d like to share?