Extracting Data from Unstructured Flat Files Using SSIS

1. Introduction

SQL Server Integration Services (SSIS), the successor to Data Transformation Services (DTS), provides a comprehensive solution for transferring and transforming data between diverse data sources. The architecture of SSIS has been redesigned to separate package control flow from data flow. This article deals mainly with extracting data from unstructured flat files.

2. Flat file with double quotes

2.1 Text file

In this example we use SQL Server Integration Services (SSIS) to import data from a CSV file. Every column in the CSV file has double quotes around the data, and we have to remove those double quotes during the import.

Here is the sample CSV file as it looks in a text editor. We can see that all of the columns have double quotes around the data, even where there is no data. The file is comma delimited, so this should give us enough information to import the data column by column.

"114343","John","Hyderabad India","","PIN 8767878"
"114344","Will Smith"," Bangalore India","","PIN 456666"
"114345","James","Delhi India","","PIN 7898999"
"114346","Linda","Vizag India","","PIN 45656656"
"114347","Stuart","Pune India ","","PIN 66758768"
"114348","Raj","Chennai India","","PIN 1414445"
"114349","Jason","Jaipur India","","PIN 6787878"

Justin Antony

2.2 Creating Data flow task

To create the package we use a Data Flow Task and then use the Flat File Source as our data flow source.

2.3 Flat file source

When setting up the Flat File Connection for the data source we enter the information below, basically just selecting our source file.

If we do a quick preview of the dataset, we can see that every column has the double quotes, even the columns where there is no data.

On this screen we can see the highlighted area and the entry that is made for the "Text qualifier". Here we enter the double quote character ("), and this allows SSIS to strip the double quotes from all columns.

If we do another preview, we can see that the double quotes are now gone, and we can move on to the next part of our SSIS package development.

This is a simple fix for the problem. The same technique can be used to strip any other text qualifier from flat files.
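The effect of the text qualifier can be checked outside SSIS. The following is a minimal sketch in Python, not part of the SSIS package; it assumes the standard csv module as a stand-in for the Flat File Source, and the names SAMPLE and read_qualified are hypothetical.

```python
import csv
import io

# Two rows copied from the sample file in section 2.1.
SAMPLE = (
    '"114343","John","Hyderabad India","","PIN 8767878"\n'
    '"114344","Will Smith"," Bangalore India","","PIN 456666"\n'
)


def read_qualified(text, qualifier='"', delimiter=","):
    """Parse delimited text and strip the text qualifier from every field."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=qualifier)
    return [row for row in reader]


rows = read_qualified(SAMPLE)
for row in rows:
    print(row)
```

Empty columns come back as empty strings, and spaces inside the qualifier are preserved, so any trimming would still be a separate step.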

3. Flat file with varied columns

3.1 Text file:

The Flat File Connection Manager is used to define the format of an external file so that we can load its contents into the SSIS pipeline. Usually such files contain rows that all have the same format, but sometimes they don't, and that's where the problems start. Here's an example of such a file. The file is comma delimited, but the rows do not all have the same number of columns. The CSV file looks like this:

0, 1, 2

1,2,3,4,5,6,7

1, 2,3,4,5

A, B, C, D, E, F, G

3.2 Extracting using SSIS Import/Export wizard:

We can import the flat file using the SSIS Import/Export wizard:

1. Right click on SSIS packages.
2. Select SSIS Import Export wizard.
3. Click the Next button.
4. Select Flat file source as the data source.
5. Browse to the text file.

However, there were issues loading the file using the SSIS Import and Export Wizard, because not all of the rows in the file have the same number of columns.

3.3 Develop a script task:

To get this file loaded properly in SSIS, we need to develop our own package with a Script Component. Below is the layout for the Data Flow.

Here's the flat file connection manager. Note that we are not using a column delimiter. The approach here is to load all the data for each row into a single column and break it apart in the Script Component.

Within the properties for the Script Component, we have one input column (Column0) and create one output column for each possible column in the file.

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Public Class ScriptMain
        Inherits UserComponent

        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
            Row.Col1 = Tokenise(Row.Column0, ",", 1)
            Row.Col2 = Tokenise(Row.Column0, ",", 2)
            Row.Col3 = Tokenise(Row.Column0, ",", 3)
            Row.Col4 = Tokenise(Row.Column0, ",", 4)
            Row.Col5 = Tokenise(Row.Column0, ",", 5)
            Row.Col6 = Tokenise(Row.Column0, ",", 6)
            Row.Col7 = Tokenise(Row.Column0, ",", 7)
        End Sub

        'Private function that parses out the columns of the whole row
        Private Function Tokenise(ByVal input As String, ByVal delimiter As String, ByVal token As Integer) As String
            Dim tokenArray As String()
            tokenArray = input.Split(delimiter.ToCharArray) 'Split the string by the delimiter
            If tokenArray.Length < token Then 'Protect against a request for a token that doesn't exist
                Return ""
            Else
                Return tokenArray(token - 1)
            End If
        End Function
    End Class
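To see what the Tokenise logic returns for each row of the sample file, the same idea can be sketched outside SSIS. This is an illustrative Python version, not the package code; the names tokenise, split_row and MAX_COLUMNS are hypothetical, and the sample rows are taken from section 3.1.

```python
MAX_COLUMNS = 7  # one output column for each possible column in the file


def tokenise(line, delimiter, token):
    """Return the token-th field of line (1-based), or "" if it does not exist."""
    fields = line.split(delimiter)
    if len(fields) < token:
        return ""
    return fields[token - 1]


def split_row(line, delimiter=","):
    """Spread one raw line over MAX_COLUMNS columns, padding with empty strings."""
    return [tokenise(line, delimiter, n) for n in range(1, MAX_COLUMNS + 1)]


for line in ["0, 1, 2", "1,2,3,4,5,6,7", "1, 2,3,4,5", "A, B, C, D, E, F, G"]:
    print(split_row(line))
```

Short rows are padded with empty strings rather than rejected, and the spaces after the commas in the sample stay in the values, so a trim step may be needed before loading.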

3.4 SQL Destination:

Create a destination table with one column for each possible column in the flat file.

CREATE TABLE RESULT (Col1 VARCHAR(5),Col2 VARCHAR(5),Col3 VARCHAR(5), Col4 VARCHAR(5),Col5 VARCHAR(5),Col6 VARCHAR(5), Col7 VARCHAR(5))

3.5 Execution:

To run this package:
1. Click on the Debug menu and click Start Debugging.
2. After the package has completed running, on the Debug menu click Stop Debugging.
3. The data in your flat text file has now been imported.

After it has successfully run, we can check the destination table.

4. Flat file with varied columns with various data types:

4.1 Flat file:

The example below shows a sample file that uses a comma to delimit the columns and a carriage return / line feed (CR/LF) to delimit the rows.

TestValue1,100,12/01/2007
TestValue2,200
TestValue3,300,12/01/2007
TestValue4,400,12/01/2007
TestValue5,500
TestValue6,600,12/01/2007
TestValue7,700,12/01/2007
TestValue8,800
TestValue9,900,12/01/2007
TestValue0,1000,12/01/2007

SSIS does not handle this scenario easily, due to the way it parses flat files. It parses by looking for the next column delimiter, and the row delimiter is just the column delimiter for the last defined column. So, on the second line of the sample file, SSIS is looking for a comma instead of a CR/LF. As a result, the third row ends up combined with the second row, and we get something that looks like this:
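The merging of rows can be reproduced with a simplified model of the parsing rule described above. The Python sketch below is an assumption-laden illustration, not the actual SSIS parser: parse_fixed_columns is a hypothetical function that reads every column except the last up to the next comma, and the last column up to the next CR/LF.

```python
def parse_fixed_columns(text, column_delimiter=",", row_delimiter="\r\n", columns=3):
    """Read `columns` fields per row: each field except the last ends at the next
    column delimiter, and the last field ends at the next row delimiter."""
    rows = []
    pos = 0
    while pos < len(text):
        row = []
        for index in range(columns):
            marker = row_delimiter if index == columns - 1 else column_delimiter
            end = text.find(marker, pos)
            if end == -1:
                # No delimiter left: take whatever remains of the input.
                row.append(text[pos:])
                pos = len(text)
            else:
                row.append(text[pos:end])
                pos = end + len(marker)
        rows.append(row)
    return rows


# First three lines of the sample file from section 4.1.
sample = (
    "TestValue1,100,12/01/2007\r\n"
    "TestValue2,200\r\n"
    "TestValue3,300,12/01/2007\r\n"
)
rows = parse_fixed_columns(sample)
for row in rows:
    print(row)
```

Under this model the three input lines produce only two rows: the second column of the second row swallows the line break and the start of the third line.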

4.2 Flat file connection Manager:

Now we define a flat file connection manager that treats each row as one column. We use the row delimiter (CR/LF) as the column delimiter. The flat file should now preview like this:

4.3 Script Transformation Task:

Next, in a data flow, we've added a flat file source that uses the connection manager. It is connected to a Script Component that is set as a Transform. Column0 is selected as an input column.

In the Inputs and Outputs area, we've added three columns, for the three real columns in our flat file, and set the data types appropriately.

    Public Class ScriptMain
        Inherits UserComponent

        Private columnDelimiter() As Char = CType(",", Char())

        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
            Dim rowValues As String()
            rowValues = Row.Column0.Split(columnDelimiter)

            If rowValues.GetUpperBound(0) < 2 Then
                'Row is not complete - Handle error
                Row.Name_IsNull = True
                Row.Value_IsNull = True
                Row.Date_IsNull = True
            Else
                Row.Name = rowValues.GetValue(0).ToString()
                Row.Value = Convert.ToInt32(rowValues.GetValue(1))
                Row.Date = Convert.ToDateTime(rowValues.GetValue(2))
            End If
        End Sub
    End Class

The columnDelimiter variable holds the column delimiter, a comma in this case. The Split function parses the value contained in Column0 (the single column defined in the connection manager) and returns an array containing one element for each column in it. Since we're expecting three columns, we check whether the array contains all three (.NET uses 0-based array indexes, so the upper bound of a complete row is 2). If columns are missing, we have an error that needs to be handled. In this example we simply set all the column values to NULL. The error handling could be enhanced by redirecting the rows to an error output.

Finally, if the correct number of columns is present, we set the output columns created earlier to the values from the array. Notice that the Convert calls are necessary to make sure each value has the correct type.
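The split, check and convert steps can be tried on the sample rows outside SSIS. The following Python sketch is only an illustration of that logic; parse_row is a hypothetical name, None stands in for NULL, and the month/day/year date format is an assumption, since the source does not state how 12/01/2007 should be read.

```python
from datetime import datetime

COLUMN_DELIMITER = ","
DATE_FORMAT = "%m/%d/%Y"  # assumption: sample dates are month/day/year


def parse_row(line):
    """Split one raw line and convert each column to its target type.

    Incomplete rows get None in every column, mirroring the NULL handling above.
    """
    values = line.split(COLUMN_DELIMITER)
    if len(values) < 3:
        return {"Name": None, "Value": None, "Date": None}
    return {
        "Name": values[0],
        "Value": int(values[1]),
        "Date": datetime.strptime(values[2], DATE_FORMAT),
    }


for line in ["TestValue1,100,12/01/2007", "TestValue2,200"]:
    print(parse_row(line))
```

Note that a two-column row such as TestValue2,200 loses its name and value as well as its date under this rule, which is one reason to consider an error output instead.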

4.4 SQL Destination:

Add a SQL Destination to load all the records from the text file.

The SQL table is: CREATE TABLE Test (TestNumber INT, TestName VARCHAR(50), TestDate DateTime)

4.5 Execution:

To run this package:
1. Click on the Debug menu and click Start Debugging.
2. After the package has completed running, on the Debug menu click Stop Debugging.
3. The data in your flat text file has now been imported.

4.6 Using Derived transformation Task:

4.6.1 Adding Conditional split task:

The Conditional Split determines what type of row we're dealing with and passes it to the appropriate output. It does this by checking how many delimiters appear in the row. The FINDSTRING function returns 0 if the specified string is not found, or if it occurs fewer times than the number of occurrences specified.
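The routing rule can be sketched in Python to make the FINDSTRING behaviour concrete. This is an illustration only: findstring is a hypothetical re-implementation of the 1-based semantics described above, and the output names returned by route are made up for the example.

```python
def findstring(text, search, occurrence):
    """Return the 1-based position of the nth occurrence of search, or 0 if absent."""
    position = -1
    for _ in range(occurrence):
        position = text.find(search, position + 1)
        if position == -1:
            return 0
    return position + 1


def route(line, delimiter=","):
    """Pick an output for the row based on how many delimiters it contains."""
    if findstring(line, delimiter, 2) > 0:
        return "ThreeColumns"  # hypothetical output name
    if findstring(line, delimiter, 1) > 0:
        return "TwoColumns"  # hypothetical output name
    return "Unmatched"


for line in ["TestValue1,100,12/01/2007", "TestValue2,200"]:
    print(line, "->", route(line))
```

Checking for the second delimiter first matters, because every three-column row would also satisfy the test for a first delimiter.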

4.6.2 Adding Derived Transformation task:

Now that we know how many columns we need to parse, we use a Derived Column transform to split the columns from the main string.

The expression for the first column looks for the first occurrence of the delimiter.

SUBSTRING(Line,1,FINDSTRING(Line,",",1) - 1)

For the second column, the expression is a bit more complicated. It has to start from the first delimiter and stop at the second. Since the SUBSTRING function needs a length, the expression calculates the difference between the positions of the first and second delimiters. In addition, it casts the result to an integer.

(DT_I4)(SUBSTRING(Line,FINDSTRING(Line,",",1) + 1,FINDSTRING(Line,",",2) - FINDSTRING(Line,",",1) - 1))

Finally, the third expression finds the second delimiter and gets the rest of the string. I'm taking a shortcut by using the full length of the line as the length argument, since if the length argument exceeds the remaining length of the string, the rest of the string is returned.

(DT_DBTIMESTAMP)(SUBSTRING(Line,FINDSTRING(Line,",",2) + 1,LEN(Line)))
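The three expressions can be traced on one sample row with a small Python sketch. It is an illustration under stated assumptions: findstring and substring are hypothetical helpers that imitate the 1-based FINDSTRING and SUBSTRING semantics, the casts are approximated with int and datetime.strptime, the month/day/year date format is assumed, and only three-column rows are handled.

```python
from datetime import datetime


def findstring(text, search, occurrence):
    """Return the 1-based position of the nth occurrence of search, or 0 if absent."""
    position = -1
    for _ in range(occurrence):
        position = text.find(search, position + 1)
        if position == -1:
            return 0
    return position + 1


def substring(text, start, length):
    """1-based substring; a length past the end simply returns the rest."""
    return text[start - 1:start - 1 + length]


def derive_columns(line):
    """Apply the three Derived Column expressions to a three-column row."""
    first = findstring(line, ",", 1)
    second = findstring(line, ",", 2)
    name = substring(line, 1, first - 1)
    value = int(substring(line, first + 1, second - first - 1))
    date = datetime.strptime(substring(line, second + 1, len(line)), "%m/%d/%Y")
    return name, value, date


print(derive_columns("TestValue3,300,12/01/2007"))
```

For this row the first comma is at position 11 and the second at position 15, so the second column is the 3 characters starting at position 12.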

Finally, a Union All is used to combine the data back into a single flow.

Technically, this could be accomplished without the Conditional Split. However, the logic required for the Derived Column transform would be much more complex, as each column parsing expression would have to be wrapped in a conditional expression to see if that column actually existed for the row.

4.6.3 SQL Destination:

Add a SQL Destination to load all the records from the text file.

The SQL table is: CREATE TABLE Test (TestNumber INT, TestName VARCHAR(50), TestDate DateTime)

4.6.4 Execution:

To run this package:
1. Click on the Debug menu and click Start Debugging.
2. After the package has completed running, on the Debug menu click Stop Debugging.
3. The data in your flat text file has now been imported.

After it has successfully run, we can check the destination table:

SELECT * FROM Test

5. Text files with different Column delimiters in different rows:

The following is an example of a text file with different column delimiters:

P0001,Product 1
P0002,Product 2
P0003,Product 3
P0004;Product 4
P0005,Product 5
P0006,Product 6
P0007;Product 7
P0008,Product 8
P0009;Product 9
P00010,Product 10

We can see that the two columns are separated by either a comma or a semicolon, depending on the row.

5.1 Adding Flat file connection Manager:

After creating a package project and new package in the project, create a connection manager named Text File. In the General section of the connection manager, you need to give the path for the text file.

Next, open the Columns page and select {CR}{LF} as the row delimiter.

5.2 Adding Script Task:

The next task is to configure the Script component. There are three pages to configure: Input Columns, Inputs and Outputs, and Script. On the Input Columns page, we can set the Output Alias to Line.

As we can see on the input columns tree node, there is only one element, which is Line; it was defined on the Input Columns page. Next, select Output Columns and add two columns, Code and Description, by clicking the Add Column button. Assign the correct data type and length to each column; in this case I have selected string [DT_STR] as the data type and 50 as the length. An important part of the Script component's configuration is the Script page. Select the Script page and click the Design Script button. This opens the Microsoft Visual Studio environment, where we add the necessary .NET code.

    Public Class ScriptMain
        Inherits UserComponent

        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
            Dim strRow As String
            Dim strColSeperator As String
            Dim rowValues As String()

            strRow = Row.Line.ToString()

            'Identify the column separator used by this row
            If strRow.Contains(",") Then
                strColSeperator = ","
            ElseIf strRow.Contains(";") Then
                strColSeperator = ";"
            Else
                'No known separator in this row - leave the outputs NULL
                Row.Code_IsNull = True
                Row.Description_IsNull = True
                Return
            End If

            rowValues = Row.Line.Split(CChar(strColSeperator))
            Row.Code = rowValues.GetValue(0).ToString()
            Row.Description = rowValues.GetValue(1).ToString()
        End Sub
    End Class

The code above is all we need to add. The Contains function identifies the column separator for the row. Then, by calling the Split function with that separator, we can separate the two columns.
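The per-row delimiter detection can be checked against the sample file with a short Python sketch. It is an illustration rather than the package code; split_code_description is a hypothetical name, and returning a pair of None values for a row with no known separator is an assumption added for the example.

```python
def split_code_description(line):
    """Detect the separator used by this row, then split into code and description."""
    if "," in line:
        separator = ","
    elif ";" in line:
        separator = ";"
    else:
        # Assumption: rows without a known separator yield empty (None) columns.
        return None, None
    values = line.split(separator)
    return values[0], values[1]


for line in ["P0001,Product 1", "P0004;Product 4", "P00010,Product 10"]:
    print(split_code_description(line))
```

Because the comma is tested first, a row that contains both characters is split on the comma, so this approach only suits files where each row uses exactly one of the two delimiters.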

5.3 Execution:

Here we can see the output with a data viewer. The data was separated correctly despite containing different column delimiters.

6. Text files with different Column delimiters in same row:

The following is an example of a text file with different column delimiters in the same row:

P1;P0001,Product 1
P2;P0002,Product 2
P3;P0003,Product 3
P4;P0004,Product 4
P5;P0005,Product 5
P6;P0006,Product 6
P7;P0007,Product 7
P8;P0008,Product 8
P9;P0009,Product 9
P10;P00010,Product 10

The first two columns are separated by a semicolon, while the second and third columns are separated by a comma.

6.1 Adding Flat file connection Manager:

After creating a package project and new package in the project, create a connection manager named Text File. In the General section of the connection manager, you need to give the path for the text file.

Next, open the Columns page and select {CR}{LF} as the row delimiter.

6.2 Adding Script Task:

The next task is to configure the Script component. There are three pages to configure: Input Columns, Inputs and Outputs, and Script. On the Input Columns page, we can set the Output Alias to Line.

We also need to add one more output column, ColumnDelimeter, which receives the value that precedes the semicolon.

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim rowValues1 As String()
        Dim rowValues2 As String()

        'Split on the semicolon first: prefix, then the rest of the row
        rowValues1 = Row.Line.Split(CChar(";"))
        Row.ColumnDelimeter = rowValues1.GetValue(0).ToString()

        'Split the rest of the row on the comma: code, then description
        rowValues2 = rowValues1.GetValue(1).ToString().Split(CChar(","))
        Row.Code = rowValues2.GetValue(0).ToString()
        Row.Description = rowValues2.GetValue(1).ToString()
    End Sub

6.3 Execution:

Here we can see the output with a data viewer. The data was separated correctly despite containing different column delimiters.
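The nested split from section 6.2 can also be reproduced outside SSIS. The Python sketch below is an illustration only; split_mixed_row is a hypothetical name, and unlike the script it limits each split to the first delimiter, which is an assumption made so that a description containing a comma would stay in one piece.

```python
def split_mixed_row(line):
    """Split on the first semicolon, then split the remainder on the first comma."""
    prefix, rest = line.split(";", 1)
    code, description = rest.split(",", 1)
    return prefix, code, description


for line in ["P1;P0001,Product 1", "P10;P00010,Product 10"]:
    print(split_mixed_row(line))
```

A row that lacks either delimiter raises a ValueError here, so malformed rows would need their own handling before loading.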
