Thursday, June 16, 2011

SSIS - Excel import with unknown number of columns

I have a bunch of xls files I want to import into a table. I don't know the name of the files, the name of the first Sheet, nor the amount of columns. So here is at least one way to do it.

Step 1: Foreach Loop Container
I'll use the Foreach Loop Container in the Control Flow to pick up each unknown files.

Step 1a: Add the Foreach Loop Container


Step 1b: Select the folder and set the Files to *.xls. I used an expression to set the directory so I can set that dynamically. I'm pulling it from a SQL table.


Step 1c: In the Variable Mappings, set a Variable to the 0 Index. This will be the full URI to the file.


Step 2: Get the name of the first Sheet to know what to pull in.

Step 2a: Add a Script Task inside the Foreach Loop Container in Control Flow.


Step 2b: Add Variables to be used in the script


*** UPDATE *** : If you want to add XLSX to this also, then you will need to download and install this driver from Microsoft. I've updated the code (2c and 3g) to include reference to this driver for XLSX files.

Step 2c: Add this script to get the first Sheet name.
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.Xml;
using System.Data.OleDb;
using System.Diagnostics;

namespace ST_8f8a8b14140a4ded9dcbbf6aa5b97090.csproj
{
[System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
{

#region VSTA generated code
enum ScriptResults
{
Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
};
#endregion

public void Main()
{
string excelFile;
string connectionString;
OleDbConnection excelConnection;
DataTable tablesInFile;
int tableCount = 0;
string currentTable;

try
{
Trace.WriteLine(string.Format("Sheet Name: {0} File Location: {1}", Dts.Variables["SheetName"].Value, Dts.Variables["ImportFilePath"].Value));
string extension = Path.GetExtension(excelFile);
extension = extension.Replace(".", "");

if (extension.ToLower() == "xlsx")
{
connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;" +
"Data Source=" + excelFile + ";Extended Properties=Excel 12.0";
}
else
{
connectionString = "Provider=Microsoft.Jet.OLEDB.4.0;" +
"Data Source=" + excelFile + ";Extended Properties=Excel 8.0";
}
excelConnection = new OleDbConnection(connectionString);
excelConnection.Open();
tablesInFile = excelConnection.GetSchema("Tables");
tableCount = tablesInFile.Rows.Count;
if (tableCount > 0)
{
currentTable = tablesInFile.Rows[0]["TABLE_NAME"].ToString();

Dts.Variables["SheetName"].Value = currentTable;

Dts.TaskResult = (int)ScriptResults.Success;

Trace.WriteLine(string.Format("Sheet Name: {0}",Dts.Variables["SheetName"].Value));
}
else
{
Dts.Variables["ErrorCode"].Value = "Excel Sheet Lookup Failure. -- No Sheets Found.";
Dts.TaskResult = (int)ScriptResults.Failure;
}
}
catch (Exception e)
{
Dts.Variables["ErrorCode"].Value = "Excel Sheet Lookup Failure. -- " + e.Message;
throw new Exception("Excel Sheet Lookup Failure.", e);
}

}
}
}



Step 3: Create a Data Flow to handle importing the data in.

Step 3a: Add the data flow task. Then enter that data flow.


Step 3b : Add a Script Component - I finally went with the Script Component because I couldn't get an Excel Source Task to work consistantly. If the file had formating or had less columns than I preset in the Excel Source Task, I'd get issues.


Step 3c: Select Source as the script type


Step 3d: Add Variables with the file location and sheet name


Step 3e: No Connection Manager is needed if you are going to work with the imported data in the data flow before inserting it into a destination Connection Manager, like ADO SQL connection.


Step 3f: Add Output Fields - I added 25 String[DT_STR] fields 255 characters long.


*** UPDATE *** : As mentioned in the comments, if the code gets stuck on the excelReader.Read, then you may need to move the call to GetDataFromExcelToReader(Variables.ImportFilePath) from the PreExecute method to the CreateNewOutputRows method, just before the while loop.

Step 3g: Add this script
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO;
using System.Data.SqlClient;
using System.Data.OleDb;
using System.Diagnostics;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
private OleDbDataReader excelReader;
private OleDbConnection excelConnection;
private OleDbCommand excelCommand;

public override void PreExecute()
{
base.PreExecute();

// Open connection
GetDataFromExcelToReader(Variables.ImportFilePath);
}

public override void PostExecute()
{
base.PostExecute();

excelReader.Close();

excelConnection.Close();
}

private void GetDataFromExcelToReader(string p_strFileName)
{

string l_strConnectionString;

if (File.Exists(p_strFileName))
{
string extension = Path.GetExtension(p_strFileName);
extension = extension.Replace(".", "");

if (extension.ToLower() == "xlsx")
{
l_strConnectionString = "Provider=Microsoft.ACE.OLEDB.12.0;" +
"Data Source=" + p_strFileName + ";Extended Properties=\"Excel 12.0;HDR=NO;IMEX=1\";";
}
else
{
l_strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;" +
"Data Source=" + p_strFileName + ";Extended Properties=\"Excel 4.0;HDR=NO;IMEX=1\";";
}
excelConnection = new OleDbConnection(l_strConnectionString);
excelConnection.Open();
excelCommand = excelConnection.CreateCommand();
excelCommand.CommandText = "SELECT * FROM [" + Variables.SheetName.ToString() + "]";
excelCommand.CommandType = CommandType.Text;
excelReader = excelCommand.ExecuteReader();

}
}

public override void CreateNewOutputRows()
{
while (excelReader.Read())
{
Output0Buffer.AddRow();
Output0Buffer.F1 = excelReader.FieldCount > 0 ? excelReader[0].ToString() : "";
Output0Buffer.F2 = excelReader.FieldCount > 1 ? excelReader[1].ToString() : "";
Output0Buffer.F3 = excelReader.FieldCount > 2 ? excelReader[2].ToString() : "";
Output0Buffer.F4 = excelReader.FieldCount > 3 ? excelReader[3].ToString() : "";
Output0Buffer.F5 = excelReader.FieldCount > 4 ? excelReader[4].ToString() : "";
Output0Buffer.F6 = excelReader.FieldCount > 5 ? excelReader[5].ToString() : "";
Output0Buffer.F7 = excelReader.FieldCount > 6 ? excelReader[6].ToString() : "";
Output0Buffer.F8 = excelReader.FieldCount > 7 ? excelReader[7].ToString() : "";
Output0Buffer.F9 = excelReader.FieldCount > 8 ? excelReader[8].ToString() : "";
Output0Buffer.F10 = excelReader.FieldCount > 9 ? excelReader[9].ToString() : "";
Output0Buffer.F11 = excelReader.FieldCount > 10 ? excelReader[10].ToString() : "";
Output0Buffer.F12 = excelReader.FieldCount > 11 ? excelReader[11].ToString() : "";
Output0Buffer.F13 = excelReader.FieldCount > 12 ? excelReader[12].ToString() : "";
Output0Buffer.F14 = excelReader.FieldCount > 13 ? excelReader[13].ToString() : "";
Output0Buffer.F15 = excelReader.FieldCount > 14 ? excelReader[14].ToString() : "";
Output0Buffer.F16 = excelReader.FieldCount > 15 ? excelReader[15].ToString() : "";
Output0Buffer.F17 = excelReader.FieldCount > 16 ? excelReader[16].ToString() : "";
Output0Buffer.F18 = excelReader.FieldCount > 17 ? excelReader[17].ToString() : "";
Output0Buffer.F19 = excelReader.FieldCount > 18 ? excelReader[18].ToString() : "";
Output0Buffer.F20 = excelReader.FieldCount > 19 ? excelReader[19].ToString() : "";
Output0Buffer.F21 = excelReader.FieldCount > 20 ? excelReader[20].ToString() : "";
Output0Buffer.F22 = excelReader.FieldCount > 21 ? excelReader[21].ToString() : "";
Output0Buffer.F23 = excelReader.FieldCount > 22 ? excelReader[22].ToString() : "";
Output0Buffer.F24 = excelReader.FieldCount > 23 ? excelReader[23].ToString() : "";
Output0Buffer.F25 = excelReader.FieldCount > 24 ? excelReader[24].ToString() : "";
}
}
}



Step 4: Use the data you've imported as needed. Massage it. Push it into SQL tables. Merge it with other data.

21 comments:

  1. This method don't work on Windows Server 2008 x64 and .xlsx files.
    it's block in this :
    public override void CreateNewOutputRows()
    {
    while (excelReader.Read())
    {
    //CODE
    }
    }

    ReplyDelete
  2. You are right. I was just looking at XLS files.

    But now you peaked my interest. So I've found what was needed to handle both XLS and XLSX files. You will see the update in the post and its code.

    - Mick

    ReplyDelete
  3. I found a solution myself :
    Just put GetDataFromExcelToReader(Variables.ImportFilePath) into CreateNewOutputRows() before while execution.

    ReplyDelete
  4. Hi again...

    I just want to mention i don't refer to excel connection string i do that :
    if (extension.ToLower() == "xlsx")
    {
    l_strConnectionString = "Provider=Microsoft.ACE.OLEDB.12.0;" +
    "Data Source=" + p_strFileName + ";Extended Properties=\"Excel 12.0;HDR=NO;IMEX=1\";";
    }
    else
    {
    l_strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;" +
    "Data Source=" + p_strFileName + ";Extended Properties=\"Excel 4.0;HDR=NO;IMEX=1\";";
    }

    I refer to part of code from CreateNewOutputRows() .. It just block on it (executer only one line and frezee) and solution found was to put GetDataFromExcelToReader(Variables.ImportFilePath) into CreateNewOutputRows() before while execution and excelReader.Close();,excelConnection.Close();,excelConnection.Dispose(); after while excecution.
    In other worlds I put all methods into CreateNewOutputRows()

    I hope you understand what i am talking..:)
    I am not sow good on English.

    ReplyDelete
  5. Thank you.

    I'm not sure what is different as I'm running SQL 2008R2 x64 and pulling in .XLS and .XLSX files. But if you're having a problem with it then I'm sure others will too.

    ReplyDelete
  6. I don't have Excel software on my system. Will this code work?

    ReplyDelete
  7. Dinesh - As long as you have the drivers mentioned (Microsoft.Jet.OLEDB.4.0 and Microsoft.ACE.OLEDB.12.0) on the server, then this should work fine. This is how the built in Importer within MS SQL handles it.

    ReplyDelete
  8. Thanks. I am having code like this...

    Output0Buffer.f1= oSheet.Range("B2").Value

    oSheet is not recognised becuase its not declared. How do I declare this or what is the way to get range function.

    ReplyDelete
  9. Dinesh - It looks like your oSheet is the same as my excelReader. You will see my excelReader declared just inside the ScriptMain class and then initialized at the end of the GetDataFromExcelToReader function. Finally, in CreateNewOutputRows, you'll see how I step through its data with the excelReader.Read().

    Hope this helps.

    ReplyDelete
  10. Hi Mick, I have different excel workbooks. Say, in workbook1-india,workbook2-uk. I want to read only these sheets having these names. How do I do with above coding. What I need to tweak?

    ReplyDelete
    Replies
    1. Under Step 2c, in the Main function, you'd want to change this code:

      if (tableCount > 0)
      {
      currentTable = tablesInFile.Rows[0]["TABLE_NAME"].ToString();
      Dts.Variables["SheetName"].Value = currentTable;
      Dts.TaskResult = (int)ScriptResults.Success;
      Trace.WriteLine(string.Format("Sheet Name: {0}",Dts.Variables["SheetName"].Value));
      }

      to something like this:

      for (int i=0; i < tableCount; i++)
      {
      currentTable = tablesInFile.Rows[i]["TABLE_NAME"].ToString();
      if currentTable = SheetNameIAmLookingFor
      {
      Dts.Variables["SheetName"].Value = currentTable;
      Dts.TaskResult = (int)ScriptResults.Success;
      Trace.WriteLine(string.Format("Sheet Name: {0}",Dts.Variables["SheetName"].Value));
      break;
      }
      }

      I switched to a for loop. I changed the zero to i in the Rows array. And you'll have to put something in for the SheetNameIAmLookingFor variable.

      You'll need to fix the error checking a little too.

      Delete
  11. Hi,
    Where do you put Script Component for Step 3b? On the Data Flow tab? But I don't have such component. I have BIDS 2008.

    Another problem. I have xslx file with filter and it turns out that 1st table is actually a filter with the name = _xlnm#_FilterDatabase and only the next is first sheet name. So, I think it is better to add a code in Get Excel Task step that looks for the first name than ends with '$' in a name.

    ReplyDelete
    Replies
    1. CFTForEver,

      If you double click on the Data Flow object you created in 3a, it will take you into that Data Flow. Then on the left, there will be a whole new list of objects. That is where you'll find the Script Component in 3b.

      Delete
  12. Hi i know this solution is good But does this actually load the Text values from excel or does this just load the internally stored data. By this what i mean is Suppose i have column with money values like $123 $1234 etc and date fields like 1-jan-2012 etc will it load 1-jan-2012 or 1\1\2012.I am doing this with C# and i am actually able to do this but the performance is very slow since its going to each field and grabbing the text value. I was hoping if there is a better way to do that.

    ReplyDelete
    Replies
    1. JIYO,

      I've done some testing with XSLX files. It pulls in what you see. So if you've typed a date into excel as 8/1/2012, but then you change the format so it looks like 1-Aug-2012, then the 1-Aug-2012 will be pulled in. If you change the format of that date field to 08/01/2012, then it will pull in 08/01/2012. The same thing is true for any of the data formatting. If you have it show the $ in the currency field, then the $ will be pulled in. If you format a field to have "-" in a SSN field, it is going to pull in with the "-"'s even if you didn't type the "-"'s.

      Hope that makes since.

      Delete
  13. Thanks mate. Your scripts helped a lot with tackling excel dogma.

    ReplyDelete
  14. Mick,

    I'm getting an error on the script. The variable InputFilePath contains a long UNC path, and I am getting the error when it tries to pass it into routine:

    private void GetDataFromExcelToReader(string p_strFileName)

    The error is:

    Script Component has encountered an exception in user code:
    Object reference not set to an instance of an object.

    This works with shorter paths, but I need to be able to send a full UNC and not a mapped drive letter.

    Thoughts>

    ReplyDelete
  15. It appears to be a limitation on the length allowed in the OleDBConnection, not the string varianble. This still fails when the UNC path is greater than 70 characters:

    excelConnection = new OleDbConnection(Variables.ConnString);

    I am going to work around with using a temp folder with a shoter path.

    ReplyDelete
  16. Hello, Can you please tell me if this will work if excel has more than 256 columns?

    ReplyDelete
  17. HI, I am failing at the Script Task (" get excel tables ") and the error is "DTS Script task : Run Time error , Cannot load script for execution." can you please help me out

    ReplyDelete
  18. At the end of public override void CreateNewOutputRows() you forgot to close the connection ( excelConnection.Close();)

    It gave me a hell of a problem, because I was unable to delete the processed file

    ReplyDelete