Data Provider Built in Excel

Table of Contents

Two terms you may come across when dealing with data analysis are data source and data provider. A data source is where the data comes from and a data provider is a solution that maps the fields of the data source and puts it into a format that can be used for analysis and reporting. Transforming data from a source into a format that can be used as a provider can be challenging and often requires expensive and complex middle wear applications. In this case, part of the challenge was that 1) a long term solution is on the way, but probably 12 months away from final delivery; and 2) the data had to be passed to regional offices for them to assess their data and be able to cross compare to other regions where necessary.

Microsoft Access and Excel provide excellent quick development and easy deployment capabilities and licensing existed for both, but Access was discounted as staff are not familiar with its functionality and the IT team baulk as deployment of Access databases. So an Excel solution was selected.

Transforming from source to provider

Ok, so what is the key difference between a data source and a data provider?

DataProvider

In this case, the data source was a series of different tables in a workbook. The table format is great for doing data entry, but how can we report against one provider, let alone twenty. The first step is to recognise that each value in the table is a data point that requires a set of parameters to identify it.

DataSource1
In the example above, consider the value of “72829”. In this format we know who it relates to as it is in the workbook from Vendor A. We know the type of vehicle, the period, the date and the KPI it relates to, but take that one value and put it in a table with a bunch of other values, then mix that with values from other vendors, and you will quickly lose visibility of what that value actually means. What we need to do is transform the data from this data entry format and put it into a tabular format.

DataProvider2
So if we look at the tabular format, we can see we have our values on the far right, but now, with each value we have all the information that defines that value. The vendor, the area, the period, scope and KPI. Once the data is in this format, we can happily mix it with data from other vendors and we have full flexibility to build pivot tables, use formulas and any other Excel functionality we choose the really dig in and understand the data.

 First we find the workbooks

The format and structure of the the workbooks sent out was fully locked down, which makes life a lot easier when you need to import data. The one thing that is variable and out of our control, is the naming of the workbooks. So I had to build some flexibility into this part of the solution. I also wanted to give the business the flexibility to import from what ever folder they choose.

This is achieved with two separate pieces of functionality. the first to choose the folder, the second to get the files from the folder. To choose the folder, I used the file picker function.

Application.FileDialog(msoFileDialogFilePicker)

This is a little odd given that we are in fact only selecting a directory and as such could have used the “msoFileDialogFolderPicker” option, but early testing with the users indicated that they preferred to be able to see the files that are in the folder, which is not possible with the folder picker version.

When the user selects a file through the dialogue, I simply strip the file name from the full path.

intChar = InStrRev(strFilePath, "")
strPath = Left(strFilePath, intChar)

The InStrRev function looks for a given character, in this case the backslash, from the right hand end of the string and returns it’s position in the string as a numeric value. A trap for the unwary here is that despite the fact that it searches from the right, it gives the displacement from the left. For example, in the string “FolderFile”, even though the function would start looking from the “e” at the right side of the string and scan left until it found the “”, it would return the value of 7 as it is seven characters from the left. This one has caught me out on more than one occasion.

I captured the position of the backslash in the variable intChar and then used that in the Left function to strip off everything past the last backslash, which is the file name.

I then write the value to the main worksheet for the importing process so we can recall it and use it as desired. I also write a hyperlink so the user can quickly access the directory to add files or just to check them.

'write the file path to the reports sheet
With Worksheets("Reports")
   .Range("R_Path").Value = strPath
   .Hyperlinks.Add Anchor:=Range("R_Path"), Address:=strPath, SubAddress:="", TextToDisplay:=strPath
End With

The next bit of functionality is to find those workbooks in the director, and I did a couple of extra value adds with this.

I used the Dir function to get the file names, then I wrote the file names to the main worksheet for two reasons. One is to allow the user to confirm that we are seeing all the files in the directory, the second is that I added a cell next to each file name with a validation drop down that allows them to select if the workbook should be imported.

Sub Write_File_Names(strPath As String)
    Dim strFile As String
    Dim intRow As Integer
   
    'load the files from the import directory
    strFile = Dir(strPath & "*.xl*")
    With Sheets("Reports")
        .Unprotect Password:=g_strPWD
        'set the first validation option to Yes
        .Range("F30").Value = "Yes"
         'then delete all the other lines to clean things up
        .Range("B31:F100").Delete Shift:=xlUp
        
       'Set the first row for writing the names
        intRow = 30

       'While we keep finding files in the directory
        Do While strFile <> ""
            'write the file name    
            .Cells(intRow, 2).Value = strFile
            'and copy down the validation
            .Cells(30, 6).Select
            Selection.Copy
            .Cells(intRow, 6).Select
            ActiveSheet.Paste
           
            intRow = intRow + 1
        ' Use Dir to get the next file name.
            strFile = Dir()
        Loop
        .Protect Password:=g_strPWD
    End With
End Sub

So now we know where to find the files and which files we want to import.

Opening the Workbook

The getting of data involves the use of several nested loops:

Loop through all workbooks
Loop through all worksheets
Loop through each data field

Looping through the workbooks uses the same code as described earlier for getting the file names. This time though there are a couple of extra steps.

First we check if the workbook should be included in the import process, based on the validation selector next to the file name on the main sheet. If we don’t want to include it, I used a GoTo function to jump to the next worksheet. Because we don’t know where to find the file name in the list, the test for whether to include the workbook requires a loop through the list to find the workbook name and then check if the validation selector is set to Yes or No. It then passes the result back as a Boolean true or false.

'check if this workbook should be included in the import
If Include_Import(m_wbDBWorkBk, m_strFile) = False Then
     'if not, go to the next workbook
      GoTo Next_Workbook
End If

Function Include_Import(wsD As Workbook, strWS As String) As Boolean
    'This function checks the lis to determine if the workbook should be included
    Dim intRow As Integer
   
    intRow = 7
    Include_Import = False
   
    With wsD.Sheets("Reports")
        'check if the workbook should be imported
        intRow = 30
        While .Cells(intRow, 2).Value <> ""
            'scan the list to find the right company and are combination
            If (.Cells(intRow, 2).Value = strWS) Then
               'then check if it should be imported
               If .Cells(intRow, 6).Value = "Yes" Then
                    Include_Import = True
                    Exit Function
                Else
                    Include_Import = False
                    Exit Function
               End If
            End If
            intRow = intRow + 1
        Wend
    End With
End Function

Next we need to check if the workbook is in fact a valid workbook. There is potential for a vendor to send in the incorrect workbook, so rather than have the user open and check every workbook, the code does it for them. This is done simply by checking for information that should be in a specific place if the workbook is correct. In this case I used the company, area and From Date information that should appear on the Front Sheet of the imported workbook.

'Get the company and area info
 m_strCompany = m_wbImpWorkBk.Worksheets("Front Page").Range("D7")
 m_strArea = m_wbImpWorkBk.Worksheets("Front Page").Range("D9")
 m_dteStartDate = m_wbImpWorkBk.Worksheets("Front Page").Range("E13")
           
 'check if we can find the right information on the front Page sheet
 If m_strCompany = "" And m_strArea = "" And m_dteStartDate = #1/1/1900# Then
  'If we can't, it is possible that the import worbook is not the right template
        varResponse = MsgBox("The import workbook " & m_wbImpWorkBk.Name & " does not appear to be the right template." & vbCrLf & vbCrLf & _
         "Do you want to skip this workbook?", vbYesNo, "Incorrect Format")
          If varResponse = vbYes Then
                GoTo Next_Workbook
          End If
  End If

The code does give the end user the option to continue with the import, but it is likely to fail anyway. It is possible that one piece of information is missing, so this gives the user the chance to check the workbook, fix the missing data and continue with the import.

Then we need to assess if the workbook has already been imported. This is achieved by comparing the company name, area and the from date to information that is already recorded in the destination workbook.

'Check if this workbook has already been imported
boolAlreadyImported = Check_If_Already_Imported(m_strCompany, m_strArea, m_wbDBWorkBk, m_dteStartDate)
         If (boolAlreadyImported = True) Then
                varResponse = MsgBox("Workbook " & m_wbImpWorkBk.Name & " for " & m_strCompany & " covering period from " & m_dteStartDate & " appears to have already been imported." & vbCrLf & vbCrLf & _
      "Do you want to overwrite the data for this workbook?", vbYesNo, "Duplicate Import")
      If varResponse = vbYes Then
            Call Delete_Old_Data
      Else
            GoTo Next_Workbook
      End If
End If

Where it is found that the workbook has been imported already, the user is given the option to either skip this workbook and move on to the next one, or to overwrite the old data. This is achieved by finding all records in the destination workbook that align with this workbook to be imported and deleting it before then importing the new data.

Working through the worksheets

Now we have a workbook open, we’ve confirmed it is the right format, we want to include it and either we haven’t imported it before or we want to overwrite the old date. Next we have to work our way through the worksheets and import the data.

The basic line of code is simple, using a For Each loop to step through each worksheet in the source workbook. I also named the destination worksheets in the data provider workbook the same to simplify matching the source to the destination.

For Each m_SummarySheet In .Worksheets

The data source workbooks contain 21 worksheets and we only want to import data from 17 of them, and each worksheet for collecting the data is a different format, so it is necessary to identify the worksheet and select the appropriate method of importing the data. Since our data collection workbook provides a controlled environment where we know the names of all the worksheets, I simply used a Select Case model

 'find the right import process for each worksheet
 Select Case m_strWorksheet
     Case "Worksheet Name 1"
          'check if we should import it
          If .Worksheets("Reports").Cells(7, g_intSD).Value = "No" Then GoTo Next_Worksheet
          ' Set a variable to the sheet you want to read the data from.
          Set m_SourceSheet = m_wbImpWorkBk.Worksheets(m_strWorksheet)
          Call Get_Worksheet Name_1
                       
     Case "Worksheet Name 2"
          'check if we should import it
          If .Worksheets("Reports").Cells(7, g_intFC1).Value = "No" Then GoTo Next_Worksheet
          ' Set a variable to the sheet you want to read the data from.
          Set m_SourceSheet = m_wbImpWorkBk.Worksheets(m_strWorksheet)
          Call Get_Worksheet Name_2

There are a couple of thing’s going on in this code. 1) Clearly the Select Case is identifying the name of the worksheet and passing the matching ones through to the code for the right Case. 2) the first thing that is done after that is a check to see if the worksheet should be included, this functions in the same way as the inclusion or exclusion of the workbook, but in this case I know where the value will be stored, so it is a direct reference with Cells”(7, g_intSD)” with no need for a loop to find it. 3) Once we ascertain that the worksheet should be included, we set a module variable to track the worksheet name for further processing, then 4) we pass it off to the appropriate import sub routine.

Getting the Data

This is where the bulk of the effort went in, each worksheet is in a different format, so an import routine had to be written for each one. The basic functionality is that each data point must be stepped through sequentially.

WalkThroughPattern

The basic code for doing something like that is something like:

For m_lngSRow = 9 To 10 Step 1
        For m_intSCol = 3 To 5 Step 1
               <<get data here>>
        Next m_intSCol
Next m_lngSRow

You’ll note there is a double loop here, one looping through the rows using the m_lngSRow variable, the other stepping through the columns using the “m_intSCol” variable. Each steps one row or column after each loop and it fits a nice tabular structure such as the one we are trying to create. But note in the example source data provided, there are two row gaps between each row of data values and then a four row gap when we jump to the next set of data for Vehicle Type 2. To compensate for this, I took a rather inelegant, but effective path, I employed an If statement to test the values of the 3rd column and the data point itself.

If IsNumeric(m_SourceSheet.Cells(m_lngSRow, m_intSCol).Value) And m_SourceSheet.Cells(m_lngSRow, 4).Value Like "KPI*"

By doing this, I can step through every row, but only capture the values I actually want. Column 4 must have the character string “KPI” in it and the value must be numeric before the IF statement will parse as true.

As I said, there are seventeen different worksheets to be imported and seventeen different formats. So there are seventeen different import models overcoming different challenges in the data structure. Some used row number identification to cause the row to jump to the next data set such as “IF m_lngSRow = 23 Then m_lngSRow = 27”. Others lent themselves to simple patterns, each had to be considered, defined and adjusted until the desirable outcome was achieved.

Writing the data provider format

The metadata surrounding each data point comes from different places. Some remain the same throughout the workbook, such as the company name, area and start date that come from the Front Page of the source workbooks. The rest of the metadata is unique for each worksheet data set. Capturing this data was another good challenge in managing the setting of variable values. Consider the example we have used so far.

Metadata Metadata
In this case we have the data point outlined in red and all the metadata (excluding company, area and from date) outlined in different colours for illustration. The Green outlined item changed when the data point row changed while the blue outlined item changed when the data point column changed. The orange item had to change each time the import moved to the next sub-table and the purple items only changed when the import moved to new major sections of the sub-tables.

Once the data points and metadata is captured, writing the actual data is straight forward.

For m_lngSRow = 18 To m_lngLastRow Step 1
            m_SummarySheet.Cells(m_lngDRow, 1).Value = m_strCompany 
            m_SummarySheet.Cells(m_lngDRow, 2).Value = m_strArea 
            m_SummarySheet.Cells(m_lngDRow, 3).Value = m_dteStartDate 
            m_SummarySheet.Cells(m_lngDRow, 4).Value = m_SourceSheet.Cells(m_lngSRow, 2).Value 
            m_SummarySheet.Cells(m_lngDRow, 5).Value = m_SourceSheet.Cells(m_lngSRow, 3).Value 
            m_SummarySheet.Cells(m_lngDRow, 6).Value = m_SourceSheet.Cells(m_lngSRow, 4).Value
            m_SummarySheet.Cells(m_lngDRow, 7).Value = m_SourceSheet.Cells(m_lngSRow, 5).Value
            m_SummarySheet.Cells(m_lngDRow, 8).Value = m_SourceSheet.Cells(m_lngSRow, 6).Value
            m_SummarySheet.Cells(m_lngDRow, 9).Value = m_SourceSheet.Cells(m_lngSRow, 7).Value 
            'test if any cells are empty
            Set m_rngRange = m_SummarySheet.Range("A" & m_lngDRow & ":H" & m_lngDRow)
            For Each m_clX In m_rngRange
                If IsEmpty(m_clX.Value) Then g_boolIncomplete = True
            Next m_clX
            m_lngDRow = m_lngDRow + 1
    Next m_lngSRow

Note here as we loop through the rows of data points, we write the values to the destination sheet. In some cases writing the value of variables such as “m_SummarySheet.Cells(m_lngDRow, 1).Value = m_strCompany”, in other cases copying the value directly from the source sheet to the destination sheet “m_SummarySheet.Cells(m_lngDRow, 8).Value = m_SourceSheet.Cells(m_lngSRow, 6).Value”.

Note also that the process carries out two tests during the import. The first is to walk through each line after import and check if there is data missing, if there is it sets a Boolean flag for incomplete data.

For Each m_clX In m_rngRange
       If IsEmpty(m_clX.Value) Then g_boolIncomplete = True
Next m_clX

The other test (not shown in the code snippet) is simply to assess if the record count has gone up, if it hasn’t then there was no data imported and a flag is set to indicate there were no records.

If m_lngDRow = m_lngNoRecordCount Then g_boolNoRecords = True

These two flags are used when the import report is written to indicate if a worksheet was in complete or data was missing altogether. This significantly cuts down on the review time for the user as they don’t have to check every worksheet in every workbook.

What is the benefit?

Why take the time to build something like this, and it does take quite a bit of time I assure you. Well there are a few benefits.

  1. The most immediate is that this solution can import, reformat and check every data point in less than an hour. It takes that long because one worksheet alone has over 11,000 data points, without including that sheet, the rest of the sheets import in just 4 minutes.
  2. The quality of data will benefit through this more thorough testing. to ensure there isn’t any missing data.
  3. Prior to implementing this solution, only two reports were generated out of the data and it required hours of work to open each workbook, copy specific data points into another workbook then copy the result to a Word document then save the report in PDF format before reporting. The solution can now complete that same cycle, producing four reports, in 12 seconds and there is no risk or data entry error.
  4. Most of the data provided was never reported on as it was inaccessible, moving the data into a data provider format now makes the reporting easy and limitless.

While this type of work is not the core business of PME4U, the use of Office automation techniques has allowed us to provide our clients with robust and effective solutions to their data capture, management and analysis challenges and it is a key aspect of being effective in developing procurement strategies and tactical implementation.

For more information, feel free to leave a comment below, join our forums or contact us directly via the Contact PME4U web page.