Sunday, April 23, 2017

Loading files with dynamic names in ODI

Suppose we have to load files named like ‘HR_YYYYMMDD.XML’, where YYYYMMDD represents the date of the file formation which values are differentiate within executions.

We would pass variable as a filename in Topology which will be set dynamically depending on start-up parameters.

Before we begin, make sure you have already created appropriate schema in Topology and reverse engineered required data models from these files.

Step 1
Create a variables ‘FILENAME’, ‘DIRECTORY’ in a project named HR and pass them into JDBC Url of data server in Topology.
For instance:
jdbc:snps:xml?f=#HR.DIRECTORY/#HR.FILENAME&re=Employees&ro=true&case_sens=true&dod=true&standalone=true

Create a variables for date bounds: ‘DATE_BEGIN’, ‘DATE_END’, and for run id ‘RID’.

Create a new procedure which would list files in directory, select ones with names between specified date range, and write those lines into text file.
We would choose Jython as a Target Technology for this task.
In the Details view of a procedure add a new step, for example:
import sys
import os
import fnmatch
from datetime import date, timedelta, datetime

date_begin = datetime.strptime('#DATE_BEGIN', '%Y%m%d')
date_end   = datetime.strptime('#DATE_END', '%Y%m%d')
date_delta = date_end - date_begin

dates = []
for i in range(date_delta.days + 1):
    dt = date_begin + timedelta(days=i)
    dates.append(dt.strftime("%Y%m%d"))

filenames = []
for f in os.listdir('#DIRECTORY'):
    for dt in dates:
        if fnmatch.fnmatch(f, 'HR_*'+dt+'*.XML'):
            filenames.append(f)

ftxt =  open('#DIRECTORY/#RID.txt','w')
for fname in filenames:
    ftxt.write(fname+'\n')
ftxt.close()

Step 2
Now we need to retrieve n-th line from a text file into variable.

Create (or use existing) File technology; dataset with one string column only; temporary interface from this dataset (just to get select query). Run it, and retrieve statement from logs.

We would end up here with something like:
select C1
from TABLE
/*$$SNPS_START_KEYSNP$CRDWG_TABLESNP$CRTABLE_NAME=#HR.RID.txtSNP$CRLOAD_FILE=#HR.DIRECTORY/HR.FILENAME.txtSNP$CRFILE_FORMAT=DSNP$CRFILE_SEP_FIELD=0x0009SNP$CRFILE_SEP_LINE=0x000ASNP$CRFILE_FIRST_ROW=#HR.vLineNumberSNP$CRFILE_ENC_FIELD=SNP$CRFILE_DEC_SEP=SNP$CRSNP$CRDWG_COLSNP$CRCOL_NAME=C1SNP$CRTYPE_NAME=STRINGSNP$CRORDER=1SNP$CRLENGTH=500SNP$CRPRECISION=500SNP$CR$$SNPS_END_KEY*/

Notice numeric variable vLineNumber (initial value is 0; would be incremented by 1 at package flow) as a SNP$CRFILE_FIRST_ROW value, and SNP$CRFILE_SEP_LINE=0x000A

Use this statement as a Select Query at Refreshing view of then FILENAME variable.

Step 3
Create an interface with XML dataset as a source and generate scenario with RID, DIRECTORY, FILENAME variables. If you don’t see those variables at scenario generation time – carefully add something dummy like nvl2('#FILENAME','','').  

Dive into new package flow. Here we would like to load files in parallel, hence execute scenario 8 times in background and wait for them before next 8 (supposed to use custom parallel-aware KMs).

In properties of ‘Execution of the Scenario’ step set “Asynchronous Mode” and keywords like HR_READ_XML_FILE_SCEN,#RID at General tab (the same keywords should be in ‘OdiWaitForChildSession’ step properties). At the Addition variables tab – HR RID #RID, and so on for FILENAME, DIRETORY.

We would also need another numeric variable like ‘vSemathore’.

And here is the final flow: