Showing posts with label integration. Show all posts
Showing posts with label integration. Show all posts

Sunday, April 23, 2017

Loading files with dynamic names in ODI

Suppose we have to load files named like ‘HR_YYYYMMDD.XML’, where YYYYMMDD represents the date of the file formation which values are differentiate within executions.

We would pass variable as a filename in Topology which will be set dynamically depending on start-up parameters.

Before we begin, make sure you have already created appropriate schema in Topology and reverse engineered required data models from these files.

Step 1
Create a variables ‘FILENAME’, ‘DIRECTORY’ in a project named HR and pass them into JDBC Url of data server in Topology.
For instance:
jdbc:snps:xml?f=#HR.DIRECTORY/#HR.FILENAME&re=Employees&ro=true&case_sens=true&dod=true&standalone=true

Create a variables for date bounds: ‘DATE_BEGIN’, ‘DATE_END’, and for run id ‘RID’.

Create a new procedure which would list files in directory, select ones with names between specified date range, and write those lines into text file.
We would choose Jython as a Target Technology for this task.
In the Details view of a procedure add a new step, for example:
import sys
import os
import fnmatch
from datetime import date, timedelta, datetime

date_begin = datetime.strptime('#DATE_BEGIN', '%Y%m%d')
date_end   = datetime.strptime('#DATE_END', '%Y%m%d')
date_delta = date_end - date_begin

dates = []
for i in range(date_delta.days + 1):
    dt = date_begin + timedelta(days=i)
    dates.append(dt.strftime("%Y%m%d"))

filenames = []
for f in os.listdir('#DIRECTORY'):
    for dt in dates:
        if fnmatch.fnmatch(f, 'HR_*'+dt+'*.XML'):
            filenames.append(f)

ftxt =  open('#DIRECTORY/#RID.txt','w')
for fname in filenames:
    ftxt.write(fname+'\n')
ftxt.close()

Step 2
Now we need to retrieve n-th line from a text file into variable.

Create (or use existing) File technology; dataset with one string column only; temporary interface from this dataset (just to get select query). Run it, and retrieve statement from logs.

We would end up here with something like:
select C1
from TABLE
/*$$SNPS_START_KEYSNP$CRDWG_TABLESNP$CRTABLE_NAME=#HR.RID.txtSNP$CRLOAD_FILE=#HR.DIRECTORY/HR.FILENAME.txtSNP$CRFILE_FORMAT=DSNP$CRFILE_SEP_FIELD=0x0009SNP$CRFILE_SEP_LINE=0x000ASNP$CRFILE_FIRST_ROW=#HR.vLineNumberSNP$CRFILE_ENC_FIELD=SNP$CRFILE_DEC_SEP=SNP$CRSNP$CRDWG_COLSNP$CRCOL_NAME=C1SNP$CRTYPE_NAME=STRINGSNP$CRORDER=1SNP$CRLENGTH=500SNP$CRPRECISION=500SNP$CR$$SNPS_END_KEY*/

Notice numeric variable vLineNumber (initial value is 0; would be incremented by 1 at package flow) as a SNP$CRFILE_FIRST_ROW value, and SNP$CRFILE_SEP_LINE=0x000A

Use this statement as a Select Query at Refreshing view of then FILENAME variable.

Step 3
Create an interface with XML dataset as a source and generate scenario with RID, DIRECTORY, FILENAME variables. If you don’t see those variables at scenario generation time – carefully add something dummy like nvl2('#FILENAME','','').  

Dive into new package flow. Here we would like to load files in parallel, hence execute scenario 8 times in background and wait for them before next 8 (supposed to use custom parallel-aware KMs).

In properties of ‘Execution of the Scenario’ step set “Asynchronous Mode” and keywords like HR_READ_XML_FILE_SCEN,#RID at General tab (the same keywords should be in ‘OdiWaitForChildSession’ step properties). At the Addition variables tab – HR RID #RID, and so on for FILENAME, DIRETORY.

We would also need another numeric variable like ‘vSemathore’.

And here is the final flow: