SIMPLETHORA: integration

Loading files with dynamic names in ODI

Suppose we have to load files named like ‘HR_YYYYMMDD.XML’, where YYYYMMDD represents the date of the file formation which values are differentiate within executions.

We would pass variable as a filename in Topology which will be set dynamically depending on start-up parameters.

Before we begin, make sure you have already created appropriate schema in Topology and reverse engineered required data models from these files.

Step 1

Create a variables ‘FILENAME’, ‘DIRECTORY’ in a project named HR and pass them into JDBC Url of data server in Topology.

For instance:

jdbc:snps:xml?f=#HR.DIRECTORY/#HR.FILENAME&re=Employees&ro=true&case_sens=true&dod=true&standalone=true

Create a variables for date bounds: ‘DATE_BEGIN’, ‘DATE_END’, and for run id ‘RID’.

Create a new procedure which would list files in directory, select ones with names between specified date range, and write those lines into text file.

We would choose Jython as a Target Technology for this task.

In the Details view of a procedure add a new step, for example:

import sys

import os

import fnmatch

from datetime import date, timedelta, datetime

date_begin = datetime.strptime('#DATE_BEGIN', '%Y%m%d')

date_end = datetime.strptime('#DATE_END', '%Y%m%d')

date_delta = date_end - date_begin

dates = []

for i in range(date_delta.days + 1):

dt = date_begin + timedelta(days=i)

dates.append(dt.strftime("%Y%m%d"))

filenames = []

for f in os.listdir('#DIRECTORY'):

for dt in dates:

if fnmatch.fnmatch(f, 'HR_*'+dt+'*.XML'):

filenames.append(f)

ftxt = open('#DIRECTORY/#RID.txt','w')

for fname in filenames:

ftxt.write(fname+'\n')

ftxt.close()

Step 2

Now we need to retrieve n-th line from a text file into variable.

Create (or use existing) File technology; dataset with one string column only; temporary interface from this dataset (just to get select query). Run it, and retrieve statement from logs.

We would end up here with something like:

select C1

from TABLE

/*$$SNPS_START_KEYSNP$CRDWG_TABLESNP$CRTABLE_NAME=#HR.RID.txtSNP$CRLOAD_FILE=#HR.DIRECTORY/HR.FILENAME.txtSNP$CRFILE_FORMAT=DSNP$CRFILE_SEP_FIELD=0x0009SNP$CRFILE_SEP_LINE=0x000ASNP$CRFILE_FIRST_ROW=#HR.vLineNumberSNP$CRFILE_ENC_FIELD=SNP$CRFILE_DEC_SEP=SNP$CRSNP$CRDWG_COLSNP$CRCOL_NAME=C1SNP$CRTYPE_NAME=STRINGSNP$CRORDER=1SNP$CRLENGTH=500SNP$CRPRECISION=500SNP$CR$$SNPS_END_KEY*/

Notice numeric variable vLineNumber (initial value is 0; would be incremented by 1 at package flow) as a SNP$CRFILE_FIRST_ROW value, and SNP$CRFILE_SEP_LINE=0x000A

Use this statement as a Select Query at Refreshing view of then FILENAME variable.

Step 3

Create an interface with XML dataset as a source and generate scenario with RID, DIRECTORY, FILENAME variables. If you don’t see those variables at scenario generation time – carefully add something dummy like nvl2('#FILENAME','','').

Dive into new package flow. Here we would like to load files in parallel, hence execute scenario 8 times in background and wait for them before next 8 (supposed to use custom parallel-aware KMs).

In properties of ‘Execution of the Scenario’ step set “Asynchronous Mode” and keywords like HR_READ_XML_FILE_SCEN,#RID at General tab (the same keywords should be in ‘OdiWaitForChildSession’ step properties). At the Addition variables tab – HR RID #RID, and so on for FILENAME, DIRETORY.

We would also need another numeric variable like ‘vSemathore’.

And here is the final flow:

This note is based on articles by Cezar Santos at ODIExperts.com:
“How to refresh ODI variables from file – Part 1 – Just one value”;
“How to refresh ODI variables from file – Part 2 – Getting all lines, once at time”;
“Getting one or several unknown files from a directory”.

Sunday, April 23, 2017

Loading files with dynamic names in ODI