Loading files with dynamic names in ODI
Suppose we have to load files named like ‘HR_YYYYMMDD.XML’, where YYYYMMDD represents the date of the file formation which values are differentiate within executions.
We would pass variable as a filename in Topology which will be set dynamically depending on start-up parameters.
Before we begin, make sure you have already created appropriate schema in Topology and reverse engineered required data models from these files.
Step 1
Create a variables ‘FILENAME’, ‘DIRECTORY’ in a project named HR and pass them into JDBC Url of data server in Topology.
For instance:
jdbc:snps:xml?f=#HR.DIRECTORY/#HR.FILENAME&re=Employees&ro=true&case_sens=true&dod=true&standalone=true
Create a variables for date bounds: ‘DATE_BEGIN’, ‘DATE_END’, and for run id ‘RID’.
Create a new procedure which would list files in directory, select ones with names between specified date range, and write those lines into text file.
We would choose Jython as a Target Technology for this task.
In the Details view of a procedure add a new step, for example:
import sys
import os
import fnmatch
from datetime import date, timedelta, datetime
date_begin = datetime.strptime('#DATE_BEGIN', '%Y%m%d')
date_end = datetime.strptime('#DATE_END', '%Y%m%d')
date_delta = date_end - date_begin
dates = []
for i in range(date_delta.days + 1):
dt = date_begin + timedelta(days=i)
dates.append(dt.strftime("%Y%m%d"))
filenames = []
for f in os.listdir('#DIRECTORY'):
for dt in dates:
if fnmatch.fnmatch(f, 'HR_*'+dt+'*.XML'):
filenames.append(f)
ftxt = open('#DIRECTORY/#RID.txt','w')
for fname in filenames:
ftxt.write(fname+'\n')
ftxt.close()
Step 2
Now we need to retrieve n-th line from a text file into variable.
Create (or use existing) File technology; dataset with one string column only; temporary interface from this dataset (just to get select query). Run it, and retrieve statement from logs.
We would end up here with something like:
select C1
from TABLE
/*$$SNPS_START_KEYSNP$CRDWG_TABLESNP$CRTABLE_NAME=#HR.RID.txtSNP$CRLOAD_FILE=#HR.DIRECTORY/HR.FILENAME.txtSNP$CRFILE_FORMAT=DSNP$CRFILE_SEP_FIELD=0x0009SNP$CRFILE_SEP_LINE=0x000ASNP$CRFILE_FIRST_ROW=#HR.vLineNumberSNP$CRFILE_ENC_FIELD=SNP$CRFILE_DEC_SEP=SNP$CRSNP$CRDWG_COLSNP$CRCOL_NAME=C1SNP$CRTYPE_NAME=STRINGSNP$CRORDER=1SNP$CRLENGTH=500SNP$CRPRECISION=500SNP$CR$$SNPS_END_KEY*/
Notice numeric variable vLineNumber (initial value is 0; would be incremented by 1 at package flow) as a SNP$CRFILE_FIRST_ROW value, and SNP$CRFILE_SEP_LINE=0x000A
Use this statement as a Select Query at Refreshing view of then FILENAME variable.
Step 3
Create an interface with XML dataset as a source and generate scenario with RID, DIRECTORY, FILENAME variables. If you don’t see those variables at scenario generation time – carefully add something dummy like nvl2('#FILENAME','','').
Dive into new package flow. Here we would like to load files in parallel, hence execute scenario 8 times in background and wait for them before next 8 (supposed to use custom parallel-aware KMs).
In properties of ‘Execution of the Scenario’ step set “Asynchronous Mode” and keywords like HR_READ_XML_FILE_SCEN,#RID at General tab (the same keywords should be in ‘OdiWaitForChildSession’ step properties). At the Addition variables tab – HR RID #RID, and so on for FILENAME, DIRETORY.
We would also need another numeric variable like ‘vSemathore’.
And here is the final flow:
This note is based on articles by Cezar Santos at ODIExperts.com:
“How to refresh ODI variables from file – Part 1 – Just one value”;
“How to refresh ODI variables from file – Part 2 – Getting all lines, once at time”;
“Getting one or several unknown files from a directory”.
“How to refresh ODI variables from file – Part 1 – Just one value”;
“How to refresh ODI variables from file – Part 2 – Getting all lines, once at time”;
“Getting one or several unknown files from a directory”.
Hello,
ReplyDeleteI have a case which is more simpler than this, Im specifying the file_name when launching the package scenario I set jdbc
jdbc:snps:xml?f=#POC_SAMPLE.Xml_File_Name&s=#POC_Sample.Schema_Name
and Im passing these variables within the interface as NVL2('#_POC_SAMPLE.File_Name',null,null)...and in the package
still when I start my scenario the file is not being loaded i get a failure on the source connection and if I set the schema name as static I can only load one file if I try another I get an error that the temporary schema already exists
Any advise?
Thanks.