Sunday, April 23, 2017

Loading files with dynamic names in ODI

Suppose we have to load files named like ‘HR_YYYYMMDD.XML’, where YYYYMMDD represents the date of the file formation which values are differentiate within executions.

We would pass variable as a filename in Topology which will be set dynamically depending on start-up parameters.

Before we begin, make sure you have already created appropriate schema in Topology and reverse engineered required data models from these files.

Step 1
Create a variables ‘FILENAME’, ‘DIRECTORY’ in a project named HR and pass them into JDBC Url of data server in Topology.
For instance:
jdbc:snps:xml?f=#HR.DIRECTORY/#HR.FILENAME&re=Employees&ro=true&case_sens=true&dod=true&standalone=true

Create a variables for date bounds: ‘DATE_BEGIN’, ‘DATE_END’, and for run id ‘RID’.

Create a new procedure which would list files in directory, select ones with names between specified date range, and write those lines into text file.
We would choose Jython as a Target Technology for this task.
In the Details view of a procedure add a new step, for example:
import sys
import os
import fnmatch
from datetime import date, timedelta, datetime

date_begin = datetime.strptime('#DATE_BEGIN', '%Y%m%d')
date_end   = datetime.strptime('#DATE_END', '%Y%m%d')
date_delta = date_end - date_begin

dates = []
for i in range(date_delta.days + 1):
    dt = date_begin + timedelta(days=i)
    dates.append(dt.strftime("%Y%m%d"))

filenames = []
for f in os.listdir('#DIRECTORY'):
    for dt in dates:
        if fnmatch.fnmatch(f, 'HR_*'+dt+'*.XML'):
            filenames.append(f)

ftxt =  open('#DIRECTORY/#RID.txt','w')
for fname in filenames:
    ftxt.write(fname+'\n')
ftxt.close()

Step 2
Now we need to retrieve n-th line from a text file into variable.

Create (or use existing) File technology; dataset with one string column only; temporary interface from this dataset (just to get select query). Run it, and retrieve statement from logs.

We would end up here with something like:
select C1
from TABLE
/*$$SNPS_START_KEYSNP$CRDWG_TABLESNP$CRTABLE_NAME=#HR.RID.txtSNP$CRLOAD_FILE=#HR.DIRECTORY/HR.FILENAME.txtSNP$CRFILE_FORMAT=DSNP$CRFILE_SEP_FIELD=0x0009SNP$CRFILE_SEP_LINE=0x000ASNP$CRFILE_FIRST_ROW=#HR.vLineNumberSNP$CRFILE_ENC_FIELD=SNP$CRFILE_DEC_SEP=SNP$CRSNP$CRDWG_COLSNP$CRCOL_NAME=C1SNP$CRTYPE_NAME=STRINGSNP$CRORDER=1SNP$CRLENGTH=500SNP$CRPRECISION=500SNP$CR$$SNPS_END_KEY*/

Notice numeric variable vLineNumber (initial value is 0; would be incremented by 1 at package flow) as a SNP$CRFILE_FIRST_ROW value, and SNP$CRFILE_SEP_LINE=0x000A

Use this statement as a Select Query at Refreshing view of then FILENAME variable.

Step 3
Create an interface with XML dataset as a source and generate scenario with RID, DIRECTORY, FILENAME variables. If you don’t see those variables at scenario generation time – carefully add something dummy like nvl2('#FILENAME','','').  

Dive into new package flow. Here we would like to load files in parallel, hence execute scenario 8 times in background and wait for them before next 8 (supposed to use custom parallel-aware KMs).

In properties of ‘Execution of the Scenario’ step set “Asynchronous Mode” and keywords like HR_READ_XML_FILE_SCEN,#RID at General tab (the same keywords should be in ‘OdiWaitForChildSession’ step properties). At the Addition variables tab – HR RID #RID, and so on for FILENAME, DIRETORY.

We would also need another numeric variable like ‘vSemathore’.

And here is the final flow:



Sunday, January 22, 2017

Working in Data Modeler

This note is focused on the specific case of engineering to relational model from logical using Oracle SQL Developer Data Modeler.

Logical model

Suppose we are making a data mart constellation schema for our stock.
So far we have this logical model with mostly just entities, attributes, comments and relations:

The color variation comes from choosing Classification Type within Properties for each entity: Fact or Dimension. We would like to preserve this differentiation in entity name prefix as well, so we set up the Short Name: for Date dimension it would be D001, D002 for Time.
With a lot of entities you may find this Custom Transformations Script for logical object useful:
 1 //
 2 //Sets table abbr to D00., F00., depending on Classification type name
 3 //
 4 
 5 max_num_len = 3; //F[...]
 6 var cur_num = []; //array for current value for each type
 7 
 8 //replacement for String.format("%04",value);
 9 function formatCount(len, value){
10     strval = "" + value;
11     ret = "";
12     for ( var i = 0; i < len - strval.length; i++ ){
13         ret = ret + "0";
14     }
15     ret = ret + value;
16     return ret;
17 }
18 
19 function existTableWithAbbr(abbr){
20     tables = model.getEntitySet().toArray();
21     for ( var t = 0; t < tables.length; t++ ){
22         if ( tables[t].getShortName() == abbr ){
23             return true;
24         }
25     }
26     return false;
27 }
28 
29 //Main loop
30 tables = model.getEntitySet().toArray();
31 for ( var t = 0; t < tables.length; t++ ){
32     table = tables[t];
33     name = table.getName();
34     classType = table.getClassificationType();
35     if ( table.getShortName() == "" && classType != null ){
36         typeId = classType.getTypeID();
37 
38         if ( cur_num[typeId] == null ) { // initialization
39             cur_num[typeId] = 1;
40         }
41 
42         do {
43             prefix = classType.getTypeName().substring(0,1)
44                 + formatCount(max_num_len,cur_num[typeId]);
45             cur_num[typeId] ++ ;
46         } while ( existTableWithAbbr(prefix) );
47 
48         table.setShortName(prefix);
49         table.setDirty(true);
50     }
51 
52 }

Transformations menu is under Tools → Design Rules And Transformations.

Noticed many links to same dimension from the Income order fact? Each of them have different business mining and would be further separated on virtual data mart layer. 
To give different names to these future foreign-key columns in logical model you may: manually add Primary UID attribute to dimensions and choose Bachman Notation from right-click canvas menu. Rename appeared attributes.


On the figure above you see that many attributes are of custom Domain type called %_TMPL. They would be spitted depending on this Domain type name and attributes of corresponding entity.
Lets create these entities:


Don’t forget to comment template attributes like ‘Code’, ‘Descriptions’ and to-be-spitted attributes like ‘of the article’.

The last step here is to customize naming templates. Right click on your design → Properties → Templates under Settings/Naming Standard:
Foreign Key to {child abbr}_{parent abbr}_FK 
Column Foreign Key to {ref table}
Primary and Surrogate Keys to {table abbr}_{table}_PK
Surrogate Key Column to {table}_ID

Right click on Logical Model → Engineer to Relational Model. At General Options tab disable Apply naming translation. Click the Engineer button.

Relational model

Now, use Apply Naming Standards on Relational_1 for everything except Column Foreign Key – we would rename them manually.


Next we would like to:
  • Uppercase table and constraint names;
  • Reorder columns – keys to beginning;
  • Split %_TMPL columns;
  • Copy comments to Comments in RDBMS;
  • Set primary key names to {table name}_ID;
  • Set table names to {table abbr}_{table};
  • Add appropriate prefix to fact column names.

For the first two and fourth tasks there are preinstalled scripts, and the other ones may be automated like so:
Split %_TMPL columns
 1 //
 2 //For example, would replace column "STATUS", which domain name is "CODE/NAME_TMPL", 
 3 //with a copy of "CODE/NAME_TMPL"(code, name) tables' columns
 4 //so there would be "STATUS_code" and "STATUS_name"
 5 //
 6 
 7 TMPL_POSTFIX = "_TMPL";
 8 
 9 tables = model.getTableSet().toArray();
10 for ( var t = 0; t < tables.length; t++ ){
11     table = tables[t];
12     columns = table.getElements();
13     for ( var c = 0; c < columns.length; c++ ){
14         column = columns[c];
15         domname = column.getDomain().getName(); // template table name
16         if ( domname.indexOf(TMPL_POSTFIX) == domname.length - TMPL_POSTFIX.length ){
17             template = model.getTableSet().getByName(domname);
18             if ( template != null ){
19                 prefix = column.getName();
20                 orig_comment = column.getComment();
21                 column.remove();
22                 tcolumns = template.getElements();
23                 for ( var tc = 0; tc < tcolumns.length; tc++ ){
24                     col = table.createColumn();
25                     tcol = tcolumns[tc];
26                     tcol.copy(col);
27                     col.setName(prefix + "_" + tcol.getName());
28                     col.setComment(col.getComment()+" "+orig_comment);
29                 }
30                 table.setDirty(true);
31             }
32         }
33     }
34 }

Set primary key names to {table name}_ID
 1 tables = model.getTableSet().toArray();
 2 for ( var t = 0; t < tables.length; t++ ){
 3     table = tables[t];
 4     name = table.getName();
 5     columns = table.getElements();
 6     for ( var c = 0; c < columns.length; c++ ){
 7         column = columns[c];
 8         if ( column.isPKElement() ){
 9             column.setName(name+'_ID');
10             table.setDirty(true);
11         }
12     }
13 }

Set table names to {table abbr}_{table}
 1 tables = model.getTableSet().toArray();
 2 for ( var t = 0; t < tables.length; t++ ){
 3     table = tables[t];
 4     name = table.getName();
 5     abbr = table.getAbbreviation();
 6     if ( abbr != "" ) {
 7         table.setName(table.getAbbreviation()+"_"+name);
 8         table.setDirty(true);
 9     }
10 }

Add appropriate prefix to fact column names
 1 //
 2 //Sets Fact tables' column name prefix like F001_, D003_ 
 3 //
 4 
 5 tables = model.getTableSet().toArray();
 6 for ( var t = 0; t < tables.length; t++ ){
 7     table = tables[t];
 8     classType = table.getClassificationType();
 9     if ( classType != null ){
10         if ( classType.getTypeName().equals("Fact") ){
11             prefix = table.getAbbreviation()
12             if ( ! prefix.equals("") ) {
13                 columns = table.getElements();
14                 for ( var c = 0; c < columns.length; c++ ){
15                     column = columns[c];
16                     if ( ! column.getName().substring(0,2).equals("DD") ){
17                         if ( ! column.isFKElement()  ){
18                                     column.setName(prefix+'_'+column.getName());
19                         }
20                         else {
21                             columnRefTableAbbr = column.getFKIndexAssociation().getOriginalContainer().getAbbreviation();
22                             column.setName(columnRefTableAbbr+'_'+column.getName()+'_ID');
23                         }
24                         table.setDirty(true);
25                     }
26                 }
27             }
28         }
29     }
30 }

And the result:


If this would be the real schema you will continue with unique constraints on dimensions, composite keys on facts and so on.
Anyway, now you have some ideas on how to get basic tasks done efficiently!