Hi everyone,
I have a tricky problem with table lookups and the implicit retention of variables read from input datasets (i.e. in the program data vector). We have a base table with several million records and >1,000 variables; the lookup tables are also in the region of several hundred thousand observations and several hundred variables.
Selected variables have to be updated, so the lookup tables are read by key and only the variables involved are kept. After basic _IORC_ error handling, an include file (>100,000 lines of code) is executed that picks the correct assignment / calculation in a matrix-style selection process. These include files are maintained by a different part of the organization.
That's where the problem arises: as soon as the include file references a new variable that is not explicitly kept from its lookup table, a previously read value of a same-named variable from another lookup table may silently be used instead. Here is a very simple example:
[tt]/*-- base table w/keys */
data base;
x = 1; output;
x = 2; output;
x = 3; output;
run;
/*-- variable lookup 1 */
data tree1 (index=(x));
x = 1; a1 = .; a2=1; output; /*-- missings allowed! */
x = 2; a1 = 1; a2=1; output;
x = 3; a1 = .; a2=1; output;
run;
/*-- variable lookup 2 */
data tree2 (index=(x));
x = 1; a2 = .; output;
x = 2; a2 = 2; output;
x = 3; a2 = 2; output;
run;
/*-- data enrichment */
data new1 (keep = x a1 b1)
     new2 (keep = x a2 b2);
  set base;
  /*-- lookup #1 */
  set tree1 (keep = x a1) key=x / unique;
  if _iorc_ ne 0 then
    _error_ = 0;
  /*-- simulate business logic include file #1 */
  b1 = a2;
  /*-- lookup #2 */
  set tree2 (keep = x a2) key=x / unique;
  if _iorc_ ne 0 then
    _error_ = 0;
  /*-- simulate business logic include file #2 */
  b2 = a2;
run;[/tt]
The resulting table work.new1 is the problem: for variable A2 I would expect all values to be 1, yet because the value read from the second lookup table is implicitly retained in the program data vector, that table's A2 is picked up instead, "delayed" by one observation:
[tt]OBS X A1 A2
1 1 . .
2 2 1 .
3 3 . 2[/tt]
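For completeness, here is a defensive variant of the enrichment step (a sketch only, on the toy tables above): explicitly resetting every lookup-sourced variable with CALL MISSING before each keyed read. A stale value then surfaces as a missing instead of silently propagating, though with our real tables the variable lists would have to be generated rather than hardcoded:
[tt]/*-- defensive enrichment (sketch) */
data new1 (keep = x a1 b1)
     new2 (keep = x a2 b2);
  set base;
  /*-- reset ALL lookup variables before include #1;
       a stale a2 now shows up as missing, not as a wrong value */
  call missing(a1, a2);
  set tree1 (keep = x a1) key=x / unique;
  if _iorc_ ne 0 then
    _error_ = 0;
  b1 = a2; /*-- consistently missing instead of "delayed" */
  /*-- reset before include #2 */
  call missing(a2);
  set tree2 (keep = x a2) key=x / unique;
  if _iorc_ ne 0 then
    _error_ = 0;
  b2 = a2;
run;[/tt]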
Does anyone have an idea how best to identify this kind of issue? We are talking about a huge number of data steps in which it could be lurking, grown over a number of years...
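For what it's worth, one partial check I sketched (WORK, TREE1 and TREE2 stand in for the real library and tables): querying DICTIONARY.COLUMNS for variable names that occur in more than one lookup table, since only those can collide in the PDV:
[tt]proc sql;
  create table name_collisions as
    select upcase(name) as name,
           count(distinct memname) as n_tables
    from dictionary.columns
    where libname = 'WORK'
      and memname in ('TREE1', 'TREE2')
      and upcase(name) ne 'X'  /*-- the key is shared on purpose */
    group by 1
    having n_tables > 1;
quit;[/tt]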
Many thanks! Cheers, Matthias