Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Invalid data filter 1

Status
Not open for further replies.

Queryman

Programmer
Nov 4, 2002
243
US
I have a data set which could have invalid characters
I need to take every obs which has a invalid character like &! etc and put the whole obs in data set and the rest in another.
Here is what I have so far
proc sql ;
create table structures2 as
select trim(CLIENT_NAME||trim(SEGMENT)||trim(ANALYSIS_YEAR)||trim(STRUCTURE_CODE)||trim(INCLUDE)||
trim(OPERATIONAL_ID)||trim(ANALYST))
as FALIST
from structures ;
quit;

*MAKE TWO DATASETS, ONE WITH GOOD DATA & ONE WITH BAD DATA;

data good bad;
set WORK.structures2;
if (verify(FALIST,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.'))
then output good;
else output bad;
run;



Michael

 
Michael,

Here is an untested code:

data good bad;
set WORK.structures2;
if compress(upcase(FALIST),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.') = ''
then output good;
else output bad;
run;

I saw your question on SAS-L. If you read from an external field, You could do it in one data step.

data good bad;
infile "your file";
input Yourfields;
if compress(uppcase(_infile_), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.') = ''
then output good;
else output bad;
run;



 
Works Great!, Thanks!

How can I make this look like the original dataset again. I need to send the file back to the user looking like the original structures dataset.

Something like
data WORK.structures3;

informat CLIENT_NAME $10. ;
informat SEGMENT $54. ;
informat ANALYSIS_YEAR $4. ;
informat STRUCTURE_CODE $8. ;
informat INCLUDE $7. ;
informat OPERATIONAL_ID $15. ;
informat ANALYST $25. ;
format CLIENT_NAME $10. ;
format SEGMENT $54. ;
format ANALYSIS_YEAR $4. ;
format STRUCTURE_CODE $8. ;
format INCLUDE $7. ;
format OPERATIONAL_ID $15. ;
format ANALYST $25. ;
set bad;
run;



Michael

 
Again untested:

data _null_;
infile "your input file";
file "your output good file";
input;
if verify(upcase(_infile_),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.')
then put _infile_;
run;

Same thing for bad file, just add a NOT in front of verify.

You have to add delimiter in your good character list.
 
This works

data good bad;
set WORK.structures2;
if compress(upcase(FALIST),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.') = ''
then output good;
else output bad;
run;

This does not work, it does not pick up the bad character

data _null_;
infile "c:\temp\structures.txt";
file "c:\temp\good_structures.txt";
input;
if verify(upcase(_infile_),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.')
then put _infile_;
run;



Michael

 

_infile_ is your data in its origial format. You need you add delimiter in the good character list.

Please give details on what does not work.
 
Here is a sample of my data
CLIENT_NAME SEGMENT ANALYSIS_YEAR STRUCTURE_CODE INCLUDE OPERATIONAL_ID ANALYST
CONSUMERS ACTIVE ANY GROUP INCLUDE 6530511 WEHRLE
CONSUMERS& SALARIED ANY GROUP INCLUDE 6530511 WEHRLE


The last record should not go to good, since it has the & in the data
I am trying to do it in one step and create a .txt file with good data and then I will create one with bad data. Here is my code


data _null_;
infile 'c:\temp\structures.txt' delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
informat CLIENT_NAME $10. ;
informat SEGMENT $54. ;
informat ANALYSIS_YEAR $4. ;
informat STRUCTURE_CODE $8. ;
informat INCLUDE $7. ;
informat OPERATIONAL_ID $15. ;
informat ANALYST $25. ;
format CLIENT_NAME $10. ;
format SEGMENT $54. ;
format ANALYSIS_YEAR $4. ;
format STRUCTURE_CODE $8. ;
format INCLUDE $7. ;
format OPERATIONAL_ID $15. ;
format ANALYST $25. ;
input
CLIENT_NAME $
SEGMENT $
ANALYSIS_YEAR $
STRUCTURE_CODE $
INCLUDE $
OPERATIONAL_ID $
ANALYST $;
file "c:\temp\good_structures.txt";
if verify(upcase(_infile_),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-. ')
then put _infile_;
run;

Here is my log

NOTE: The infile 'c:\temp\structures.txt' is:
File Name=c:\temp\structures.txt,
RECFM=V,LRECL=32767

NOTE: The file "c:\temp\good_structures.txt" is:
File Name=c:\temp\good_structures.txt,
RECFM=V,LRECL=256

NOTE: 121 records were read from the infile 'c:\temp\structures.txt'.
The minimum record length was 38.
The maximum record length was 62.
NOTE: 121 records were written to the file "c:\temp\good_structures.txt".
The minimum record length was 38.
The maximum record length was 62.
NOTE: DATA statement used:
real time 0.04 seconds
cpu time 0.01 seconds





Michael

 
Do this:

data tst;
infile "c:\temp\structures.txt";
file "c:\temp\good_structures.txt";
input;
found = compress(upcase(_infile_),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.');
if found = '' then put _infile_;
run;

Check variable "found" to see wether you included every character in your list.


 
The VERIFY function returns the position of the first character in source that is not present in any excerpt. If VERIFY finds every character in source in at least one excerpt, it returns a 0.


It looks like function Verify only works for letters. Try compress instead of verify.

BTW, You do not need to read indivual fields if you just want to output to an external file.

data _null_;
infile 'c:\temp\structures.txt' delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
input;
file "c:\temp\good_structures.txt";
if compress(upcase(_infile_),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-. ') = ''
then put _infile_;
run;
 
1067 data tst;
1068 infile "c:\temp\structures.txt";
1069 file "c:\temp\good_structures.txt";
1070 input;
1071 found = compress(upcase(_infile_),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.');

1072 if found = '' then put _infile_;
Here is my new log
I am now getting really confused, the dataset tst has a lot of characters like little boxes for the variable found.


NOTE: The infile "c:\temp\structures.txt" is:
File Name=c:\temp\structures.txt,
RECFM=V,LRECL=256

NOTE: The file "c:\temp\good_structures.txt" is:
File Name=c:\temp\good_structures.txt,
RECFM=V,LRECL=256

NOTE: 122 records were read from the infile "c:\temp\structures.txt".
The minimum record length was 38.
The maximum record length was 79.
NOTE: 0 records were written to the file "c:\temp\good_structures.txt".
NOTE: The data set WORK.TST has 122 observations and 1 variables.
NOTE: DATA statement used:
real time 0.14 seconds
cpu time 0.09 seconds






Michael

 
Here is my log for your last posting, it did not place any records in the good file.


1082
1083 data _null_;
1084 infile 'c:\temp\structures.txt' delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;

1085 input;
1086 file "c:\temp\good_structures.txt";
1087 if compress(upcase(_infile_),'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-.') = ''

1088 then put _infile_;
1089 run;

NOTE: The infile 'c:\temp\structures.txt' is:
File Name=c:\temp\structures.txt,
RECFM=V,LRECL=32767

NOTE: The file "c:\temp\good_structures.txt" is:
File Name=c:\temp\good_structures.txt,
RECFM=V,LRECL=256

NOTE: 121 records were read from the infile 'c:\temp\structures.txt'.
The minimum record length was 38.
The maximum record length was 62.
NOTE: 0 records were written to the file "c:\temp\good_structures.txt".
NOTE: DATA statement used:
real time 0.02 seconds
cpu time 0.00 seconds




Michael

 
Your file is from another operating system. Those little boxes are spaces or tabs from another system.

Try this:

data _null_;
infile 'c:\temp\structures.txt' delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
informat CLIENT_NAME $10. ;
informat SEGMENT $54. ;
informat ANALYSIS_YEAR $4. ;
informat STRUCTURE_CODE $8. ;
informat INCLUDE $7. ;
informat OPERATIONAL_ID $15. ;
informat ANALYST $25. ;
format CLIENT_NAME $10. ;
format SEGMENT $54. ;
format ANALYSIS_YEAR $4. ;
format STRUCTURE_CODE $8. ;
format INCLUDE $7. ;
format OPERATIONAL_ID $15. ;
format ANALYST $25. ;
input
CLIENT_NAME $
SEGMENT $
ANALYSIS_YEAR $
STRUCTURE_CODE $
INCLUDE $
OPERATIONAL_ID $
ANALYST $;
file "c:\temp\good_structures.txt";
all = trim(CLIENT_NAME)||trim(SEGMENT)||trim(ANALYSIS_YEAR)||trim(STRUCTURE_CODE)|| trim (INCLUD
trim(OPERATIONAL_ID)|| ANALYST;

if compress(all,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-. ') = ''
then put _infile_;
run;
 
The part you sent worked with very minor changes for the good file, for the bad file, I changed the compress to not compress like so & it DID NOT WORK!!!

Here is my log

1222 data _null_;
1223 infile 'c:\temp\structures.txt' delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;

1224 informat CLIENT_NAME $10. ;
1225 informat SEGMENT $54. ;
1226 informat ANALYSIS_YEAR $4. ;
1227 informat STRUCTURE_CODE $8. ;
1228 informat INCLUDE $7. ;
1229 informat OPERATIONAL_ID $15. ;
1230 informat ANALYST $25. ;
1231 format CLIENT_NAME $10. ;
1232 format SEGMENT $54. ;
1233 format ANALYSIS_YEAR $4. ;
1234 format STRUCTURE_CODE $8. ;
1235 format INCLUDE $7. ;
1236 format OPERATIONAL_ID $15. ;
1237 format ANALYST $25. ;
1238 input
1239 CLIENT_NAME $
1240 SEGMENT $
1241 ANALYSIS_YEAR $
1242 STRUCTURE_CODE $
1243 INCLUDE $
1244 OPERATIONAL_ID $
1245 ANALYST $;
1246 file "c:\temp\bad_structures.txt";
1247 all =
1247! trim(CLIENT_NAME||trim(SEGMENT)||trim(ANALYSIS_YEAR)||trim(STRUCTURE_CODE)||trim(INCLUDE)|
1247! |
1248 trim(OPERATIONAL_ID)||trim(ANALYST)) ;
1249 if not compress(all,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-. ') = ''
1250 then put _infile_;
1251 run;

NOTE: Character values have been converted to numeric values at the places given by:
(Line):(Column).
1249:8 1249:67
NOTE: The infile 'c:\temp\structures.txt' is:
File Name=c:\temp\structures.txt,
RECFM=V,LRECL=32767

NOTE: The file "c:\temp\bad_structures.txt" is:
File Name=c:\temp\bad_structures.txt,
RECFM=V,LRECL=256

NOTE: Invalid numeric data, '&' , at line 1249 column 8.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+-

10 CHAR CONSUMERS&.SALARIED.ANY.GROUP.INCLUDE.6530511.WEHRLE 52
ZONE 4445544552054445444044504545504444544033333330544544
NUMR 3FE35D5236931C1295491E9972F5099E3C5459653051197582C5
CLIENT_NAME=CONSUMERS& SEGMENT=SALARIED ANALYSIS_YEAR=ANY STRUCTURE_CODE=GROUP INCLUDE=INCLUDE
OPERATIONAL_ID=6530511 ANALYST=WEHRLE all=CONSUMERS&SALARIEDANYGROUPINCLUDE6530511WEHRLE
_ERROR_=1 _INFILE_=CONSUMERS& SALARIED ANY GROUP INCLUDE 6530511 WEHRLE _N_=9
NOTE: 121 records were read from the infile 'c:\temp\structures.txt'.
The minimum record length was 38.
The maximum record length was 62.
NOTE: 0 records were written to the file "c:\temp\bad_structures.txt".
NOTE: DATA statement used:
real time 0.04 seconds
cpu time 0.03 seconds




Michael

 
I jumped the gun, I had to do this to get it to work

data _null_;
infile 'c:\temp\structures.txt' delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
informat CLIENT_NAME $10. ;
informat SEGMENT $54. ;
informat ANALYSIS_YEAR $4. ;
informat STRUCTURE_CODE $8. ;
informat INCLUDE $7. ;
informat OPERATIONAL_ID $15. ;
informat ANALYST $25. ;
format CLIENT_NAME $10. ;
format SEGMENT $54. ;
format ANALYSIS_YEAR $4. ;
format STRUCTURE_CODE $8. ;
format INCLUDE $7. ;
format OPERATIONAL_ID $15. ;
format ANALYST $25. ;
input
CLIENT_NAME $
SEGMENT $
ANALYSIS_YEAR $
STRUCTURE_CODE $
INCLUDE $
OPERATIONAL_ID $
ANALYST $;
file "c:\temp\bad_structures.txt";
all = trim(CLIENT_NAME||trim(SEGMENT)||trim(ANALYSIS_YEAR)||trim(STRUCTURE_CODE)||trim(INCLUDE)||
trim(OPERATIONAL_ID)||trim(ANALYST)) ;
if compress(all,'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-. ') NE ''
then put _infile_;
run;



Michael

 
Michael,

Compress function removes specific characters from a character string and keep the rest.

If there is nothing left, it is good record. Otherwise it is bad. That is why you can not use not compress.

There is no need for all the informats and formats in your data step.

data _null_;
infile 'c:\temp\structures.txt' delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=2 ;
input
CLIENT_NAME :$10.
SEGMENT :$54.
ANALYSIS_YEAR :$
STRUCTURE_CODE :$
INCLUDE :$
OPERATIONAL_ID :$15.
ANALYST :$25.;

:
allows you to specify an informat that the INPUT statement uses to read the variable value. This format modifier reads the value from the next nonblank column until the pointer reaches the next blank column, the defined length of the variable, or the end of the data line, whichever comes first.

Look it up in your SAS base manual for INPUT statment and list input.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top