Reading Character-Delimited Files That Contain New-Line Characters

ApocY85 · Jan 28, 2010

I am writing a SAS program that must read in an Excel file. My team has Perl code that converts the Excel file to a CSV file that our SAS programs are able to read. Now, I've done this in the past with no problems. However, this time around the program must read in an Excel file with cell values that may contain new-line characters.

Although I still consider myself relatively new to SAS, I believe I have figured out the issue I am having. When an INPUT statement is called, an entire line is read into a buffer. It determines the end of each line by a new-line character. Since some of the cells in the Excel input file may contain new-line characters, an observation is split. When this occurs, the left-part of the observation is read in, but the right-part is read in on the next iteration as a seperate observation. Let me illustrate with example data, where '?' will represent a new-line character:

Example Excel File
============
Field 1 Field 2 Field 3 Field 4
-------------------------------------------------------------
415 Smith, John G troublesome?field orange
416 White, Bob M another value blue
417 Rogers, Bill J a third record green

If there is no new-line character in Field 3, normally the INPUT statement would read in all of the first obersvation ('415' through 'orange' values). However, since this Excel file may have a new-line character in the field, what happens is the INPUT statement reads in only part of the first observation ('415' through 'troublesome'). Then for the second iteration, the INPUT statement continues reading values from 'field' as if it has already reached the second line.

Is there any way I can read in entire observations whether there is a new-line character in a field or not, without having to modify the Perl code to remove the character? There has to be a way to let SAS know how many fields are in each observation and/or to ignore certain new-line characters. Any help will be very much appreciated!!! My current code is below, in case you would like to read/use it. Thanks, everyone

data file_data; infile csv_file dlm=&delimiter_cmd dsd truncover missover firstobs=2;
keep record_date_seconds record_date record_date_formatted;
input id:10. employee_name:$8. employee_id:$10. job_title:$50. location:$50.
location:$50. reporting_manager:$50. ticket_number:$15. record_date_seconds;
record_date = mdy(01,01,1900) + record_date_seconds - 2;
record_date_formatted = put(record_date, worddate18.);
run;
*****************************************************************;

ChrisW75 · Jan 28, 2010

Hi Apoc,
This is a problem you'll no doubt stumble over many times, it's a problem that's plagued me for years. As far as I can tell, there's no way to overcome this in SAS.
My method for dealing with this is to use a smart text editor (such as VEdit) to strip these characters out.
Generally the character that Excel puts in is a line feed character (referred to as LF or OA in Hex), so what I do is a global find and replace to remove ALL LF characters.
Then I do a second find and replace to replace of carriage return characters (CR or 0D in hex) with CRLF (0A0D) to restore the end of lines correctly.
This seems to work prety robustly.

However, if you have some decent PERL programmer there that can update the PERL program, it might be easier for them to amend the program to remove LF characters from the fields before writing them out. I'm afraid I can't help with that as I don't know PERL (beyond some exposure to regular expressions).

Sorry I can't be more help, if someone else has a better answer though, I'd love to hear it as I have to deal with this problem fairly frequently with data sent from clients.

Chris
Business Analyst, Code Monkey, Data Wrangler.
SAS Guru.

ApocY85 · Jan 28, 2010

Chris, I'm sorry to hear your frustrations, but at least I know I'm not the only person

For now, I've copied and updated the Perl program in a personal folder. I've never really programmed in Perl, but it appears my additional code is working. Unless someone else has another answer (that apparently will help both of us), perhaps I'll just end up overwriting the Perl program that is currently in use. Hopefully it won't take too long to have it overwritten (moved to production).

I'm still willing to hear any other suggestions. Thanks!

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Reading Character-Delimited Files That Contain New-Line Characters

ApocY85

Programmer

ChrisW75

Programmer

ApocY85

Programmer

Similar threads

Part and Inventory Search

Sponsor