Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Data Scrub, within comma delimited record 1

Status
Not open for further replies.

SiouxCityElvis

Programmer
Jun 6, 2003
228
US
I am running RMCOBOL-85 on Linux.

I am reading in comma delimited file records. I am not allowed to have our clients send us the data any different than how they currently do.
The challenge is that when they take their excel spreadsheet, save as .csv, fields that have commas within them will put quotes around that field.


Example of record I receive within a comma delimited file. And I have to handle this scrubbing of data within my COBOL code.

Bill Smith,444-55-6666,999999999,1234567,"$1,060.46 "

So, my question is, how am I to UNSTRING BLAH-REC DELIMITED BY ","
when I run into the trouble of "," within a field?

Also, I need to scrub the "$" out.
The fields vary in length, so I'm thinking I can't just substring the record and parse it into different fields then scrub the fields individually.
There's going to have to be INSPECT statements somewhere in the creative approach to this I imagine, but beyond that, I'm stumped.
Thanks.
-David
 
David,
Bearing in mind the fact you have got to remove some of the text, it might be easier to consider examining each line character by character and splitting the line yourself. This way you could turn an IGNORE-COMMA-FLAG on everytime you hit and apostrophe, and turn it off as you reach the match of the pair. You could get rid of the $ and any other characters along the way.

You could do this using PERFORMs and the reference modification thingy which enables you to move along the line byte by byte.

These are just my first thoughts, and there may be a better way, plus I'm not sure how efficient this would be.

Marc
 
I try to avoid situations like this by using something other than comma as the delimiter (say a bar "|"). Given you can't do that, I might approach the problem by unstringing only one field at a time, something like:
Code:
IF WS-IN(1:1) = QUOTE
     UNSTRING WS-IN(2:)
         INTO WS-FIELD
              WS-REST DELIMITED BY QUOTE
ELSE
     UNSTRING WS-IN
         INTO WS-FIELD
              WS-REST DELIMITED BY ","
END-IF

Process WS-FIELD and then MOVE WS-REST TO WS-IN and repeat as needed. Processing amounts with the currency sign can be handled similarly: use reference modification to drop the sign and NUMVAL to convert the resulting string to a numeric value (for instance).

Using pointers might be another approach that would be useful or could be used in conjunction with the above; hard to say without coding up the whole thing :)

Regards.

Glenn
 
Thanks.
Yeah, when the data is handled here in house, we delimited with a "|". Unfortunately, the boss wants to not have to make the clients do any extra footwork whatsoever other than save as a .csv file and put it out on the internet printer for me to grab.

 
how about ...

Code:
78 max-fields value 5.
01 ws-fields.
   05 ws-field  occurd max-fields times pic x(100).

unstring input-record
  into ws-field(1),
       ws-field(2),
       ws-field(3),
       ws-field(4),
       ws-field(5)
       delimited by ","
end-unstring

perform varying ws-ctr from 1 by 1 until ws-ctr > max-fields
  move zeros to comma-ctr
  inspect field(ctr) tallying comma-ctr for characters before leading ","
  if comma-ctr < 100
    inspect field(ctr) replacing all '"' by spaces
  end-if
  inspect field(ctr) replacing all '$' by spaces
end-perform
 
Another soluction could be trying to INSPECT the input file before UNSTRING and then, proceed to UNSTRING the input field in your result fields replacing the unwanted chareacters. ;)


Have a nice day....
 
My 2 cents would be to copy to the record into a hold area in working storage one character at a time. replace any comma by a bar and any $ by a space. If you come across a "
make a condition true. when you find a comma test for the condition to be true if it is then leave the comma alone.
when you come across a " and the condition is true then make the condition false. Then juse unstring the hold area by "|".

Psuedo code

01 cond value false.
01 sub pic 999 value zeroes.
01 sub2 pic 999 value zeroes.

loop.
add 1 to sub.
if sub exceeds some number then exit the loop.
if char(sub:1) = '"' and condition true
then set condition false.
if char(sub:1) = '"' and condition false
then set condition true.
if char(sub:1) = ',' and condition true
next sentence else
move "|" to char(sub:1).

if char(sub:1) = '$' go to loop.

add 1 to sub2.
move char1(sub:1) to hold(sub2:1).
go to loop
 
David,

This is a continuation from your previous thread thread209-665289.

In that thread I had some suggested code that I modify below.

Code:
       identification division.
       program-id.  unstring-fields.
       data division.
       working-storage section.
       01  binary.
           02  I               PIC S9(4).
           02  J               PIC S9(4).
       01  INPUT-FIELD  PIC X(123) VALUE 
           'Bill Smith,444-55-6666,999999999,1234567,"$1,060.46 "'.
       78  INPUT-FIELD-SIZE value LENGTH OF INPUT-FIELD.

       01  .
           02  OCCURS 40.
               03  FIELD-HOLDER PIC X(50).
               03  FIELD-LENGTH PIC 9(4).
       procedure division.
       a.
           MOVE 1 to I.
           MOVE 1 TO J.
           PERFORM UNTIL I > INPUT-FIELD-SIZE
               INSPECT INPUT-FIELD (I:) TALLYING I FOR LEADING SPACE
               IF I NOT > INPUT-FIELD-SIZE
                   EVALUATE INPUT-FIELD (I:1)
                   WHEN '"'
                       ADD 1 TO I
                       IF I NOT > INPUT-FIELD-SIZE
                           UNSTRING INPUT-FIELD DELIMITED BY QUOTE
                               INTO FIELD-HOLDER (J)
                               COUNT IN FIELD-LENGTH (J)
                               POINTER I
                           END-UNSTRING
                           DISPLAY J convert, "=>" 
                               FIELD-HOLDER (J) (1: FIELD-LENGTH (J))
                               "<="
                           ADD 1 to J
                       END-IF
                   WHEN OTHER
                       UNSTRING INPUT-FIELD DELIMITED BY ","
                           INTO FIELD-HOLDER (J)
                           COUNT IN FIELD-LENGTH (J)
                           POINTER I
                       END-UNSTRING
                       DISPLAY J convert, "=>" 
                           FIELD-HOLDER (J) (1: FIELD-LENGTH (J))
                           "<="
                       ADD 1 to J
                   END-EVALUATE
               END-IF
           END-PERFORM.
           SUBTRACT 1 FROM J.
           DISPLAY "Final J=", J CONVERT.
           stop run.

Here is the output:
Code:
1    =>Bill Smith<=
2    =>444-55-6666<=
3    =>999999999<=
4    =>1234567<=
5    =>$1,060.46 <=
Final J=5

You have already discovered how to de-edit numeric values.

Note that I have made the simplifying assumption that a quoted field contains no embedded quotes.

Enjoy!

Tom Morrison
 
Oops. Forgot to scan for comma after terminating quote mark.
Code:
       identification division.
       program-id.  unstring-fields.
       data division.
       working-storage section.
       01  binary.
           02  I               PIC S9(4).
           02  J               PIC S9(4).
       01  INPUT-FIELD  PIC X(123) VALUE 
           'Bill Smith,"444-55-6666",999999999,1234567,"$1,060.46 "'.
       78  INPUT-FIELD-SIZE value LENGTH OF INPUT-FIELD.

       01  .
           02  OCCURS 40.
               03  FIELD-HOLDER PIC X(50).
               03  FIELD-LENGTH PIC 9(4).
       procedure division.
       a.
           MOVE 1 to I.
           MOVE 1 TO J.
           PERFORM UNTIL I > INPUT-FIELD-SIZE
               INSPECT INPUT-FIELD (I:) TALLYING I FOR LEADING SPACE
               IF I NOT > INPUT-FIELD-SIZE
                   EVALUATE INPUT-FIELD (I:1)
                   WHEN '"'
                       ADD 1 TO I
                       IF I NOT > INPUT-FIELD-SIZE
                           UNSTRING INPUT-FIELD DELIMITED BY QUOTE
                               INTO FIELD-HOLDER (J)
                               COUNT IN FIELD-LENGTH (J)
                               POINTER I
                           END-UNSTRING
                           DISPLAY J convert, "=>" 
                               FIELD-HOLDER (J) (1: FIELD-LENGTH (J))
                               "<="
                           ADD 1 to J
                           PERFORM VARYING I FROM I by 1
                                     UNTIL I > INPUT-FIELD-SIZE
                                        OR INPUT-FIELD (I:1) = ","
                               CONTINUE
                           END-PERFORM
                           ADD 1 TO I
                       END-IF
                   WHEN OTHER
                       UNSTRING INPUT-FIELD DELIMITED BY ","
                           INTO FIELD-HOLDER (J)
                           COUNT IN FIELD-LENGTH (J)
                           POINTER I
                       END-UNSTRING
                       DISPLAY J convert, "=>" 
                           FIELD-HOLDER (J) (1: FIELD-LENGTH (J))
                           "<="
                       ADD 1 to J
                   END-EVALUATE
               END-IF
           END-PERFORM.
           SUBTRACT 1 FROM J.
           DISPLAY "Final J=", J CONVERT.
           stop run.

Tom Morrison
 
If you are assured that the terminating quote will be followed by the field separating comma without intervening spaces, then this is a bit more elegant.
Code:
       identification division.
       program-id.  unstring-fields.
       data division.
       working-storage section.
       01  binary.
           02  I               PIC S9(4).
           02  J               PIC S9(4).
       01  the-delimiter       PIC X(3).             
       01  INPUT-FIELD  PIC X(123) VALUE 
           'Bill Smith,"444-55-6666",999999999,1234567,"$1,060.46 "'.
       78  INPUT-FIELD-SIZE value LENGTH OF INPUT-FIELD.

       01  .
           02  OCCURS 40.
               03  FIELD-HOLDER PIC X(50).
               03  FIELD-LENGTH PIC 9(4).
       procedure division.
       a.
           MOVE 1 to I.
           MOVE 1 TO J.
           PERFORM UNTIL I > INPUT-FIELD-SIZE
               INSPECT INPUT-FIELD (I:) TALLYING I FOR LEADING SPACE
               IF I NOT > INPUT-FIELD-SIZE
                   EVALUATE INPUT-FIELD (I:1)
                   WHEN '"'
                       ADD 1 TO I
                       IF I NOT > INPUT-FIELD-SIZE
                           UNSTRING INPUT-FIELD 
                               DELIMITED BY '",' OR '"'
                               INTO FIELD-HOLDER (J)
                               DELIMITER IN THE-DELIMITER
                               COUNT IN FIELD-LENGTH (J)
                               POINTER I
                           END-UNSTRING
                           DISPLAY J convert, "=>" 
                               FIELD-HOLDER (J) (1: FIELD-LENGTH (J))
                               "<= terminated by ", THE-DELIMITER
                           ADD 1 to J
                       END-IF
                   WHEN OTHER
                       UNSTRING INPUT-FIELD DELIMITED BY ","
                           INTO FIELD-HOLDER (J)
                           DELIMITER IN THE-DELIMITER
                           COUNT IN FIELD-LENGTH (J)
                           POINTER I
                       END-UNSTRING
                       DISPLAY J convert, "=>" 
                           FIELD-HOLDER (J) (1: FIELD-LENGTH (J))
                           "<= terminated by ", THE-DELIMITER
                       ADD 1 to J
                   END-EVALUATE
               END-IF
           END-PERFORM.
           SUBTRACT 1 FROM J.
           DISPLAY "Final J=", J CONVERT.
           stop run.

Tom Morrison
 
Hi,

Have you tried:

DELIMITED BY ',' OR ',"' OR OR '",' ',"$'

You may also want to ref/mod the sending field on the outside chance that the 1st string in the field is framed in '"'s, e.g.:
Code:
MOVE +1    TO SPOS
IF SEND-FLD(1:1) = '"'
   MOVE +2 TO SPOS
END-IF
UNSTRING SEND-FLD(SPOS:) .....
BTW, take a close look at your I/P. I didn't see any reference to negative values (maybe there are none). If your data does contain neg values, that can be problematic using the de-edit methodology you propose. I'm not sure if a pos value requires a space before the value as a neg value requires a minus sign (-).

Point 2: IICR, some fields showing values like 123.12 also show a value of 200 or 10, etc. I don't know if that value will de-edit properly using the same PIC as 123.12.

Regards, Jack.
 
Jack,

On RM/COBOL the de-edit is quite generous and actually does what one expects. Here is the quote from the RM/COBOL Language Reference Manual (MOVE Statement):
The implied conversion deletes all characters other than the decimal digits 0, 1, . . . 9, sets the operational sign negative if a minus sign is present in the sending item or positive otherwise, and sets the scale according to the rightmost decimal point present in the sending item or to the scale of the sending data item otherwise. The representation of the decimal point used in this conversion is a period unless the DECIMAL POINT IS COMMA clause is specified in the source program, in which case a comma is used. In this conversion, any decimal digit 0 that matches an inserted character 0 in the sending item is excluded from the resulting unedited numeric value.

Tom Morrison
 
Hi,

I should have mentioned this in my previous post.

I'm not sure how the delim search is conducted, so it might be prudent to order them by descending length, e.g.:

DELIMITED BY ',"$' OR ',"' OR OR '",' OR ','

Jack
 
Jack,

The standard requires that the delimiter are tested in left-to-right order. It is important, therefore, that one test for [tt]'",'[/tt] before (to the left of) [tt]'"'[/tt]. If the order were reversed, one would never find the longer delimiter.

Tom Morrison
 
Thanks all.

Here's what I ended up using..

Code:
Procedure division.
..
..
           PERFORM UNTIL EOF-CPFILE
             MOVE SPACES TO WS-FILE-RECORD
             READ CSV-FILE NEXT RECORD
               AT END
                 IF DISPLAY-SW = "Y"
debug            DISPLAY "DONE WITH RAW DATA..."
                 END-IF
                 MOVE "Y" TO SW-EOF-CPFILE
.....
               NOT AT END
                MOVE "N" TO SW-EOF-CPFILE
                 MOVE FILE-RECORD TO WS-FILE-RECORD
                 MOVE 0 TO DASH-CTR, QUOTE-CTR
                 INSPECT WS-FILE-RECORD TALLYING DASH-CTR FOR ALL "-"
                 INSPECT WS-FILE-RECORD TALLYING QUOTE-CTR FOR ALL '"'
debug            DISPLAY "DASH CTR: " DASH-CTR
debug            DISPLAY "QUOTE CTR: " QUOTE-CTR
debug *******    ACCEPT MENU-PROMPT
                 IF DASH-CTR > 1
                    PERFORM SCRUB-QUOTE-FIELDS
                 ELSE
                    DISPLAY "not a trans rec - no ssn dashes!"
                 END-IF
             END-READ
           END-PERFORM.
-----------------------------------------     
       SCRUB-QUOTE-FIELDS.
           MOVE 0 TO WS-CTR-A, WS-SPACE-CTR,
                               WS-WORD-START,
                               CHAR-CTR,
                               FIELD-COUNT.
           MOVE "N" TO SW-EOF-RECORD-SCRUB.

           INSPECT WS-FILE-RECORD TALLYING WS-SPACE-CTR
                   FOR LEADING SPACES.
           ADD 1 TO WS-SPACE-CTR GIVING WS-WORD-START.
           PERFORM UNTIL EOF-RECORD-SCRUB
             ADD 1 TO FIELD-COUNT
             IF WS-FILE-RECORD(WS-WORD-START:1) NOT = '"'
              INSPECT WS-FILE-RECORD(WS-WORD-START:) TALLYING CHAR-CTR
                     FOR CHARACTERS BEFORE INITIAL ","
debug ********DISPLAY "FIELD: " WS-FILE-RECORD(WS-WORD-START:CHAR-CTR)
              PERFORM MOVE-TO-WS-FIELD
              ADD CHAR-CTR TO WS-WORD-START
              ADD 1 TO WS-WORD-START
              MOVE 0 TO CHAR-CTR
             ELSE
debug ********DISPLAY "First char is quote!"
              ADD 1 TO WS-WORD-START
              INSPECT WS-FILE-RECORD(WS-WORD-START:) TALLYING CHAR-CTR
                     FOR CHARACTERS BEFORE INITIAL '"'
debug ********DISPLAY "FIELD: " WS-FILE-RECORD(WS-WORD-START:CHAR-CTR)
              PERFORM MOVE-TO-WS-FIELD
              ADD CHAR-CTR TO WS-WORD-START
              ADD 2 TO WS-WORD-START
              MOVE 0 TO CHAR-CTR
             END-IF
             IF FIELD-COUNT = 5
               MOVE "Y" TO SW-EOF-RECORD-SCRUB
               DISPLAY "NAME: " WS-CSV-NAME
               DISPLAY "SSN: " WS-CSV-SSN

               DISPLAY "RTN: " WS-CSV-RTN
               DISPLAY "ACCT: " WS-CSV-ACCT
               DISPLAY "AMT: " WS-CSV-AMOUNT
             END-IF
           END-PERFORM.
debug *****ACCEPT MENU-PROMPT.
  ------------------------------------------------
       MOVE-TO-WS-FIELD.
           IF FIELD-COUNT = 1
             MOVE WS-FILE-RECORD(WS-WORD-START:CHAR-CTR) TO
                                                  WS-CSV-NAME
           END-IF.
           IF FIELD-COUNT = 2
             MOVE WS-FILE-RECORD(WS-WORD-START:CHAR-CTR) TO
                                                  WS-CSV-SSN
           END-IF.
           IF FIELD-COUNT = 3
             MOVE WS-FILE-RECORD(WS-WORD-START:CHAR-CTR) TO
                                                  WS-CSV-RTN
           END-IF.
           IF FIELD-COUNT = 4
             MOVE WS-FILE-RECORD(WS-WORD-START:CHAR-CTR) TO
                                                  WS-CSV-ACCT
           END-IF.
           IF FIELD-COUNT = 5
             PERFORM SCRUB-AMOUNT-FIELD
             MOVE WS-FILE-RECORD(WS-WORD-START:CHAR-CTR) TO
                                                  WS-CSV-AMOUNT
             MOVE WS-CSV-AMOUNT TO NUMERIC-INTERMEDIATE-STEP
             MOVE NUMERIC-INTERMEDIATE-EDIT TO WS-PAY-AMT-RFMT-B
             MOVE WS-PAY-AMT-RFMT-B TO WS-PAY-AMT-RFMT-C
newww        PERFORM GET-EMPDB-RECORD
           END-IF.

---------------------------------------------------
       SCRUB-AMOUNT-FIELD.
           MOVE 0 TO DOLLAR-CTR.
debug *****DISPLAY "AMT BEFORE: " WS-FILE-RECORD(WS-WORD-START:CHAR-CTR).
           INSPECT WS-FILE-RECORD(WS-WORD-START:CHAR-CTR)
               REPLACING ALL '$' BY ' '.
debug *****DISPLAY "AMT AFTER: " WS-FILE-RECORD(WS-WORD-START:CHAR-CTR).
debug *****ACCEPT MENU-PROMPT.

Seems to work just fine.
My boss didn't like the idea of me using an Inspect to see if there were 2 "-" in the line somewhere as my first filter to determine if I was on a "record" line yet.
His argument is that they might not include "-" in the SSN field, but, hey where do you draw the line in the sand with the clients.
I mean, I can't cover every possible format scenario that the client might just randomly decide to change to, right? There has to be some sort of standardized format, otherwise nothing would work. I am correct on this? We can "what if" ourselves to death. For example, what if they decided to put the dollar amount column where the SSN column is and vice versa. The possibilities are endless.

Thanks again to everyone for their help.
-David
 
Hi
Tom your routine works good, expect that in text, should be QUOTE("), Excel out csv format make double quote ("").

It will be nice to add in that routine, that every "" at input will be as one quote , (but not null string,
for exampale: "i like ""COBOL""",""
OUTPUT: =>i like "COBOL"<=
=><=
Barry
 
Barry,
baruch said:
Excel out csv format make double quote ("").
I actually acknowledge this fact in an earlier post.
k5tm said:
Note that I have made the simplifying assumption that a quoted field contains no embedded quotes.

Regarding the use of the reserved word QUOTE: I feel this is a matter of personal preference. I code [tt]'"'[/tt] because it requires one less mapping in the code reader's head, and thereby promotes readability. I am aware this is a point of disagreeement. You can see from my first coding example that I used the reserved word QUOTE on an UNSTRING but in a later example dropped its use to make the compund delimiter more clear.

Tom Morrison
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top