
Checking for duplicate data


RICHINMINN

Programmer
Dec 31, 2001
Does anyone have a recommendation on how to check an input file for duplicate data? I've got a file that caused a ton of problems this past week by duplicating the correct set of records six times. (The company sending the file was having FTP problems and ended up sending the file six times: the first five copies were 99.6% complete, 281,010 records out of the full 282,186, followed by one complete copy.) The resulting file contained 1,687,236 records. The file is fixed length, with a record length of 320 bytes.

(This is on a large, corporate Amdahl system, with tons of storage, running IBM COBOL II on OS/390.)

Question:
How can I pre-process this file to ensure that if any duplicate data is sent, I can bypass the duplicate records? There are no handy fields that are unique to each record. I was wondering about generating a checksum for each record, then writing that checksum value out to a VSAM file. If I encountered a duplicate record, it would generate an identical checksum value, which would show up as already having been written to the VSAM file.

Does anyone have a sample of such a checksum algorithm? Or any other ideas?
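
A minimal sketch of the VSAM side of the approach described above, under a few assumptions: a hypothetical KSDS (DD name CHKVSAM) has already been defined and primed with at least one record, its key is a 16-byte checksum, and WS-CHECKSUM has been filled in by whatever checksum routine is eventually chosen. VSAM file status '22' (duplicate key) on the WRITE is what flags a record as already seen.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. CHKDUP.
      *    SKETCH ONLY.  ASSUMES A KSDS (DD NAME CHKVSAM) ALREADY
      *    DEFINED AND PRIMED, WITH A 16-BYTE CHECKSUM AS ITS KEY,
      *    AND A WS-CHECKSUM VALUE COMPUTED BY SOME EARLIER ROUTINE.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT CHECKSUM-FILE ASSIGN TO CHKVSAM
               ORGANIZATION IS INDEXED
               ACCESS MODE IS RANDOM
               RECORD KEY IS CHK-KEY
               FILE STATUS IS CHK-STATUS.
       DATA DIVISION.
       FILE SECTION.
       FD  CHECKSUM-FILE.
       01  CHK-RECORD.
           05  CHK-KEY                 PIC X(16).
       WORKING-STORAGE SECTION.
       01  CHK-STATUS                  PIC XX.
       01  WS-CHECKSUM                 PIC 9(16) VALUE 1234567890123456.
       01  WS-DUPLICATE-SW             PIC X     VALUE 'N'.
           88  RECORD-IS-DUPLICATE               VALUE 'Y'.
       PROCEDURE DIVISION.
       MAIN-PARA.
           OPEN I-O CHECKSUM-FILE
           PERFORM CHECK-FOR-DUPLICATE
           IF RECORD-IS-DUPLICATE
               DISPLAY 'DUPLICATE RECORD - BYPASSED'
           ELSE
               DISPLAY 'NEW RECORD - CHECKSUM LOGGED'
           END-IF
           CLOSE CHECKSUM-FILE
           GOBACK.
       CHECK-FOR-DUPLICATE.
      *    TRY TO REGISTER THE CHECKSUM.  FILE STATUS '22' (DUPLICATE
      *    KEY) MEANS THIS CHECKSUM IS ALREADY ON THE KSDS, I.E. THE
      *    RECORD HAS BEEN SEEN BEFORE.
           MOVE 'N' TO WS-DUPLICATE-SW
           MOVE WS-CHECKSUM TO CHK-KEY
           WRITE CHK-RECORD
           EVALUATE CHK-STATUS
               WHEN '00'
                   CONTINUE
               WHEN '22'
                   SET RECORD-IS-DUPLICATE TO TRUE
               WHEN OTHER
                   DISPLAY 'VSAM ERROR - STATUS ' CHK-STATUS
           END-EVALUATE.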

Rich (in Minn.)
 

Why not do a sort on the records using the whole record as the key and tell it to remove the duplicate records?

DFSORT is easy to set up to do this.
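
For reference, the setup is only a couple of control statements. The JCL below is just a sketch (the dataset names are placeholders, and it assumes the 320-byte fixed-length records described above); SUM FIELDS=NONE is what tells DFSORT to keep one record from each set of duplicates.

//DEDUP    EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DISP=SHR,DSN=YOUR.INPUT.CLAIMS
//SORTOUT  DD DSN=YOUR.CLAIMS.NODUPS,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(50,50),RLSE),
//            DCB=(RECFM=FB,LRECL=320,BLKSIZE=0)
//SYSIN    DD *
  SORT FIELDS=(1,320,CH,A)
  SUM FIELDS=NONE
/*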
 
kkitt,
I had thought of that, but the data records are "grouped" within the file. This is HIPAA health insurance claims information, with each claim spanning 10 or more records. (The file I'm working with has 282,186 records, covering 18,586 claims.)

If I were to sort it as you described, I WOULD be able to get rid of the duplicates, but then I would have the first line of all 18,586 claims together, followed by the second line of all 18,586 claims together, and so on, with no way to sort them to get them back in the original sequence. (This process is the front end of the claim processing sequence, where a unique claim number is assigned to each claim.)

Rich (in Minn.)
 
Rich,
are you saying that there are no real key fields on the records that identify each one as belonging (or not) to a particular claim?
Marc
 
If this is an 837 claim file, each transmission begins with an ISA segment and ends with an IEA segment. They both contain an interchange control number that is normally unique to each exchange. The IEA also contains a functional group count so that you can confirm the transmission contains all the data that was sent.

If your sender is not incrementing the interchange control number for each transmission, you'll have to look further to find duplicates, for example, the date/time stamp in the GS functional header.
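
A sketch of that kind of check, assuming the file really is X12 and that the interchange arrives with the fixed-length (106-byte) ISA segment at the front of the first record; if memory serves, ISA13, the interchange control number, sits in positions 90-98 of that segment. The sketch only displays the number; in practice it would be matched against a control dataset of interchanges already posted.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. ISACHK.
      *    SKETCH ONLY.  ASSUMES AN X12 FILE WHOSE FIRST RECORD BEGINS
      *    WITH THE FIXED-LENGTH (106-BYTE) ISA SEGMENT, WITH ISA13,
      *    THE INTERCHANGE CONTROL NUMBER, IN POSITIONS 90-98.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT CLAIM-FILE ASSIGN TO CLAIMIN
               FILE STATUS IS CLAIM-STATUS.
       DATA DIVISION.
       FILE SECTION.
       FD  CLAIM-FILE
           RECORDING MODE IS F.
       01  CLAIM-RECORD                PIC X(320).
       WORKING-STORAGE SECTION.
       01  CLAIM-STATUS                PIC XX.
       01  WS-EOF-SW                   PIC X VALUE 'N'.
           88  END-OF-FILE                   VALUE 'Y'.
       01  WS-ISA13                    PIC X(9).
       PROCEDURE DIVISION.
       MAIN-PARA.
           OPEN INPUT CLAIM-FILE
           READ CLAIM-FILE
               AT END SET END-OF-FILE TO TRUE
           END-READ
           IF END-OF-FILE
               DISPLAY 'INPUT FILE IS EMPTY'
           ELSE
               IF CLAIM-RECORD(1:3) = 'ISA'
      *            IN PRACTICE THIS VALUE WOULD BE CHECKED AGAINST A
      *            CONTROL DATASET OF INTERCHANGES ALREADY POSTED.
                   MOVE CLAIM-RECORD(90:9) TO WS-ISA13
                   DISPLAY 'INTERCHANGE CONTROL NUMBER: ' WS-ISA13
               ELSE
                   DISPLAY 'FIRST RECORD IS NOT AN ISA SEGMENT'
               END-IF
           END-IF
           CLOSE CLAIM-FILE
           GOBACK.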

If you don't have a copy of the ANSI 837 Implementation Guide, you should get one!! They are available for download from several web sites. The official site is Washington Publishing Co.
Glenn
Brainbench MVP for COBOL II
 
Rich,

Your proposed technique should work quite well if you use an algorithm that does a good job of generating a unique value for each input, i.e. the generated value will be identical only for identical input 99.99999999% of the time. A normal checksum cannot do this.

There are several algorithms that will do the job.

If your compiler does not support bitwise logical operations (logical XOR, for example), you can get the job done using table lookup. It won't be a performance giant, but it will get the job done in a one-off. I have some code I can get you if you need this.

Tom Morrison
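
Along those lines, one arithmetic-only possibility (a sketch, not the table-lookup code Tom mentions): a simple multiplicative hash over the 320-byte record, reduced modulo a prime. The byte-value trick relies on the big-endian layout of S/390 binary fields, and a single hash this small will produce some false matches on a file of 1.6 million records, so in practice you would widen it or pair two independent hashes.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. RECHASH.
      *    SKETCH ONLY.  A MULTIPLICATIVE HASH OVER A 320-BYTE RECORD
      *    USING ARITHMETIC ONLY (NO BITWISE OPERATIONS).  THE LOW
      *    BYTE OF A BINARY HALFWORD IS REDEFINED TO PICK UP EACH
      *    CHARACTER'S EBCDIC VALUE - THIS RELIES ON THE BIG-ENDIAN
      *    LAYOUT OF S/390 BINARY FIELDS.
       ENVIRONMENT DIVISION.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-RECORD                   PIC X(320).
       01  WS-BIN                      PIC 9(4)  COMP.
       01  WS-BIN-X REDEFINES WS-BIN.
           05  FILLER                  PIC X.
           05  WS-BIN-LOW              PIC X.
       01  WS-HASH                     PIC 9(12) COMP-3 VALUE 0.
       01  WS-WORK                     PIC 9(12) COMP-3 VALUE 0.
       01  WS-QUOT                     PIC 9(12) COMP-3 VALUE 0.
       01  WS-PRIME                    PIC 9(9)  COMP-3 VALUE 999999937.
       01  I                           PIC 9(4)  COMP.
       PROCEDURE DIVISION.
       MAIN-PARA.
      *    NORMALLY WS-RECORD WOULD BE THE RECORD JUST READ FROM THE
      *    INPUT FILE; A CONSTANT IS USED HERE SO THE SKETCH RUNS
      *    STAND-ALONE.
           MOVE ALL 'CLAIM DATA ' TO WS-RECORD
           PERFORM HASH-RECORD
           DISPLAY 'HASH VALUE: ' WS-HASH
           GOBACK.
       HASH-RECORD.
           MOVE 0 TO WS-HASH
           PERFORM VARYING I FROM 1 BY 1 UNTIL I > 320
               MOVE 0 TO WS-BIN
               MOVE WS-RECORD(I:1) TO WS-BIN-LOW
      *        HASH = (HASH * 31 + BYTE-VALUE) MOD PRIME
               COMPUTE WS-WORK = WS-HASH * 31 + WS-BIN
               DIVIDE WS-WORK BY WS-PRIME
                   GIVING WS-QUOT REMAINDER WS-HASH
           END-PERFORM.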
 
Hi,
Nobody seems to have asked the question, "How did the individual files get grouped together? Why didn't each successive transmission overwrite the previous file?" I would think the first (header) record of each file could be logged in a control file to see whether the batch had already been processed before continuing.
 
Rich,

Given that you are on a mainframe, you might already have a library routine for the "Secure Hash Algorithm", which would work, I think. The standard SHA produces a 160-bit message digest for any message (e.g. a COBOL data record) as long as the message is shorter than 2^64 bits. You would use the resulting message digest value as you are proposing to use a checksum.

For more info, Google "secure hash algorithm".

Tom Morrison
 
mrregan -

It's not terribly unusual for ftp to be set to append for transmissions like these. It allows files to accumulate on the receiving system over the weekend, for instance, and then be processed together.

I think if Rich takes a close look at the file content, preventing duplicate processing will be simple and straightforward.

Glenn
 
I think the key to solving this problem, or part of it, has already been declared by Rich himself:

"(This process is the front end of the claim processing sequence, where a unique claim number is assigned to each claim.)"

Rich, we need your help here. It seems the front-end process is not equipped to deal with duplicates. It also seems that some process "knows" how to group these records: "where a unique claim number is assigned to each claim".

Can you forget about 'un-duplicating' these records for now and go ahead with the claim number assignment? That would surely "group" your records and make the 'un-duplication' quite easy, wouldn't it? Granted, you may have to create a temporary data set in order to sort this out.

Dimandja
 
I'm sure Rich's management doesn't want to hear this, but something as simple as a header/trailer with a record count and some amount total or hash totals, as someone mentioned, could put this problem to bed.
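
Purely as a sketch of that idea (every field name and size here is made up), a trailer record carrying the batch record count and an amount hash total, checked against the values accumulated while reading the detail records:

      *    SKETCH ONLY - HYPOTHETICAL 320-BYTE TRAILER LAYOUT.
       01  TRAILER-RECORD.
           05  TLR-ID                  PIC X(3).
           05  TLR-RECORD-COUNT        PIC 9(9).
           05  TLR-AMOUNT-TOTAL        PIC 9(11)V99.
           05  FILLER                  PIC X(295).

       01  WS-READ-COUNT               PIC 9(9)     VALUE 0.
       01  WS-AMOUNT-ACCUM             PIC 9(11)V99 VALUE 0.

      *    PERFORMED ONCE THE DETAIL RECORDS HAVE BEEN COUNTED AND
      *    THEIR AMOUNTS ACCUMULATED.
       CHECK-TRAILER.
           IF TLR-RECORD-COUNT NOT = WS-READ-COUNT
              OR TLR-AMOUNT-TOTAL NOT = WS-AMOUNT-ACCUM
               DISPLAY 'TRANSMISSION OUT OF BALANCE - BATCH REJECTED'
               MOVE 16 TO RETURN-CODE
           END-IF.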

I'm amazed at the number and severity of production problems, not to mention lost sleep, caused by bad transmissions, most of which stem from a lack of transmission verification.

Jack
 
Jack -

"This is HIPAA health insurance claims information".

If this is truly HIPAA-compliant claims information, it follows the X12 837 standard. As you might imagine, all the X12 EDI standards include adequate controls to prevent duplicate transmissions, lost records, etc, etc. (See my initial post.)

Seems like a simple thing to me.

Glenn
 
Hi Glenn,

I am still going through the link you provided, looking for transmission-related clues.

But if FTP is used, how would FTP prevent duplicate transmissions? It looks like FTP failed, or the sending/receiving resources failed, while data was being appended to the same file.

As for the ISA segment terminator, this could be represented by a newline character, for example. Depending on the options selected for FTP, some of these special characters could be lost or hard to distinguish.

I think Rich needs to give us more information on this. For instance, what is the transmission protocol? Is it used or ignored on the sending and receiving ends? Does the 'front end' application look for these segments? So many questions, but without Rich's answers, I don't see how we can provide him with solutions.

Dimandja
 
D -

I agree that Rich has been strangely silent on this issue.

I think the issue is not preventing duplicate transmissions; you probably can't do that. You should, however, be able to (a) confirm the transmission you received was complete and (b) prevent posting a complete transmission that is a duplicate of another, already posted, transmission. (Note that in Rich's case, he has multiple transmissions in a single file. That shouldn't matter since you can separate the transmissions by ISA/IEA pairs for processing.)

I believe that the newline character is not an allowable character in an X12 transmission. IIRC, there is a restricted set of possible terminator characters. The most common segment terminator is tilde (~).

Glenn
 
Sorry for not having responded before now, but this issue got put on the back burner in favor of other, more immediate problems.

By the way, these duplicate files that I experienced are FTPed to the mainframe, where they each create a new generation of a GDG dataset. When my process runs, it "grabs" all generations of the dataset and proceeds from there. This does allow for the possibility of multiple files being FTPed before my process can run using them as input, as 3gm surmised.

The input file, at this point in my process, has already had claim numbers assigned to each group of records comprising a single claim. It's just that I was ending up with the same claim being processed 6 times, under 6 different claim numbers, because each claim was found 6 times in the file, with claim numbers simply assigned sequentially to the claims.

I've been experimenting with taking the pertinent information from each claim "envelope", starting with the CA0 record and continuing through the XA0 record, and building a "key" from subscriber number, date of claim, diagnosis, etc., then using those fields as the key of a VSAM file to determine whether a later claim is a duplicate. My analysis of this approach is still incomplete.
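
A rough sketch of that kind of key (every field name here is hypothetical; the real fields would come from wherever they actually live in the CA0 through XA0 records). The group item becomes the RECORD KEY of the duplicate-check KSDS, and the same WRITE / file status '22' test sketched earlier in the thread says whether an identical claim has already been posted.

      *    SKETCH ONLY - HYPOTHETICAL FIELD NAMES AND SIZES.
       01  CLAIM-DUP-RECORD.
           05  CLAIM-DUP-KEY.
               10  CDK-SUBSCRIBER-NO   PIC X(12).
               10  CDK-CLAIM-DATE      PIC X(8).
               10  CDK-PRIMARY-DIAG    PIC X(6).
               10  CDK-TOTAL-CHARGE    PIC 9(9)V99.

       BUILD-CLAIM-KEY.
      *    THE CA0-/XA0- FIELDS STAND IN FOR WHEREVER THESE VALUES
      *    ACTUALLY LIVE IN THE CLAIM ENVELOPE.
           MOVE CA0-SUBSCRIBER-NO    TO CDK-SUBSCRIBER-NO
           MOVE CA0-CLAIM-DATE       TO CDK-CLAIM-DATE
           MOVE CA0-PRIMARY-DIAG     TO CDK-PRIMARY-DIAG
           MOVE XA0-TOTAL-CHARGE     TO CDK-TOTAL-CHARGE
           WRITE CLAIM-DUP-RECORD
      *    FILE STATUS '22' HERE MEANS A CLAIM WITH THE SAME
      *    SUBSCRIBER, DATE, DIAGNOSIS AND CHARGE WAS ALREADY POSTED.
           IF CLAIM-DUP-STATUS = '22'
               PERFORM BYPASS-DUPLICATE-CLAIM
           END-IF.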

Thanks for everyone's input to this problem. When I get it resolved (which I will!) I'll let you know what I did.

Rich (in Minn.)
 
Rich -

If you're dealing with CA0 records etc., you're NOT dealing with HIPAA-compliant data. You're likely dealing with what's called a National Standard Format (NSF) Version 3.01 file (or perhaps an older version) for submission of professional (HCFA 1500) claims.

The AA0 segment contains a unique submission number that can be used to eliminate duplicate transmissions. See:
Glenn
 