Removing duplicate fields from one file and placing into another

jpor (Technical User)

Hi All,

I have done a search on this subject and found many a question and answer, but this one is a slightly different issue.

I need to check the 18th field (';' is the separator), remove any lines whose 18th-field value is duplicated, and move them to another file. Here is some example data:

10000;105;CHEAPSIDE 105;2005-03-13 09:21:09;Mr;Jon;Smith;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;AG MYV65VODA P BLUE;N;cond-des;VODAFONE;1;;;;;;;7;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10001;106;BLACKPOOL 106;2005-03-13 09:21:09;Mrs;Anne;Jones;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888889;SAG MYX2 ORG P BLUE;G;cond-des;VODAFONE;;;;;;;26;17;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10002;108;SCARBOROUGH 108;2005-03-13 09:21:09;Mr;Jack;Jones;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888810;AIO SAGEM 820 TMOBIL;P;cond-des;VODAFONE;7;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10003;109;DARLINGTON 109;2005-03-13 09:21:09;Mr;Ian;Beard;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888811;ALCATEL 302 SIL A/C;N;cond-des;VODAFONE;;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10004;111;ALDERSHOT 111;2005-03-13 09:21:09;Mrs;Louise;Hogg;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888812;ALCATEL 302 SIL A/C;G;cond-des;VODAFONE;99;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10005;112;ROTHERHAM 112;2005-03-13 09:21:09;Mr;Adam;Shenton;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888813;SONY J70 T MOBILE;G;cond-des;VODAFONE;5;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10006;113;GLASGOW 113;2005-03-13 09:21:09;Mrs;Leah;McDonald;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888814;SIEM A55 VIRG BUNDLE;G;cond-des;O2;1;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10007;114;MILTON KEYNES 114;2005-03-13 09:21:09;Mr;John;Smith;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888815;SIE MC60 TMOB P;G;cond-des;O2;26;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10008;115;LIVERPOOL PARKER 115;2005-03-13 09:21:09;Mr;John;Trite;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888816;SAM A300 GSM C LGHT;P;cond-des;O2;1;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10009;116;CARLISLE 116 116;2005-03-13 09:21:09;Mr;Brian;Bungled;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888817;PAN GD87 ORG C;P;cond-des;O2;;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10010;117;HOUNSLOW 117;2005-03-13 09:21:09;Miss;Karen;Brown;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888818;PAN X300 GSM C SILV;P;cond-des;O2;;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10011;118;EDINBURGH 118;2005-03-13 09:21:09;Mr;Norman;Wisdom;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888819;PAN X300 GSM C SILV;N;cond-des;3;17;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10012;119;KIRKCALDY 119;2005-03-13 09:21:09;Miss;Julie;Jacobs;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888820;PAN X300 GSM C SILV;N;cond-des;3;7;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10013;121;CUMBERNAULD 121;2005-03-13 09:21:09;Mr;John;Terry;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888823;NOK 3200 VIRG P CLR;N;cond-des;3;17;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10014;122;CHATHAM 122;2005-03-13 09:21:09;Mr;Jack;Kahn;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888822;NOK 8310 GSM C LGHT;N;cond-des;O2;5;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10015;123;SOUTHPORT 123;2005-03-13 09:21:09;Mr;Liam;Neeson;45;The Hollow;Cheapside;;GTX 1BB;;;;;;IMEI99;NOK 8310 GSM C LGHT;;cond-des;O2;4;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10016;124;DERBY EAST ST 124;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888822;NOK 8310 GSM C HOT;F;cond-des;O2;9;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10017;125;DERBY EAST ST 125;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888823;NOK 8310 GSM C HOT;N;cond-des;VODAFONE;7;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10018;126;DERBY EAST ST 126;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888825;NOK 8310 GSM C HOT;G;cond-des;VODAFONE;6;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10019;127;DERBY EAST ST 127;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;NOK 8310 GSM C HOT;P;cond-des;VODAFONE;5;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10020;128;DERBY EAST ST 128;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;NOK 8310 GSM C HOT;N;cond-des;VODAFONE;4;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10021;129;DERBY EAST ST 129;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;NOK 8310 GSM C HOT;G;cond-des;VODAFONE;3;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;

For example, I have 2 lots of "357888888888823" and 3 lots of "357888888888888". I need these lines stripped out and transferred to another file.
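To show the duplication concretely, the repeated keys can be listed straight from the file. A rough sketch, assuming the file is called file.orig and that ';' never appears inside a field:
Code:
cut -d';' -f18 file.orig | sort | uniq -c | awk '$1 > 1'   # each repeated 18th-field value with its count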

Any ideas?

Thanks in advance.

( "To become Wise, first you must ask Questions")
 
Do you need all of the duplicate lines stripped out, or should one stay in the original file, with only the duplicates transferred to another file?
 
I need all of the duplicate lines stripped out. The duplicates file will then act as a pointer for the people who have FTP'ed the file, so they can decide which line should have been included in the first place.

Thanks.



( "To become Wise, first you must ask Questions")
 
To recap: yes, I need to remove any line whose 18th field is duplicated and have it transferred to another file.

For example:

These 3 lines have the same 18th field:

10019;127;DERBY EAST ST 127;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;NOK 8310 GSM C HOT;P;cond-des;VODAFONE;5;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10020;128;DERBY EAST ST 128;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;NOK 8310 GSM C HOT;N;cond-des;VODAFONE;4;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10021;129;DERBY EAST ST 129;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;NOK 8310 GSM C HOT;G;cond-des;VODAFONE;3;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;

I need those 3 lines taken out of the original file and moved over to another file.

I hope this clarifies things?


( "To become Wise, first you must ask Questions")
 
Maybe this will help:
Code:
#!/bin/ksh

# Start with a clean duplicates file and a working copy of the input.
rm -f file1.dup
cp file1.txt file1.new

# Sort on the 18th ';'-separated field, then count how many extra
# occurrences each key has (0 means the key is unique).
sort -t';' -k 18 file1.txt |\
awk 'BEGIN{FS=";";k="?";rec=0}
{if (k==$18) rec+=1;
 else
 {if (k!="?") print k,rec;
  rec=0;}
k=$18;
} END {print k,rec;}' - |\
while read key n
do
 echo "Key read: $key:$n"
 if [ "$n" -ne 0 ]
 then
   # Duplicated key: move every line carrying it to the duplicates
   # file and keep the rest in the working copy.
   echo "Processing: $key:$n"
   awk 'BEGIN{FS=";"}{if ($18==k) print $0;}' k=$key file1.new >>file1.dup
   awk 'BEGIN{FS=";"}{if ($18!=k) print $0;}' k=$key file1.new >>file1.tmp
   mv file1.tmp file1.new
 fi
done
PS: There is a way to write to different files from within awk, but I don't have the manual available right now.
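From memory, a minimal single-pass sketch of that idea (untested; assuming nawk/gawk, that the whole file fits in memory, and illustrative output names file1.dup and file1.uniq; the groups come out in no particular key order, not the original line order):
Code:
awk -F';' '
{ lines[$18] = lines[$18] $0 "\n"; count[$18]++ }       # buffer every line under its 18th-field key
END {
  for (k in count) {
    out = (count[k] > 1) ? "file1.dup" : "file1.uniq"   # duplicated keys go to the dup file
    printf("%s", lines[k]) > out
  }
}' file1.txt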
Good luck! [noevil]

----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Assuming your original file's name is [tt]file.orig[/tt], you could try this...
Code:
#!/bin/ksh

# Remove dupes based on 18th field
sort -u -t ';' -k 18,18n file.orig | sort > file.nodupes

# Sort the original file for the following comm
sort file.orig > file.sorted

# Find records in sorted that aren't in nodupes (these are the dupes)
comm -3 file.sorted file.nodupes | sort -t ';' -k 18,18n > file.thedupes

# Remove unneeded files
rm file.sorted
That'll give you [tt]file.nodupes[/tt] which doesn't have your dupe records, and [tt]file.thedupes[/tt] which ONLY has the duplicate records.
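In case comm is unfamiliar: with -3 it suppresses the lines common to both (sorted) inputs, so what's left over is exactly what sort -u threw away. A throwaway illustration on made-up keys:
Code:
$ printf 'a\nb\nb\nc\n' | sort > all
$ sort -u all > kept
$ comm -3 all kept
b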

Hope this helps.
 
Using awk...
Code:
# 1st pass (NR==FNR): count each 18th-field value.  2nd pass: send each
# record to outfile.thedupes if its key occurred more than once, else to outfile.nodupes.
awk -F\; 'NR==FNR {arr[$18]++; next}
  { print > ("outfile." (arr[$18]>1 ? "thedupes" : "nodupes"))
  }' infile infile
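A quick sanity check after running it, assuming the outfile.* names above:
Code:
wc -l infile outfile.thedupes outfile.nodupes   # the two outfile counts should add up to infile's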
 
Thanks guys. Will give your recommendations a go.


( "To become Wise, first you must ask Questions")
 
Okay, I have tried all three examples presented to me. They work, BUT what I am after is this: say I have 2 lines with the 18th field showing the following:

10020;128;DERBY EAST ST 128;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;NOK 8310 GSM C HOT;N;cond-des;VODAFONE;4;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10021;129;DERBY EAST ST 129;2005-03-13 09:21:09;Ms.;Sara;Sandringham;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888888;NOK 8310 GSM C HOT;G;cond-des;VODAFONE;3;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;

Where I have this situation I need both lines removed from the primary file and moved to the duplicate file.
At the moment the example code only removes the duplicate line(s) and NOT the original line they duplicate (if that makes sense?).
So where I have 2 lots of "357888888888888" I need both of these lines removed from the original file and placed in the duplicate file.
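In tool terms, the split needs to be driven by how many times each key occurs, not by which line happens to come first. Something like this rough sketch would express it (assuming the original is file.orig), though it matches the key anywhere on the line rather than only in the 18th field, which is why I'd like to do it properly:
Code:
cut -d';' -f18 file.orig | sort | uniq -d > dupkeys      # every key that occurs 2+ times
grep -F -f dupkeys file.orig    > file.thedupes          # all lines belonging to a duplicated key
grep -F -v -f dupkeys file.orig > file.nodupes           # everything else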
There is method to this madness believe me.

Anyone know how I could achieve this?

Sorry if I sound ungrateful; I may not have explained the situation properly in the first place.

Thanks for your patience.




( "To become Wise, first you must ask Questions")
 
My solution may not be optimal, but it does remove ALL duplicate records; check it out:
Code:
#!/bin/ksh
rm -f file1.dup
cp file1.txt file1.new
sort -t';' -k 18 file1.txt |\
awk 'BEGIN{FS=";";k="?";rec=0}
{if (k==$18) rec+=1;
 else
 {if (k!="?") print k,rec;
  rec=0;}
k=$18;
} END {print k,rec;}' - |\
while read key n
do
 echo "Key read: $key:$n"
 if [ "$n" -ne 0 ]
 then
   echo "Processing: $key:$n"
   awk -F\; '{if ($18==k) print $0;}' k=$key file1.new >>file1.dup
   awk -F\; '{if ($18!=k) print $0;}' k=$key file1.new >>file1.tmp
   mv file1.tmp file1.new
 fi
done
[2thumbsup].


----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Hi LKBrwnDBA. I noticed the slight change in your code. I have now implemented it, but I still get the original line that was duplicated left in the original file.

For example:

Part of what's in the duplicated file after running the code:

10012;119;KIRKCALDY 119;2005-03-13 09:21:09;Miss;Julie;Jacobs;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888820;PAN X300 GSM C SILV;N;cond-des;3;7;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;
10013;121;CUMBERNAULD 121;2005-03-13 09:21:09;Mr;John;Terry;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888820;NOK 3200 VIRG P CLR;N;cond-des;3;17;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;

The line with the number "357888888888820" from the 18th field is still present in the original file afterwards:

10012;119;KIRKCALDY 119;2005-03-13 09:21:09;Miss;Julie;Jacobs;45;The Hollow;Cheapside;;GTX 1BB;;;;;;357888888888820;PAN X300 GSM C SILV;N;cond-des;3;7;;;;;;;;flt dsc;0;chrge cond;1003052S;12345;2005-03-14 13:14:15;

I need to find all duplicates, and when a duplication is found, both the duplicate AND the line it duplicated need REMOVING from the original file and placing into the duplicates file. So in other words, with "357888888888820" both of these lines need taking out, whether the value is duplicated or triplicated or quadruplicated etc., and moving into the duplicates file.

I hope that clears things?


( "To become Wise, first you must ask Questions")
 
How come? I get these results (no duplicates in the new file):
Code:
$ cat file1.dup|awk -F\; '{print $18}'
357888888888822
357888888888822
357888888888823
357888888888823
357888888888888
357888888888888
357888888888888
357888888888888
$ cat file1.new|awk -F\; '{print $18}'
357888888888889
357888888888810
357888888888811
357888888888812
357888888888813
357888888888814
357888888888815
357888888888816
357888888888817
357888888888818
357888888888819
357888888888820
IMEI99
357888888888825
$

----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Thanks LKBrwnDBA. It was me and how I was outputting the files. The script works fine.



( "To become Wise, first you must ask Questions")
 