Removing multiple records from a file

ryanc2 · Nov 3, 2006

I need to remove a lot of records (38k) from a file. I have a bad file containing the keys of all the records to remove, and the master file that needs cleaning.

I started using a while loop with grep -v, appending the records to a new file, but it just doesn't seem to be a logical way to do it.

Any help would be appreciated.

Thanks

segment · Nov 3, 2006

Can you be slightly more specific? You want to remove 38k records... Is every record on one line? Column? Is there anything specific you want removed? Any specific number of lines?

perl -e 'print $i=pack(c5,(40*2),sqrt(7600),(unpack(c,Q)-3+1+3+3-7),oct(104),10,oct(101));'

feherke · Nov 3, 2006

Hi

You mentioned a master file. Sounds like you want something like this :

Code:

# data file
master # cat first.txt
one line
two line
three line
four line
five line

# pattern file
master # cat second.txt
five
two
three

# keep only data matching a pattern
master # grep -f second.txt first.txt
two line
three line
five line

# remove data matching a pattern
master # grep -v -f second.txt first.txt
one line
four line

Feherke.

http://rootshell.be/~feherke/

segment · Nov 3, 2006

Why not something simpler:

sed '/dont_want_this_line/{d;}' file

Where dont_want_this_line is a string common to every line he doesn't want in the file:

[root@mybox ]# tail -n 5 /var/log/messages
Nov 3 13:06:58 mybox dhcpd: DHCPREQUEST for 192.168.162.249 (192.168.162.91) from 00:04:13:24:67:57 via eth1
Nov 3 13:06:58 mybox dhcpd: DHCPACK on 192.168.162.249 to 00:04:13:24:67:57 via eth1
Nov 3 13:08:02 mybox dhcpd: DHCPINFORM from 192.168.162.138 via eth1: not authoritative for subnet 192.168.162.0
Nov 3 13:08:05 mybox dhcpd: DHCPINFORM from 192.168.162.138 via eth1: not authoritative for subnet 192.168.162.0
Nov 3 13:10:41 mybox dhcpd: DHCPINFORM from 192.168.162.132 via eth1: not authoritative for subnet 192.168.162.0
[root@mybox ]# tail -n 5 /var/log/messages|sed '/INFORM/{d;}'
Nov 3 13:06:58 mybox dhcpd: DHCPREQUEST for 192.168.162.249 (192.168.162.91) from 00:04:13:24:67:57 via eth1
Nov 3 13:06:58 mybox dhcpd: DHCPACK on 192.168.162.249 to 00:04:13:24:67:57 via eth1

It's a vague question, hence my asking for more info.

perl -e 'print $i=pack(c5,(40*2),sqrt(7600),(unpack(c,Q)-3+1+3+3-7),oct(104),10,oct(101));'

segment · Nov 3, 2006

Anyhow... Thought you could fiddle with these...

Matching a pattern
sed -n '/dont_show_me_lines_with_this/!p'
sed '/dont_show_me_lines_with_this/d'
awk '!/dont_show_me_lines_with_this/'

Based on line numbers... (prints lines 1-2000)
sed -n '1,2000p' or sed '1,2000!d'
awk 'NR==1,NR==2000'

perl -e 'print $i=pack(c5,(40*2),sqrt(7600),(unpack(c,Q)-3+1+3+3-7),oct(104),10,oct(101));'

ryanc2 · Nov 3, 2006

Sorry, I normally assume people can read my mind when I speak.

More info:

Master file is one line per record, and each record begins with a unique 10-digit record number.

0000000001
0000000002

Bad file is a file containing all of the bad record numbers that need removing. So I basically need to remove the bad records from the master file, using the record numbers read from the bad file.

I understand most of your approaches except how to feed in the record keys from the bad file.

PHV · Nov 3, 2006

man comm

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

segment · Nov 3, 2006

diff goodfile badfile

Wow. comm, how underrated is that!

perl -e 'print $i=pack(c5,(40*2),sqrt(7600),(unpack(c,Q)-3+1+3+3-7),oct(104),10,oct(101));'

ryanc2 · Nov 3, 2006

Sorry again, but I guess I'm not being very clear. The bad file doesn't contain the entire record, just the key that identifies the bad record in the master file - so comm or diff would kick out everything.

master file: (19,435,954 records)
1234567891 some data more data even more data

bad file: (32,727 records)
1234567891

Thanks for the help.

segment · Nov 3, 2006

Can you show me:

1 line from good_file
1 line from bad_file

perl -e 'print $i=pack(c5,(40*2),sqrt(7600),(unpack(c,Q)-3+1+3+3-7),oct(104),10,oct(101));'

ryanc2 · Nov 3, 2006

Master File:

371110794 00607320060731A200017 YYYNN 007 61.07
371113005 00607320068991A200017 NNNNN 007 59.04

Bad File:

371110794
371110795
371110796
371110797
371110798
371110799

In this example, I need to remove the first record from master and not the second record.

p5wizard · Nov 3, 2006

grep -v -f bad_file master_file

but I would prepend every line in bad_file with a caret (^) so that it matches the key only in the beginning of a record in the master_file.

in order to do that:
vi bad_file
:1,$ s/^/\^/
:wq
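
The caret can also be added without opening an editor. A minimal sketch, assuming a grep that supports -f and using the sample records from this thread; the names bad_patterns and master_file.new, and the temporary directory, are made up for illustration:

```shell
# Sample data from the thread, written to a scratch directory (illustrative).
workdir=$(mktemp -d)
printf '%s\n' \
  '371110794 00607320060731A200017 YYYNN 007 61.07' \
  '371113005 00607320068991A200017 NNNNN 007 59.04' > "$workdir/master_file"
printf '%s\n' 371110794 371110795 371110796 > "$workdir/bad_file"

# Prefix every key with ^ so it only matches at the start of a record.
sed 's/^/^/' "$workdir/bad_file" > "$workdir/bad_patterns"

# Keep only the records that match none of the anchored patterns.
grep -v -f "$workdir/bad_patterns" "$workdir/master_file" > "$workdir/master_file.new"
cat "$workdir/master_file.new"
```

If one key can be the leading part of a longer key, the pattern may need a trailing delimiter as well, for example sed 's/.*/^& /', assuming a space always follows the key in the master file.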

HTH,

p5wizard

ryanc2 · Nov 3, 2006

sys_dir> grep -v -f bad_invoice_keys.txt master_invoice_file.dat > master_invoice_file.new
grep: illegal option -- f
Usage: grep -hblcnsviw pattern file . . .

What am I missing?

Annihilannic · Nov 3, 2006

Or this (no modifications required presuming the files are sorted):

join -1 1 -v 1 master badfile

Annihilannic.
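
join compares the first field of each file and expects both inputs to be sorted on that field in the same collation order. A sketch with the thread's sample records, deliberately unsorted to show the sorting step; the .sorted and .clean names are illustrative, and LC_ALL=C is an assumption made here so that sort and join agree on ordering:

```shell
# Sample data from the thread, in unsorted order (scratch directory is illustrative).
workdir=$(mktemp -d)
printf '%s\n' \
  '371113005 00607320068991A200017 NNNNN 007 59.04' \
  '371110794 00607320060731A200017 YYYNN 007 61.07' > "$workdir/master"
printf '%s\n' 371110799 371110794 371110795 > "$workdir/badfile"

# Sort both files on the key field.
LC_ALL=C sort -k1,1 "$workdir/master" > "$workdir/master.sorted"
LC_ALL=C sort "$workdir/badfile" > "$workdir/badfile.sorted"

# -v 1 prints only the lines of the first file whose key has no match in the second.
LC_ALL=C join -v 1 "$workdir/master.sorted" "$workdir/badfile.sorted" > "$workdir/master.clean"
cat "$workdir/master.clean"
```

Note that join writes its output fields separated by a single space, so records that rely on runs of blanks or fixed-width columns may come out reformatted.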

segment · Nov 3, 2006

/usr/xpg4/bin/grep ?

perl -e 'print $i=pack(c5,(40*2),sqrt(7600),(unpack(c,Q)-3+1+3+3-7),oct(104),10,oct(101));'

ryanc2 · Nov 3, 2006

Had to use egrep.

Thanks for the help.
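
Where neither grep -f nor egrep is an option, a keyed lookup in awk is another way to do the same job. This is not what was used in this thread; it is a sketch that assumes a POSIX awk, a non-empty bad file, and a key in the first whitespace-separated field. The file names and scratch directory are illustrative:

```shell
# Sample data from the thread (scratch directory is illustrative).
workdir=$(mktemp -d)
printf '%s\n' \
  '371110794 00607320060731A200017 YYYNN 007 61.07' \
  '371113005 00607320068991A200017 NNNNN 007 59.04' > "$workdir/master_file"
printf '%s\n' 371110794 371110795 371110796 371110797 371110798 371110799 \
  > "$workdir/bad_file"

# While reading the first file (NR == FNR), remember each bad key.
# While reading the second file, print records whose first field is not a remembered key.
awk 'NR == FNR { bad[$1]; next } !($1 in bad)' \
  "$workdir/bad_file" "$workdir/master_file" > "$workdir/master_file.new"
cat "$workdir/master_file.new"
```

Because the comparison is an exact match on the whole key, no caret or other anchoring is needed, and the master file is read only once.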
