
delete or extract columns by list of identifiers


flxms (Technical User)
Dec 1, 2009
Hi there,

I have a similar problem to that described in thread271-1497302, but unfortunately I haven't found a solution yet.

There is a very large file to process, consisting of 300000 columns and 1500 rows. About 20000 columns shall be deleted from that file. So it is clear that I can't do this by writing down all the columns in an awk command like $1, $x etc. As the columns are not next to each other, I can't define a range either.

The (distinct) identifiers of the columns that shall be removed are in a text file containing one column with 20000 identifiers (corresponding to the identifiers in the header/first line of the file to process).

An equivalent of this question is of course how to extract columns (instead of deleting them) according to a list of identifiers. But I didn't figure out how to do that either.

The task would probably be much easier to do after the columns have been transposed to rows. Unfortunately that did not work due to performance issues (file size about 1 GB).
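For illustration, a naive awk transpose along the following lines has to hold the entire file in memory (assuming all rows have the same number of columns; infile stands for the 1 GB file), which is presumably where it fails:
Code:
awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++){for(j=1;j<=NR;j++)printf"%s ",a[i,j];print""}}' infile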

Can anyone give me a hint how to do this with awk or a shell script?
I'd appreciate any kind of help very much!

Best regards, Felix
 
Hi

Something like this?
Code:
[blue]master #[/blue] cat Felix-head.txt
w
r
y
i

[blue]master #[/blue] cat Felix-data.txt 
q w e r t y u i o p
a s d f g h j k l ;
z x c v b n m , . /

[blue]master #[/blue] awk 'FNR==NR{h[$0]=1;next}FNR==1{for(i=1;i<=NF;i++)if($i in h)n[i]=1}{for(i=1;i<=NF;i++)if(i in n)printf"%s ",$i;print""}' Felix-head.txt Felix-data.txt
w r y i 
s f h k 
x v n ,
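For readability, the same program spread over multiple lines with comments (functionally identical):
Code:
awk '
# first file: remember each header name to keep
FNR==NR { h[$0]=1; next }
# first line of the data file: note the positions of the wanted columns
FNR==1  { for(i=1;i<=NF;i++) if ($i in h) n[i]=1 }
# every line of the data file (header included): print the noted columns
{ for(i=1;i<=NF;i++) if (i in n) printf "%s ",$i; print "" }
' Felix-head.txt Felix-data.txt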
Tested with [tt]gawk[/tt] and [tt]mawk[/tt]. But your amount of data can not be processed by every [tt]awk[/tt] implementation:
Code:
mawk: program limit exceeded: maximum number of fields size=32767
But [tt]gawk[/tt] works. With 20000 rows in Felix-head.txt and 100 rows / 300000 columns in Felix-data.txt, [tt]gawk[/tt] takes 50 seconds. For better performance I would try [tt]perl[/tt] instead.

Known bug: it leaves a separator space at the end of each line. For now I left it there for speed considerations.

The above code keeps the enumerated columns. To remove them instead, you have to change the [tt]if[/tt] condition before the [tt]printf[/tt]:
Code:
if([highlight]!([/highlight]i in n[highlight])[/highlight])printf"%s ",$i
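Assembled from the pieces above, the complete command to delete the listed columns would be:
Code:
[blue]master #[/blue] awk 'FNR==NR{h[$0]=1;next}FNR==1{for(i=1;i<=NF;i++)if($i in h)n[i]=1}{for(i=1;i<=NF;i++)if(!(i in n))printf"%s ",$i;print""}' Felix-head.txt Felix-data.txt
q e t u o p 
a d g j l ; 
z c b m . /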

Feherke.
 
Hi Feherke,

this is absolutely incredible, exactly what I was desperately looking for. Your code worked out of the box. No more than 10 minutes and the job was done :) During the last days I tried several scripts from experienced specialists at our institute as well as code from the internet - none worked with my high-dimensional dataset.

The worst problem was memory consumption, which crashed the server (over 30 GB of memory) several times. Your script didn't use more than 1% of the available memory.

I still can't believe it and can't tell you how happy I am about your extremely professional, structured, understandable and effective answer. This saved my whole research project, which would have been in real trouble otherwise!

The separator at the end of each line is no problem for the downstream applications in my case. But maybe you want to provide some quick awk code to remove it in a second step - just as a reference for others who will certainly come across this thread in the future?

Those future readers who are interested in a solution in Perl I'd like to refer to where I asked the same question.

Thank you so much!
Best regards and greetings from Munich, Germany, Felix
 
Hi

Felix said:
But maybe you want to provide some quick awk code to remove it in a second step
We usually remove trailing characters with code like this:
Code:
awk '{sub(/ $/,"")}1' /input/file
But if you are sure (like in this case) that all lines have the same amount of trailing characters to remove, string functions should be faster than regular expressions:
Code:
awk '{print substr($0,1,length()-1)}' /input/file
Personally I would use an off-topic solution:
Code:
sed 's/ $//' /input/file
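And as an untested sketch, the trailing space could be avoided in the extraction pass itself, by printing the separator before every field except the first - at the cost of one extra assignment per printed field:
Code:
awk 'FNR==NR{h[$0]=1;next}FNR==1{for(i=1;i<=NF;i++)if($i in h)n[i]=1}{s="";for(i=1;i<=NF;i++)if(i in n){printf"%s%s",s,$i;s=" "}print""}' Felix-head.txt Felix-data.txt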


Feherke.
 
Thanks again!

So I think everything that should be mentioned regarding this problem has now been said: problem completely solved. I'll close this thread.

Hope this will help others in the future as well.
Regards, Felix.
 