
Python Code - Eliminate all the duplicated occurrences


nvhuser

Programmer
Apr 10, 2015

Hello, I would like to write a Python script that eliminates all duplicated occurrences in the second column, keeping only the first match.

For an input like this:
Code:
101000249 101000249
101000250 5552931
101000251 101000251
101000254 5552931
101000255 101000255
101000256 101000256
101000257 5552605
101000258 5552605
101000259 101000259
101000260 101000260


I should get this:
Code:
101000249 101000249
101000250 5552931
101000251 101000251
101000255 101000255
101000256 101000256
101000257 5552605
101000259 101000259
101000260 101000260

The Python code I attempted is the following:

Code:
#!/bin/python

file_object=open('file1.txt','r')
file_object2=open('file2.txt','w')

read_data=file_object.readlines()
nd=[]

for line in read_data:
        s=line
        if s[2] not in nd:
                nd.append(s[2])
                line = line.strip('\n')
                file_object2.write(str(line)+"\n")


Thank you very much for your support!
 
I think I get it: s[2] was only picking out the third character of the line, not a column. I need to split the line and test the second field.

Code:
#!/bin/python
file_object=open('sample.txt','r')
file_object2=open('sample2.txt','w')

read_data=file_object.readlines()
nd=[]

for line in read_data:
        s=line
        b=s.split()
        if b[1] not in nd:
                nd.append(b[1])
                line = line.strip('\n')
                file_object2.write(str(line)+"\n")

Does anybody agree?
 

The Python script took about 10 hours to remove the duplicated entries from a file with 1,970,028 lines. Do you have any suggestions to make the code a bit faster?
I hear that Python is very fast, but I am wondering whether a shell script would be faster in this case.
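
For reference, a rough way to watch where the time goes while the script runs (file names as in the script above; the 100,000-line progress interval is arbitrary):

Code:
import time

start = time.time()
nd = []

with open('sample.txt') as infile, open('sample2.txt', 'w') as outfile:
    for count, line in enumerate(infile, 1):
        b = line.split()[1]
        if b not in nd:          # linear scan of the list
            nd.append(b)
            outfile.write(line)
        if count % 100000 == 0:  # progress report every 100,000 lines
            print('%d lines in %.1f s' % (count, time.time() - start))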
 
Hi

nvhuser said:
read_data=file_object.readlines()
You mean you have the script slurp the whole ~2 million line file into memory? I usually avoid such practices.
• The variable s is not necessary
• strip()ing the newline just to add it back on the next line is pointless
• Wrapping the line in str() is pointless; it is already a string
Python:
#!/bin/python

file_object=open('sample.txt','r')
file_object2=open('sample2.txt','w')
nd=[]

for line in file_object:
        b=line.split()[1]
        if b not in nd:
                nd.append(b)
                file_object2.write(line)

file_object.close()
file_object2.close()

But on ~2 million lines a database might perform better. If that processing has to be done frequently, I would try an SQL database. If nothing else is handy, SQLite may do it too.
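
For example, a minimal sketch using Python's built-in sqlite3 module; the table and column names are placeholders, and the file names are the ones from the sample above:

Python:
import sqlite3

conn = sqlite3.connect(':memory:')   # an on-disk database file would also work
conn.execute('create table sample (id integer primary key, c1 text, c2 text)')

# Load the two columns of the input file.
with open('sample.txt') as infile:
    conn.executemany('insert into sample (c1, c2) values (?, ?)',
                     (line.split() for line in infile))

# Keep the first row for each distinct value of c2, in the original order.
with open('sample2.txt', 'w') as outfile:
    rows = conn.execute('select c1, c2 from sample'
                        ' where id in (select min(id) from sample group by c2)'
                        ' order by id')
    for c1, c2 in rows:
        outfile.write('%s %s\n' % (c1, c2))

conn.close()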

Feherke.
feherke.ga
 
The above code should run faster if you change nd to a dict instead of a list.
Searching for an item in a list takes linear time, O(n).
Searching for an item in a dictionary takes constant time, O(1).
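
A quick way to see the difference for yourself (the collection size below is made up, roughly matching your file):

Code:
import timeit

setup = ("items = [str(i) for i in range(1000000)]; "
         "as_list = list(items); "
         "as_dict = dict.fromkeys(items)")

# Membership test for a value near the end of the collection.
print(timeit.timeit("'999999' in as_list", setup=setup, number=100))  # linear scan
print(timeit.timeit("'999999' in as_dict", setup=setup, number=100))  # hash lookup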
 
Hi

JustinEzequiel said:
The above code should run faster if you change nd to a dict instead of a list.
Great suggestion!

It initially sounds scary to put so much data in a dict, but this seems to be far below the point where a dict becomes unfeasible due to the memory requirements of its internal storage.
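
A back-of-the-envelope check, with made-up keys roughly the size of the values in the sample data:

Python:
import sys

keys = [str(5000000 + i) for i in range(1000000)]   # one million seven-digit keys
d = dict.fromkeys(keys)

table = sys.getsizeof(d)                        # the dict's hash table itself
strings = sum(sys.getsizeof(k) for k in keys)   # the key objects
print('hash table: %.0f MB, keys: %.0f MB' % (table / 1e6, strings / 1e6))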

Feherke.
feherke.ga
 
Sorry guys, it seems that I need to study harder.

Are you suggesting something like this?

Code:
for line in file_object:
    dict['+str(line.split()[0])+']="+str(line.split()[1])+"
 
Hi

Definitely not.
Python:
#!/bin/python

file_object=open('sample.txt','r')
file_object2=open('sample2.txt','w')
nd={}

for line in file_object:
        b=line.split()[1]
        if b not in nd:
                nd[b]=None
                file_object2.write(line)

file_object.close()
file_object2.close()
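
By the way, since only membership matters here, a plain set would do the same job as the dict. A minimal sketch, using with so the files are closed automatically:

Python:
#!/bin/python

seen = set()

with open('sample.txt') as infile, open('sample2.txt', 'w') as outfile:
    for line in infile:
        b = line.split()[1]
        if b not in seen:        # O(1) membership test, like the dict
            seen.add(b)
            outfile.write(line)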

Feherke.
feherke.ga
 
Wow, I tested the Python code using the dictionary and it is really fast.

Thank you, both of you!

Feherke said:
But on ~2 million lines a database might perform better. If that processing has to be done frequently, I would try an SQL database. If nothing else is handy, SQLite may do it too.

Is SQL easy to use? I have some experience with MySQL; will that help?
 
Hi

This should do it in MySQL:
Code:
create table sample (
    id integer primary key auto_increment,
    c1 text,
    c2 text
);

load data infile 'sample.txt'
into table sample
fields terminated by ' '
(c1, c2);

select
c1, c2

from sample join (
    select
    min(id) id

    from sample

    group by c2
) foo using (id)

order by id

into outfile 'sample2.txt'
fields terminated by ' ';

drop table sample;
Note that the input file will be /var/lib/mysql/<database_name>/sample.txt. I have not played much with it, but for your ~2 million records an index on column c2 may help.


Feherke.
feherke.ga
 
Hi

After all these discussions I was curious how well MySQL would perform, so I generated a file with 2 million rows, of which 1 million were unique.
• Python / dict : ~2 seconds
• MySQL : ~2 minutes
• Python / list : ~2 hours for 99%, then the last 167 rows seemed to never finish, so I killed it
(While an off-topic solution has already been included above, AFAIK the shortest code for this task would be 15 characters of Awk: n[$2]?0:n[$2]=1. With this code gawk does the work in ~2 seconds and mawk in ~2 minutes.)
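
For reference, one way to build a comparable test file, assuming "1 million unique" means 1 million distinct values in the second column (the exact values below are made up):

Python:
import random

# 2,000,000 rows; even rows introduce a new second-column value,
# odd rows repeat a random earlier one, giving 1,000,000 distinct values.
with open('sample.txt', 'w') as outfile:
    for i in range(2000000):
        c1 = 101000000 + i
        if i % 2 == 0:
            c2 = c1                                     # new, unique value
        else:
            c2 = 101000000 + random.randrange(0, i, 2)  # repeat an earlier even row
        outfile.write('%d %d\n' % (c1, c2))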

Feherke.
feherke.ga
 
