finding duplicates in a CSV file


rewdee (Programmer)
Aug 17, 2001
I haven't worked at this low a level before, and I am wondering how to do this.

I receive a CSV file of about 5000 lines daily, and I need to check for duplicates in a particular column. I have two ideas on how to go about doing this. One way would be to iterate through the file, adding the column value to a TStringList if it is not already in the list; if the value is already in that TStringList, add it to a second TStringList that holds the duplicates. With this method I would only iterate through the file once.

A second method would be to use two loops: the first iterates through the file, and for each line a second loop goes through the file again looking for the current value. If there is a match, add the value to a TStringList.
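Very roughly, I picture the second method looking something like the sketch below (untested; the file name and the column position are just placeholders):

procedure FindDuplicatesTwoLoops;
var
  Lines, Fields, DupSL: TStringList;
  i, j, Matches: Integer;
  Key: string;
begin
  Lines := TStringList.Create;
  Fields := TStringList.Create;
  DupSL := TStringList.Create;
  try
    Lines.LoadFromFile('C:\daily.csv');   {placeholder file name}
    for i := 0 to Lines.Count - 1 do
    begin
      Fields.CommaText := Lines[i];
      Key := Fields[6];                   {placeholder column position}
      Matches := 0;
      {inner loop: scan the whole file again for the current value}
      for j := 0 to Lines.Count - 1 do
      begin
        Fields.CommaText := Lines[j];
        if Fields[6] = Key then
          Inc(Matches);
      end;
      {the value always matches itself, so more than one match means a duplicate}
      if (Matches > 1) and (DupSL.IndexOf(Key) < 0) then
        DupSL.Add(Key);
    end;
  finally
    Lines.Free;
    Fields.Free;
    DupSL.Free;
  end;
end;

The inner loop means the file gets scanned once per line, which is why I suspect the first method would be faster on 5000 lines.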

Which is the better method? Which is faster? Can anyone share some code if they have done something similar?

Thanks,
Rewdee
 
I know this is a Delphi forum, but I once had a huge text file that had to be imported into SQL for a Delphi app. I wrote some code using Delphi that opened the file, read each line, and converted some of the data from EBCDIC to ASCII.

I then rewrote the conversion in about 5 lines of Perl, which ran about 15 to 20 times faster. I had to do some research to figure out what Perl was all about, but the time savings were incredible.

So I guess the question is... do you have to do this in Delphi?
 
I haven't used Perl, but the Delphi routine that I created below uses just a few lines. The extra lines listed are just for creating the TStringList objects and setting up my progress bar.

*************************************************
procedure TfrmCampData.btnDupClick(Sender: TObject);
var
  ContentSL,
  tempSL,
  DupSL: TStringList;
  sLine: string;
  tf: TextFile;
  LineCount,
  PosCount: Integer;
begin
  setStatusMessage('Searching For Duplicates');
  AssignFile(tf, 'C:\qryr_DetailedMandays-Company.txt');
  Reset(tf);

  {first pass: count the lines so the progress bar has a total}
  LineCount := 0;
  while not Eof(tf) do
  begin
    ReadLn(tf, sLine);
    Inc(LineCount);
  end;

  ContentSL := TStringList.Create;
  DupSL := TStringList.Create;
  tempSL := TStringList.Create;
  try
    PosCount := 1;
    Reset(tf);

    {second pass: split each line and test the key column}
    while not Eof(tf) do
    begin
      ReadLn(tf, sLine);
      ContentSL.CommaText := sLine;

      {IndexOf returns -1 when the value is not in the list, so test >= 0;
       testing > 0 would miss duplicates of the very first value added}
      if tempSL.IndexOf(ContentSL[6]) >= 0 then
        DupSL.Add(ContentSL[6])
      else
        tempSL.Add(ContentSL[6]);

      Inc(PosCount);
      pBar.Position := Round(PosCount / LineCount * 100);
    end;
    clearmessages;
    CloseFile(tf);
  finally
    ContentSL.Free;
    DupSL.Free;
    tempSL.Free;
  end;
end;
******************************************************

The ContentSL.CommaText := sLine; line is what does the work of separating my columns.
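For example, CommaText splits a made-up line like this:

ContentSL.CommaText := 'Acme,123,"Smith, John",2001-08-17';
{ContentSL[0] = 'Acme'
 ContentSL[1] = '123'
 ContentSL[2] = 'Smith, John'   - the quoted comma stays inside the field
 ContentSL[3] = '2001-08-17'}

One thing to watch: with plain CommaText an unquoted space also acts as a separator, so any field containing spaces needs to be quoted in the file.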

Rewdee

 
If the unique records in your daily CSV file are going to be appended to another file, such as a database file, then I would consider using some database components. Otherwise the first method you described is fine for the sake of simplicity, in my opinion.

Some methods and declarations that might be of interest if you use the string-list method you described:

UniqueFields, UniqueRecords: TStrings;
aField, aCSVRecord: String;

UniqueFields := TStringList.Create;
UniqueRecords := TStringList.Create;  {this list has to be created as well, or the Add below will fail}

{obtain your aField after reading a record from the CSV file (aCSVRecord)}

if UniqueFields.IndexOf(aField) < 0 then
begin
  {IndexOf returns -1 if no match is found}
  UniqueFields.Add(aField);
  UniqueRecords.Add(aCSVRecord);
end;

{when finished reading and testing all CSV records}
UniqueRecords.SaveToFile('c:\MyUniqueRecords.csv');

{If you need speed, you need to implement some database components}
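Put together, an untested sketch of that idea could look something like this (the file names and the column position, index 6, are made up for illustration only):

procedure SaveUniqueRecords;
var
  Lines, Fields, UniqueFields, UniqueRecords: TStringList;
  i: Integer;
  aField: string;
begin
  Lines := TStringList.Create;
  Fields := TStringList.Create;
  UniqueFields := TStringList.Create;
  UniqueRecords := TStringList.Create;
  try
    Lines.LoadFromFile('C:\daily.csv');         {assumed input file}
    for i := 0 to Lines.Count - 1 do
    begin
      Fields.CommaText := Lines[i];             {split the CSV record}
      aField := Fields[6];                      {column being checked}
      if UniqueFields.IndexOf(aField) < 0 then  {-1 means not seen yet}
      begin
        UniqueFields.Add(aField);
        UniqueRecords.Add(Lines[i]);            {keep the whole record}
      end;
    end;
    UniqueRecords.SaveToFile('C:\MyUniqueRecords.csv');
  finally
    Lines.Free;
    Fields.Free;
    UniqueFields.Free;
    UniqueRecords.Free;
  end;
end;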

Good luck.
 