
Find duplicates in a file

Status
Not open for further replies.

dvknn

IS-IT--Management
Mar 11, 2003
94
US
Hi,

This is a piece of a report that I have to generate, and I'm kinda stuck at this point..

I have a file which may have duplicates in it. I would like to find the duplicates in the file and the number of times each line has been repeated..

The sample file looks like this..

2003.11.10,chapstick, lipstick - FILTER: 1
2003.11.10,chapstick - FILTER: 1
2003.11.10,chapstick - FILTER: 1

In the output, I want to see
2003.11.10|chapstick,lipstick|FILTER: 1|1
2003.11.10|chapstick|FILTER: 1|2


Can anybody show me HOW?

Thanks
 
One way of doing this would be:

Read the file in, line by line, using a BufferedReader object and its readLine() method, into an ArrayList. Then sort it with java.util.Collections.sort(myArrayList), so that duplicate lines end up next to each other. Then use a "for" loop to rip through the list, checking whether the current element equals the previous one, and if not sending it to a file, or stdout, or whatever. Like:

Code:
String thisOne = "";
String lastOne = "";
for (int i = 0; i < myArrayList.size(); i++) {
  lastOne = thisOne;
  thisOne = (String) myArrayList.get(i);
  if (!thisOne.equals(lastOne)) {
    System.out.println(thisOne);
  }
}
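For what the original question actually asks (each line together with its repeat count), the same sort-and-scan idea can be extended to count each run of equal lines. A minimal sketch under that assumption; the class and method names here are mine, not from the thread:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DupCount {

    // Sort a copy so duplicates become adjacent, then count each run
    // of equal lines in a single pass.
    static Map<String, Integer> countDuplicates(List<String> lines) {
        List<String> sorted = new ArrayList<String>(lines);
        Collections.sort(sorted);
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        String lastOne = null;
        int run = 0;
        for (String thisOne : sorted) {
            if (thisOne.equals(lastOne)) {
                run++;                        // same as previous line: extend the run
            } else {
                if (lastOne != null) counts.put(lastOne, run);
                lastOne = thisOne;            // new line: start a new run
                run = 1;
            }
        }
        if (lastOne != null) counts.put(lastOne, run); // flush the final run
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2003.11.10,chapstick, lipstick - FILTER: 1",
            "2003.11.10,chapstick - FILTER: 1",
            "2003.11.10,chapstick - FILTER: 1");
        for (Map.Entry<String, Integer> e : countDuplicates(lines).entrySet()) {
            System.out.println(e.getKey() + " | " + e.getValue());
        }
    }
}
```

The sort costs O(n log n), but afterwards one linear pass is enough to count every duplicate, regardless of where the copies originally appeared in the file.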
 
Well, I do not quite agree with you. The duplicate strings can occur anywhere in the report, so I cannot just compare 'lastOne' with 'thisOne'.

Here's what I have, and I am pretty close, but I am going wrong somewhere. Can anybody identify where?

Code:
import java.io.*;
import java.util.*;

public class WordCount {

    static PrintWriter out;          // Output stream for writing the output file.
    static BufferedReader b = null;  // Input stream for reading the input file.

    static class WordData {
        // Represents the data we need about a word: the word and
        // the number of times it has been encountered.
        String word;
        int count;

        WordData(String w) {
            // Constructor for creating a WordData object when
            // we encounter a new word.
            word = w;
            count = 1; // The initial value of count is 1.
        }
    } // end class WordData

    static class CountCompare implements Comparator {
        // A comparator for comparing objects of type WordData
        // according to their counts. This is used for
        // sorting the list of words by frequency.
        public int compare(Object obj1, Object obj2) {
            WordData data1 = (WordData) obj1;
            WordData data2 = (WordData) obj2;
            return data2.count - data1.count;
            // The return value is positive if data2.count > data1.count,
            // i.e. data1 comes after data2 in the ordering if there
            // were more occurrences of data2.word than of data1.word.
            // The words are sorted according to decreasing counts.
        }
    } // end class CountCompare

    public static void main(String[] args) {
        // The program opens the input and output files. It reads
        // words from the input file into a TreeMap, in which
        // they are sorted in alphabetical order. The words
        // are copied into a List, where they are sorted according
        // to frequency. Then the words are copied from the
        // data structures to the output file.

        openFiles(args); // Opens input and output streams, using file
                         // names given as command-line parameters.

        TreeMap words;   // TreeMap in which keys are words and values
                         // are objects of type WordData.
        words = new TreeMap();

        readWords(b, words); // Reads words from the input stream and
                             // stores data about them in words.

        List wordsByCount;   // List will contain all the WordData
                             // values from the TreeMap, and will be
                             // sorted according to frequency count.
        wordsByCount = new ArrayList(words.values());
        Collections.sort(wordsByCount, new CountCompare());

        System.out.println("Printing words alphabetically");
        printWords(out, words.values()); // Prints words alphabetically.

        System.out.println("Printing words by frequency count");
        printWords(out, wordsByCount);   // Prints words by frequency count.

        System.out.println(words.size() + " distinct words were found.");
    } // end main()

    static void printWords(PrintWriter outStream, Collection wordData) {
        // wordData must contain objects of type WordData. The words
        // and counts in these objects are written to the output stream.
        Iterator iter = wordData.iterator();
        while (iter.hasNext()) {
            WordData data = (WordData) iter.next();
            outStream.println("   " + data.word + " (" + data.count + ")");
        }
    } // end printWords()

    static void openFiles(String[] args) {
        // Open the global variable "b" as an input file with name args[0].
        // Open the global variable "out" as an output file with name args[1].
        // If args.length != 2, or if an error occurs while trying to open
        // the files, then an error message is printed and the program
        // is terminated.
        System.out.println("I am HERE");
        if (args.length != 2) {
            System.out.println("Error: Please specify file names on the command line.");
            System.exit(1);
        }
        try {
            b = new BufferedReader(new FileReader(args[0]));
        } catch (IOException e) {
            System.out.println("Error: Can't open input file " + args[0]);
            System.exit(1);
        }
        try {
            out = new PrintWriter(new FileWriter(args[1]));
        } catch (IOException e) {
            System.out.println("Error: Can't open output file " + args[1]);
            System.exit(1);
        }
    } // end openFiles()

    static void readWords(BufferedReader inStream, Map words) {
        // Read all lines from inStream and store data about them in words.
        // When a line is encountered for the first time, a key/value pair
        // is inserted into words, with a new WordData object as the value.
        // When it is encountered again, the frequency count in the
        // corresponding WordData object is increased by one. If an error
        // occurs while trying to read the data, an error message is
        // printed and the program is terminated.
        String txSearchDate = null;
        String line = null;
        String keywordAndFilter = null;
        int firstCommaPosition = 0;

        System.out.println("In READWORDS METHOD");
        try {
            while ((line = inStream.readLine()) != null) {
                firstCommaPosition = line.indexOf(",");
                txSearchDate = line.substring(0, firstCommaPosition);
                keywordAndFilter = line.substring(firstCommaPosition + 1, line.length());
                keywordAndFilter = keywordAndFilter.toLowerCase();
                WordData data = (WordData) words.get(line);
                // Check whether the word is already in the Map. If not,
                // the value of data will be null. If it is not null, then
                // it is a WordData object containing the word and the
                // number of times we have encountered it so far.
                if (data == null) {
                    // We have not encountered the word before. Add it to
                    // the map. The initial frequency count is
                    // automatically set to 1 by the WordData constructor.
                    System.out.println("Data DOES NOT exist..So, inserting the data for the first time");
                    words.put(keywordAndFilter, new WordData(line));
                } else {
                    // The word has already been encountered, and data is
                    // the WordData object that holds data about the word.
                    // Add 1 to the frequency count in the WordData object.
                    System.out.println("Data already exists..So, incrementing the COUNT");
                    data.count = data.count + 1;
                }
            }
            inStream.close();
        } catch (FileNotFoundException fnf) {
            System.out.println("File not found..Please check if the file exists. " + fnf);
            System.exit(1);
        } catch (IOException ioe) {
            System.out.println("\nBuffered Reader is not ready. " + ioe);
            System.exit(1);
        } catch (Exception e) {
            System.out.println("Please check, an ERROR has occurred while reading the data.");
            System.out.println(e.toString());
            System.exit(1);
        }
    } // end readWords()

} // end class WordCount

 
dvknn :
>>>> Well, I do not quite agree with you. the duplicate strings can occur anywhere in the report..So, I cannot compare the 'lastOne with thisOne'.

Perhaps if you had read my post correctly, you would have seen that I suggested reading the file into an ArrayList and then performing a java.util.Collections.sort(arrayList) - after sorting, the duplicates are adjacent, so comparing each element with the previous one does work.
 
Maybe I'm wrong, but if you use a Hashtable, I think it eliminates the duplicated entries, since the keys are unique. I remember using it for that.

Cheers.

Dian.
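Dian's point can be taken one step further: because keys in a hash table are unique, the duplicates do collapse, but the value slot can then hold the running tally instead of being wasted. A minimal sketch of that idea (HashMap is used here, and the class and method names are mine, not from the thread):

```java
import java.util.HashMap;
import java.util.Map;

public class TallyLines {

    // Keys are unique in the map, so duplicate lines collapse onto a
    // single entry; the Integer value keeps the occurrence count.
    static Map<String, Integer> tally(String[] lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            Integer seen = counts.get(line);
            counts.put(line, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "2003.11.10,chapstick - FILTER: 1",
            "2003.11.10,chapstick - FILTER: 1",
            "2003.11.10,chapstick, lipstick - FILTER: 1"
        };
        for (Map.Entry<String, Integer> e : tally(lines).entrySet()) {
            System.out.println(e.getKey() + " | " + e.getValue());
        }
    }
}
```

No sorting is needed with this approach; each line is looked up and updated in roughly constant time, so duplicates scattered anywhere in the file are still counted correctly.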
 
Hey guys, thanks for responding..I got it..!!!!!!!
It was a small correction in my code..

Anyway, an ArrayList alone might not have solved the problem..I had to implement the Comparator interface.. But thanks sedj..

Dian, a Hashtable does eliminate duplicates, and I have done that already..but my customer needed a duplicate list with the number of times each word repeated.

Thanks,
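For completeness, the pipeline the thread converges on (count duplicate lines in a map, then emit each as date|keywords|filter|count, as in the sample output at the top) can be sketched as below. The parsing assumes the " - " separator from the sample data, and all names here are mine rather than from dvknn's actual fix:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateReport {

    // Count duplicate lines, then turn each distinct
    // "date,keywords - FILTER: n" line into "date|keywords|FILTER: n|count".
    static List<String> report(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String line : lines) {
            Integer c = counts.get(line);
            counts.put(line, c == null ? 1 : c + 1);
        }
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String line = e.getKey();
            int comma = line.indexOf(',');      // date ends at the first comma
            int dash = line.indexOf(" - ");     // keywords end before " - FILTER"
            String date = line.substring(0, comma);
            String keywords = line.substring(comma + 1, dash).trim();
            String filter = line.substring(dash + 3);
            out.add(date + "|" + keywords + "|" + filter + "|" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2003.11.10,chapstick, lipstick - FILTER: 1",
            "2003.11.10,chapstick - FILTER: 1",
            "2003.11.10,chapstick - FILTER: 1");
        for (String s : report(lines)) {
            System.out.println(s);
        }
    }
}
```

A LinkedHashMap is used so that the distinct lines come out in the order they first appeared; swapping in a TreeMap with a Comparator, as dvknn describes, would order them alphabetically or by frequency instead.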
 