QUESTION

xtolgax · Sep 16, 2000

Hi, my name is Tolga. I have just started learning Java; I am already familiar with C++. I have a project to do and would appreciate any help with it.
I want to read a text file, tally all the different words used in the file, and print a list of the words that appear most frequently, along with the number of times each word appears. The list should be printed in descending order of frequency, so that the most frequently used word is printed first, then the second most frequently used word, and so on. The alphanumeric characters consist of upper- and lower-case letters and digits. All other characters are called separator characters. A word is any sequence of consecutive alphanumeric characters that appears in the file preceded immediately either by the beginning of the file or a separator character and followed immediately either by the end of the file or a separator character.
The program should take one command-line argument, which is the name of the file to be processed. It should read words from the specified file, count how many times each word appears in the file, and produce output (on the standard output) showing all the words and their frequency of occurrence. The output, which should be sorted by decreasing frequency of occurrence, should appear as follows:

a 1573
the 439
an 128
i 23
if 10
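
The word definition above can be sketched as a small scanner that collects maximal runs of alphanumeric characters. This is only an illustration: the class name WordSplitter and the method splitWords are made up for this example, and Character.isLetterOrDigit also accepts non-ASCII letters and digits, which is slightly broader than the definition in the question.

```java
import java.util.ArrayList;
import java.util.List;

class WordSplitter {
    // Returns every maximal run of letters/digits in text, in order of appearance.
    // WordSplitter and splitWords are hypothetical names used only for this sketch.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isLetterOrDigit(ch)) {
                current.append(ch);               // still inside a word
            } else if (current.length() > 0) {
                words.add(current.toString());    // a separator ends the word
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            words.add(current.toString());        // word that runs to the end of the text
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitWords("Hello, world! It's 2000."));
    }
}
```

Running main prints [Hello, world, It, s, 2000]; the apostrophe counts as a separator, so "It's" becomes two words.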

I want to use some of the classes from the java.io package, such as java.io.FileReader, and the java.lang.Character class, which has methods useful for telling whether a character is alphanumeric. I also found the java.util.Hashtable class useful for storing words and their counts. I will probably also need the java.util.Enumeration interface in order to iterate over the words stored in a hash table.
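
A minimal sketch of the Hashtable and Enumeration combination mentioned above, assuming the words have already been extracted into an array. The class name HashtableTally and the method tally are invented for this example, and the generic type parameters need Java 5 or later; on an older JDK the same calls work with raw types and casts.

```java
import java.util.Enumeration;
import java.util.Hashtable;

class HashtableTally {
    // Counts how often each word occurs. HashtableTally and tally are
    // hypothetical names for this sketch, not part of any library.
    static Hashtable<String, Integer> tally(String[] words) {
        Hashtable<String, Integer> counts = new Hashtable<String, Integer>();
        for (int i = 0; i < words.length; i++) {
            Integer old = counts.get(words[i]);      // null if the word is new
            int updated = (old == null) ? 1 : old.intValue() + 1;
            counts.put(words[i], Integer.valueOf(updated));
        }
        return counts;
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> counts = tally(new String[] {"the", "a", "the"});
        // keys() returns an Enumeration; the iteration order is unspecified.
        Enumeration<String> keys = counts.keys();
        while (keys.hasMoreElements()) {
            String word = keys.nextElement();
            System.out.println(word + " " + counts.get(word));
        }
    }
}
```

Because a Hashtable has no defined iteration order, the entries still have to be sorted by count before printing.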
One thing I find confusing is the difference between byte and char values in Java. Byte values are 8-bit values capable of representing, for example, the ASCII codes for text characters. In contrast, char values are 16-bit values that represent characters in Unicode. When I use, e.g., the read() method of the class java.io.InputStream, I get back an int value that I would cast to byte before using it as part of a word. On the other hand, the read() method of the class java.io.Reader gives me an int value that I must cast to char. If I don't apply the cast, my program will not interpret the characters properly and I will get strange results. For both classes, end of file is signalled by read() returning -1, so I have to test for -1 before applying the appropriate cast.
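
The test-then-cast pattern for Reader.read() described above might look like the sketch below. It reads from a StringReader so that it runs without a file; for the actual project a FileReader would be passed in instead. The names ReadLoop and readAll are made up for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadLoop {
    // Reads every character from in. ReadLoop and readAll are hypothetical
    // names for this sketch.
    static String readAll(Reader in) throws IOException {
        StringBuilder text = new StringBuilder();
        int next;
        // read() returns an int: -1 at end of input, otherwise a char value.
        // The -1 test has to happen before the cast, because (char) -1 is
        // the ordinary character '\uffff' and no longer looks like "end of file".
        while ((next = in.read()) != -1) {
            text.append((char) next);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("an example 42")));
    }
}
```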
Thanks

fenris · Sep 18, 2000

Try using a Vector object. I did this to count the number of different file extensions in a particular directory.

Basically you create the Vector object, which will hold WordCount objects:
Vector list = new Vector(100,10);

WordCount is used to hold a word and the number of times it has been counted. The first time you see a word, create a WordCount object for it with list.add(new WordCount(word,1)), where word is a String.

public class WordCount
{
    private String word;   // the word being counted
    private int count;     // how many times it has been seen

    public WordCount(String word, int count)
    {
        this.word = word;
        this.count = count;
    }

    public String getWord() { return word; }
    public int getCount() { return count; }

    // call this each time the word is seen again
    public void increaseCount() { count++; }
}

With some minor modifications, it should work for you....

Troy Williams B.Eng. fenris@hotmail.com
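
One way the "minor modifications" might look is a find-or-add loop over the Vector, sketched below. The wrapper class VectorTally and the helper addWord are invented for this illustration, and WordCount is repeated as a nested class only so the block compiles on its own. The generic type parameter needs Java 5 or later.

```java
import java.util.Vector;

class VectorTally {
    // Same shape as the WordCount class above, nested here so the sketch
    // is self-contained.
    static class WordCount {
        private final String word;
        private int count;
        WordCount(String word, int count) { this.word = word; this.count = count; }
        String getWord() { return word; }
        int getCount() { return count; }
        void increaseCount() { count++; }
    }

    // Hypothetical helper: bump the count if the word is already in the
    // list, otherwise append a new entry with count 1.
    static void addWord(Vector<WordCount> list, String word) {
        for (int i = 0; i < list.size(); i++) {
            WordCount entry = list.get(i);
            if (entry.getWord().equals(word)) {
                entry.increaseCount();
                return;
            }
        }
        list.add(new WordCount(word, 1));
    }

    public static void main(String[] args) {
        Vector<WordCount> list = new Vector<WordCount>(100, 10);
        String[] words = {"the", "a", "the"};
        for (int i = 0; i < words.length; i++) {
            addWord(list, words[i]);
        }
        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.get(i).getWord() + " " + list.get(i).getCount());
        }
    }
}
```

Every call to addWord scans the whole list, so this gets slow when a file has many distinct words; a hash table avoids the scan at the cost of needing a separate sorting step.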

Tom7 · Oct 22, 2000

Hi! Here's a program that does this kind of word count. Please note:
- The code is only meant to give you some ideas (it's not very elegant).
- I think using a hashtable is a good idea. The only problem is sorting the table once it's complete.
- My program reads the file line by line. Maybe you want to do it differently.
- The following two methods are very useful:
a) Character.isLetterOrDigit(char ch)
b) String.toUpperCase() // or String.toLowerCase()
- My program doesn't sort the hashtable. It keeps track of the highest word count (i.e. the variable maxWords stores the count of the word that occurs most often). It converts the table to an array and then counts down from maxWords to 1. Inside the loop, each array entry's count is compared with the current value, and the entry is printed if they are equal. But I'm sure there is a better (= faster) way to sort the entries.

import java.io.*;
import java.util.Hashtable;
import java.util.Map;

public class WordCount {
    static Hashtable table = new Hashtable(100); // word -> Integer count
    static int maxWords = 0;                     // highest count seen so far
    static String path = System.getProperty("user.dir");

    static void print(String s) { System.out.println(s); }

    public static void main(String[] args) {
        File file = new File(path + "/" + args[0]);
        BufferedReader in;
        String line;
        try {
            in = new BufferedReader(new FileReader(file));
            // readLine() returns null at end of file
            while ((line = in.readLine()) != null) {
                parseLine(line);
            }
            in.close();
        } catch (IOException ex) {
            ex.printStackTrace();
            System.exit(1);
        }
        printHashTable();
    }

    static void parseLine(String line) {
        if (line == null) return;
        char[] buffer = line.toCharArray();
        String word = "";
        for (int i = 0; i < buffer.length; i++) {
            char ch = buffer[i];
            if (Character.isLetterOrDigit(ch)) {
                word += ch;
                // a word that runs to the end of the line ends here
                if (i == buffer.length - 1) addWord(word);
            } else if (!word.equals("")) { // a separator ends the current word
                addWord(word);
                word = "";
            }
        }
    }

    static void addWord(String word) {
        word = word.toUpperCase(); // count case-insensitively
        int count = 1;
        if (table.containsKey(word)) {
            count = ((Integer) table.get(word)).intValue() + 1;
        }
        // also covers first occurrences, so maxWords is at least 1
        if (count > maxWords) maxWords = count;
        table.put(word, Integer.valueOf(count));
    }

    static void printHashTable() {
        Object[] myArray = table.entrySet().toArray();
        // count down from the highest frequency so output is in descending order
        for (int j = maxWords; j > 0; j--) {
            for (int i = 0; i < myArray.length; i++) {
                Map.Entry entry = (Map.Entry) myArray[i];
                int currentInt = ((Integer) entry.getValue()).intValue();
                if (currentInt == j)
                    print(entry.getKey() + " " + currentInt); // e.g. "THE 439"
            }
        }
    }
}
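
As for a faster way to order the entries than counting down from maxWords: one common option is to copy the table's entries into a list and sort that list with a Comparator. The sketch below is only one possibility. The names SortByCount and sortedEntries are made up, the alphabetical tie-break for equal counts is an assumption that the original task does not specify, and the generic types need Java 5 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

class SortByCount {
    // Returns the entries ordered by descending count. Ties are broken
    // alphabetically (an assumption made for this sketch).
    static List<Map.Entry<String, Integer>> sortedEntries(Map<String, Integer> table) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(table.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue()); // larger count first
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> table = new Hashtable<String, Integer>();
        table.put("the", Integer.valueOf(2));
        table.put("a", Integer.valueOf(5));
        table.put("an", Integer.valueOf(2));
        List<Map.Entry<String, Integer>> sorted = sortedEntries(table);
        for (int i = 0; i < sorted.size(); i++) {
            System.out.println(sorted.get(i).getKey() + " " + sorted.get(i).getValue());
        }
    }
}
```

Running main prints "a 5", "an 2", "the 2" on separate lines. Sorting costs on the order of n log n comparisons for n distinct words, whereas the count-down loop scans the whole array once for every value from maxWords down to 1.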

Tom7 · Oct 22, 2000

The forum markup swallows square brackets, so the array indexing in my post got mangled.
The line in question should read

char ch = buffer[i];
