Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

A Challenge... 2

Status
Not open for further replies.

duncdude

Programmer
Jul 28, 2003
1,979
0
0
GB
I have just been reading a debate as to whether it is best to learn C or C++.
While i am not too interested in which is best i am shocked by the amount of code required to perform a simple task

As this debate went on one individual lay down the gauntlet to try to prove that C++ was more compact and faster.

His challenge was this:-
Write a program that reads text from standard-input, and returns (prints) the total number of distinct words found in the text

The text file was 55,704,078 bytes in size and contains 6,727,317 words in total. It holds 576,059 distinct words

The web page is at should you be interested

I simply wondered how minimal and speedy the code could be from a UNIX scripting guru

Well - if any of you are up for the challenge!?
 
Hi

Good question. I think the base movement should be something like this, but the speed could be a real problem.

Code:
sed 's/[^[:alpha:]]/\n/g' | grep . | sort -u | wc -l

Feherke.
 
That's the spirit... and just the sort of solution i was hoping to see! And much more condensed than the 121 line C++ solution!

Good work Feherke!

A * for you


Kind Regards
Duncan
 
I simply wondered how minimal and speedy the code could be from a UNIX scripting guru

You can get small code, but the problem will be the time. Each command invoked will add overhead, and each program called from the commandline will have to create it's own datastructure to hold the data to perform it's operation before passing it along. This will also cause it to use more memory. Moreover, if they had used a hash map in the C++ version the number of lines and time would have been better... but it's not an STL standard yet.

[plug=shameless]
[/plug]
 
Hi jstreich

They mention hashes all through the article - have they not used a hash map - i wouldn't know as i'm not a C programmer!?

my, probably poor, quick Perl attempt:-

Code:
[b]#!/usr/bin/perl[/b]

open (IN, "< words.txt");

while (<IN>) { $uniq{lc $_}++ foreach (split(/\W+/)) }

close IN;

[red]while (($key, $value) = each %uniq) {
  print "$key\t$value\n";
}[/red]


Kind Regards
Duncan
 
I'm refering to the suggested hash map standard, they are creating their own hash table from scratch, hence the need of the arrays and defitinition of their own hash function.

[plug=shameless]
[/plug]
 
This should with mawk or gawk:
Code:
## Count unique words in file
BEGIN{RS="[^A-Za-z]+"}
$0 && !a[$0]++ {c++}
END{print c}
However, I don't know what will happen when you try to put 576,059 entries in an associative array.
 
One awk way:
awk '{for(i=1;i<=NF;++i)++a[$i]}END{for(i in a)++j;print j}' words.txt

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
PHV said:
One awk way:
awk '{for(i=1;i<=NF;++i)++a[$i]}END{for(i in a)++j;print j}' words.txt
I'm assuming that a "word" is a sequence of letters.
This code would count each of these lines as only 1 word:

helter-skelter
awk95.exe
shouldn't

And it would count this as 2 words:

100, 101

Expanding the code:
Code:
{ for (i=1;i<=NF;++i)
  ++a[$i]
}
END{
  for(i in a)
    ++j
  print j
}
Note that a value such as a[k] is never used;
therefore ++a[$i] could simply be a[$i], since Awk creates an entry when it is referenced.
 
futurelet, have you read the C source from the challenge site ?
A word is defined by scanf("%s", buf)

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
A word is defined by scanf("%s", buf)

Not being fluent in C, I ask: does this mean that the following line would count as 1 "word"?

``Afterwards---no!''

If so, this benchmark is suprisingly crude.
 
I ask: does this mean ...
As you post in an unix forum I suggest you:
man scanf

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
PHV (and any others who are wondering), here's how scanf("%s",buf) reads from stdin: it skips any leading whitespace, then copies characters to the given address until it encounters whitespace.

This means that this:

``Afterwards---no!---don't!''

is counted by the C++ programmer as one long "word" that begins with `` and ends with ''.
 
PHV (and any others who are wondering), here's how scanf("%s",buf) reads from stdin: it skips any leading whitespace, then copies characters to the given address until it encounters whitespace.

This means that this

``Afterwards---no!---don't!''

is counted by the C++ programmer as one long "word" beginning with `` and ending with ''.
 
@duncdude: "That's the spirit... and just the sort of solution i was hoping to see! And much more condensed than the 121 line C++ solution!"

And guess in which language sed, grep, sort and wc are written, and how many lines that is?


seeking a job as java-programmer in Berlin:
 
stefan - i think you are missing the point. I know perfectly (or can make an educated guess) as to what language the tools are written in. However, this was simply a bit of fun. I was simply trying to reinforce the beauty of the UNIX command-line tools - and how they can be used to solve a task both speedily & concisely. Look at PHV's solution.

awk '{for(i=1;i<=NF;++i)++a[$i]}END{for(i in a)++j;print j}' words.txt

Beautiful! It's tiny!!! I just think, for mortals, that C (and variants of) can be a crazily complex language - possibly that some adopt to solve a task - without understanding how much of a 'sledgehammer to crack a walnut' it really is. And, as the originator of the post comments, he didn't even understand how the last (faster) solution worked! This is scary. I am glad i am investing my time in Perl and UNIX command-line utilities, and for the most part, i have never had a problem i can't solve using them


Kind Regards
Duncan
 
PHV

I gave your awk script a go - blimey! Can you please explain how the uniqueness is handled. I understand hashes within Perl - just can't get my head around the awk script!

while (<IN>) { [red]$uniq{lc $_}[/red]++ foreach (split(/\W+/)) }

A [red]*[/red] for you for that solution!


Kind Regards
Duncan
 
In case anyone is interested...

867mhz G4 OS X Panther - 4 mins 31 seconds
Dual 2.5 G5 OS X Panther - 2 mins 2 seconds
Dual 2.5 G5 OS X Tiger - 2 mins 4 seconds

67 mb source file
6,727,317 words
approx. 260,500 unique finds


Kind Regards
Duncan
 
@duncdude: I totally agree on the power of unix-command-line-tools.
But if those tools are written in .c, it should be possible to generate a c-file:
Code:
int main (int arc, char argv**)
{
        int lines = wc ("-l", sort ("-u", (grep (".", sed ("s/[^[:alpha:]]/\n/g", argv[1]))));
        cprintf ("%i", lines);
        return 0;
}
shouldn't it?

Of course nobody likes to compile over and over again, and write trivial wrappers around the functions used (#include, int main, printf, return).

But the the power of the commandline depends on the programs installed, and the power of c/ c++ depends on the libraries.

If you write the appropriate library, the c++ solution will be a one-liner too.

In Java you get a solution of 17 lines, by using standard-libraries:
Code:
import java.io.*;
import java.util.*;

public class CountUniqueWords
{
	public static void main (String args[]) throws IOException
	{
		BufferedReader isr = new BufferedReader (new InputStreamReader (System.in));
		Set <String> set = new HashSet <String> ();
		String line;
		while ( (line = isr.readLine ()) != null)
		{
			set.addAll (Arrays.asList (line.split ("[^a-zA-Z]")));
		}
		System.out.println (set.size ());
	}
}


seeking a job as java-programmer in Berlin:
 
Hi Stefan

You may have a brain the size of a small planet. But i am just not that bright. I have tried many times to get my head around C - bog standard, none of the fancy (++) OO methodology - and i can't believe how complex it is to even deal with strings of text. Most problems, as far as i am concerned, can be tackled with a far more friendly language like Perl. I love Perl. It makes me happy. I thoroughly enjoy writing scripts in it. Believe me - i am not a moron - but i have tried dozens of times to get to grips with C... and it usually makes me want to cry/rip my hair out. This is annoying as i am determined to be able to rattle off small solutions using C in the near future. This is why i spend time on the Perl & UNIX scripting forums - because they can elegantly solve problems. In fact, if i had the choice of being a masterful C programmer, or equally adept within Perl, i'd probably choose C - simply for its purity & the fact it doesn't need to be interpreted. But i am simply not clever enough...


Kind Regards
Duncan
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top