Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

A Challenge... 2

Status
Not open for further replies.

duncdude

Programmer
Jul 28, 2003
1,979
0
0
GB
I have just been reading a debate as to whether it is best to learn C or C++.
While i am not too interested in which is best i am shocked by the amount of code required to perform a simple task

As this debate went on one individual lay down the gauntlet to try to prove that C++ was more compact and faster.

His challenge was this:-
Write a program that reads text from standard-input, and returns (prints) the total number of distinct words found in the text

The text file was 55,704,078 bytes in size and contains 6,727,317 words in total. It holds 576,059 distinct words

The web page is at should you be interested

I simply wondered how minimal and speedy the code could be from a UNIX scripting guru

Well - if any of you are up for the challenge!?
 
Can you please explain how the uniqueness is handled
Using the associative array capability of awk:
for(i=1;i<=NF;++i)++a[$i]
create an array a indexed by the current word counting the occurrence
for(i in a)++j
count the number of entries created in the array, thus the number of distnct words.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
@duncdude: You may have a brain the size of a small planet. But i am just not that bright. I have tried many times to get my head around C - bog standard, none of the fancy (++) OO methodology - and i can't believe how complex it is to even deal with strings of text.

My brain is small as a cats brain (but not that fast), but I have an excellent documentation: javadocs.

For text-issues like the above, I allways use the shell, and guess nobody reaches the development-speed of the shell with c, c++ or java.

seeking a job as java-programmer in Berlin:
 
I have a cat and he told me this:-

He told me that i am wrong to contrast Perl with Java (or C++) - and that it is OO methodology that i am having difficulty with. He reminded me that when i read the Learning Perl Objects, References & Modules - O'Reilly book that i nearly cried. I then remembered that ever since i was about 12 years old, when i first started programming in Basic on a Dragon 32 - a LONG time ago, that practically all my experience is with procedural languages. He is a wise cat. So i therefore apologise for making any comparisons.


Kind Regards
Duncan
 
locate mice | cat

lol! superb! :)


Kind Regards
Duncan
 
PHV said:
Can you please explain how the uniqueness is handled
Using the associative array capability of awk:
In Perl and Ruby, known as hashes.
for(i in a) ++j
count the number of entries created in the array, thus the number of distnct words.
You don't need this in Brian Kernighan's "One True Awk". You simply say length(a).
duncdude said:
I have tried many times to get my head around C - bog standard, none of the fancy (++) OO methodology - and i can't believe how complex it is to even deal with strings of text
I think the hardest thing about C is that it is so cluttered with pointers---and pointers to pointers. E.g.,
s = "Foo bar.";
assigns an address to s. If you want to master C, you probably should become very adept at handling pointers.
 
Hi futurelet

I (obviously) agree with your point about the pointers. Whilst i am perfectly aware i am not the only one who finds this concept difficult to get my head around (maybe because it is a world apart from a procedural language) - i just wonder if i am finding it more difficult than the average joe programmer. As i mentioned above, i started programming in BASIC at 12 years old, and made good progress then. In fact i reckon i haven't really made much of a leap since a few years after that. I am now 36. Maybe young brains are so much more able to adapt. I hit a kind of cieling in any language i look at now and feel strongly that all my code has a point that i can't get beyond. I have also 'dabbled' in

Pascal (Borland)
Postscript
C (UNIX)

I have been using Perl for years now & love it - i just find it so much easier to use than the others (obviously excluding BASIC). I mention this because i am frustrated at myself for finding it so difficult to take that next step - that which i feel is OO. Do you know what i mean?

Thank you for letting me rant on!


Kind Regards
Duncan
 
I've barely scratched the surface of OO programming, but the basic ideas seem pretty simple. Here are some examples from Ruby:
[tt]
p [['ram',88],['dog',99],['rat',13]].sort.flatten
--> ["dog", 99, "ram", 88, "rat", 13]
[/tt]
We send a message to the array asking it to sort itself, then another asking it to flatten itself. Of course, this won't work if arrays don't know how to sort themselves.
[tt]p [1,2,3,4].map{|x| x * 10 }
--> [10, 20, 30, 40]
[/tt]
Let's say we often multiply each item in an array by 10. Let's teach arrays a new method named "times10":
[tt]class Array
def times10
map{|x| x * 10 }
end
end

p [1,2,3,4].times10
--> [10, 20, 30, 40]
[/tt]

As an alternative to C, consider Eiffel. It's OO, and there's an open-source version called SmartEiffel.
 
I know this is an old topic, but I came across it while looking for something completely different. :) Anyway, is there a place I can download that txt file from? I'd like to give it a go on my Unix box with the simplest command I can think of (more file.txt |sort|uniq|wc -l) and see how long it takes. It's a rather dated box by today's standards, but I'd like to compare those times to a fully functioning C++ program as the poster listed in his web link. I'm just wondering if simplicity + power (I'd say this dated box runs rings around any pc for something like this) > some nice OO code.
 
Hi

I never seen that file. But is 53 Mb. When such amount of data is redirected through pipes ( three in your sample code ), will be slow on any machine. Beside the slowness, is possible to not be allowed to use as much resources as it needs.

Using command line utilities is just quick to develop, not to run.

Feherke.
 
Anyway I doubt that (more file.txt |sort|uniq|wc -l) works well.
Another way:
tr -cs '[:alnum:]' '[\012*]' < file.txt | sort -u | wc -l

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
Hi feherke -

I didn't have duncdude's 53meg file, but I had a 56 meg file w/ approx. 630,000 lines in it that I ran on my server using my three pipes and it took less than 20 seconds (not much less, I couldn't read my dial close enough to tell if it was 18 or 19 seconds). So, I guess I answered my self, simplicity + power > 00 code in this case (I would assume though, that if I ran that same C++ code, if possible, on the same box, it would trump this). Again, my server is dated, but for tasks such as this, it will run rings around any workstation or pc that I've seen.
 
But your three pipes doesn't count words but lines !
 
PHV - Good call! I am so used to just counting lines. Anywho, I switched it to a wc -w and I just figured out that a cat is much faster than a more for some reason. So that same file came out to run in 11 seconds. Not shabby.
 
Again, you aren't counting distinct words, but words in distinct lines.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
Ha! I just can't win for losing today! Oh well, no more distractions for now. lol
 
Code:
asux ->~/proj/mini/forum > time cat 64gesamt.txt | java CountUniqueWords
11186

real    0m24.187s
user    0m19.305s
sys     0m0.612s
asux ->~/proj/mini/forum > time cat 64gesamt.txt | sh uniqueWordCount.sh
11185

real    0m32.480s
user    0m30.027s
sys     0m0.338s
I'm a bit surprised to see java winning.
Note the difference between the counted Words 11185 to 11186.
Source is /usr/src/linux/Documentation: cat *.txt > a.txt
cat a.txt a.txt > b.txt
and appending it until about 64 MB is used.

The shell-script uses feherkes solution.

seeking a job as java-programmer in Berlin:
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top