tell me how many specific words there are in the file?

asd1234qwerty · Mar 18, 2006

I'm trying to get the user to enter a word; the script should then count how many times that word occurs in the file. For example, I want it to say "There are 7 'and' words in the file" or "There are 5 'the' words in the file", etc.

I have the following, which I used to count a specific character, but I need to change it to count words instead.

tr -dc "$char" < text | wc -c

Thanks for your help.

p5wizard · Mar 18, 2006

Use this to convert spaces and tabs to newlines:
tr ' \t' '\n\n'

Use this to count the number of occurrences of the word 'word':
grep -c "^word$"

Just pipe these two together.
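
As a minimal sketch of the two steps piped together: the function name count_word, the variable names and the sample text below are made up for illustration. It uses grep -x -F (whole-line, fixed-string match) in place of the "^word$" pattern, so a word containing regex characters is still matched literally.

```shell
# Hypothetical helper: count whole-word occurrences of $1 in file $2.
count_word() {
    tr ' \t' '\n\n' < "$2" |   # put one word on each line
        grep -cxF -- "$1"      # -c count, -x whole line must match, -F no regex
}

sample=$(mktemp)               # throwaway sample file
printf 'the cat and the dog\nand\tthe bird\n' > "$sample"
word=the                       # in a real script: read -r word
echo "There are $(count_word "$word" "$sample") '$word' words in the file"
rm -f "$sample"
```

This prints "There are 3 'the' words in the file". The match is case-sensitive, and a word with punctuation attached (such as "the,") is not counted.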

HTH,

p5wizard

asd1234qwerty · Mar 18, 2006

Thanks for your reply. Would this work in the same way to find the number of sentences in the text, by moving any text after a full stop onto a new line and then using grep to count the number of lines with a full stop? Thanks again.

futurelet · Mar 18, 2006

Code:

ruby -e 'word=ARGV.shift
puts gets(nil).split.grep(word).size' and file.txt

p5wizard · Mar 19, 2006

asd1234qwerty said:
Would this work in the same way to find the number of sentences in the text, by moving any text after a full stop onto a new line and then using grep to count the number of lines with a full stop?

If you first convert newlines to spaces and then convert each period to a newline, you sort of count the number of sentences. I say "sort of" because a sentence might contain a URL like

http://www.tek-tips.com

and that would count as 2 sentences. So you'd need to take into account that a period in normal English text is followed by a space character, as in this text. Also, you might want to count question marks? And what about exclamation points!

So I would use tr to translate '!? ' into '..\n' (all question marks and exclamation points become periods, and spaces are turned into newlines), then grep for '\.$' to count the last word of each sentence, which gives you the number of sentences too. As '.' has a special meaning for grep (any-character wildcard), you need to escape it:

tr '!? ' '..\n'|grep -c '\.$'
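
A minimal sketch of the whole chain, including the newline-to-space step mentioned above. The function name count_sentences_naive and the sample text are hypothetical, chosen only to show that the periods inside the URL are not counted.

```shell
# Hypothetical helper: naive sentence count for the file named in $1.
count_sentences_naive() {
    tr '\n' ' ' < "$1" |      # join lines so a sentence can span a line break
        tr '!? ' '..\n' |     # ! and ? become periods, each space becomes a newline
        grep -c '\.$'         # count the words that end in a period
}

sample=$(mktemp)              # throwaway sample file
printf 'Is this one? Yes it is! See http://www.tek-tips.com for\nmore.\n' > "$sample"
count_sentences_naive "$sample"    # prints 3
rm -f "$sample"
```

The count is still only an approximation: an abbreviation such as "etc." followed by a space is counted as a sentence end.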

and then

HTH,

p5wizard

stefanwagner · Mar 19, 2006

Don't forget about abbrev. like etc. pp.
And this is another rule: A colon terminates a sentence too.
And if the text contains direct speech, anybody will cry "What the **** did I get in? It's a hard job.", because here a sentence terminates with a dot, but isn't followed by a blank.
And the last sentence might end with a dot, followed by nothing at all.

seeking a job as java-programmer in Berlin:

http://home.arcor.de/hirnstrom/bewerbung

p5wizard · Mar 19, 2006

Well, without a definition of "sentence", I just started freewheeling. And apparently a colon is a terminator for an independent clause of a sentence, but not for the sentence itself...

HTH,

p5wizard

stefanwagner · Mar 19, 2006

Don't you start the first word after a colon with uppercase in UK/US?

Why?

I thought I could transfer my knowledge of the German language to English in this case. I'm sorry if I got it wrong.

seeking a job as java-programmer in Berlin:

http://home.arcor.de/hirnstrom/bewerbung

chipperMDW · Mar 19, 2006

stefanwagner said:
Don't you start the first word after a colon with uppercase in UK/US?

Not unless you're making a mistake. I've seen many people make that mistake, though.

In the common Unix convention, a sentence ends with a period (or exclamation/question mark) that is followed by either two spaces or the end of a line. This allows you to differentiate between periods used in abbreviations and periods used to end sentences. Interestingly, this means you're not allowed to write "Mr." at the end of a line.

p5wizard · Mar 19, 2006

Stefanwagner said:
I thought I could transfer my knowledge of the German language to English in this case. I'm sorry if I got it wrong.

Well, if I'm not mistaken, you Germans start every Noun with an uppercase Letter also, and you can't transfer that Rule into the English Language either.

HTH,

p5wizard

p5wizard · Mar 19, 2006

chipperMDW said:
In the common Unix convention, a sentence ends with a period (or exclamation/question mark) that is followed by either two spaces or the end of a line. This allows you to differentiate between periods used in abbreviations and periods used to end sentences. Interestingly, this means you're not allowed to write "Mr." at the end of a line.

I wasn't aware of this 2-space rule, but adhering to this rule, consider this:

Code:

egrep -c '[\.\?\!]  |[\.\?\!]$' /path/to/textfile

This should count the number of sentence breaks in the middle of a text line plus the number of sentence breaks at the end of a text line. I'm not sure you need the backslash escape for egrep on all three punctuation characters (period, question mark and exclamation point), but it can't hurt.

HTH,

p5wizard

chipperMDW · Mar 20, 2006

I think you can get away with only backslashing the period.

However, that command won't work when two or more sentences end on the same line; there will be at least one occurrence of a sentence end, so the line will match and just be counted once.
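
A quick sketch that shows the problem: feed grep a single line holding three sentences (the function name and the sample line are made up for illustration). Because -c reports matching lines rather than matches, the result is 1. In a POSIX bracket expression the period, question mark and exclamation point are already literal, so the pattern below leaves the backslashes out; grep -E is the POSIX spelling of egrep.

```shell
# Hypothetical helper: counts LINES on stdin that contain at least one
# sentence end (punctuation followed by two spaces, or at end of line).
count_sentence_lines() {
    grep -Ec '[.?!]  |[.?!]$'
}

printf 'One.  Two!  Three?\n' | count_sentence_lines   # prints 1, not 3
```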

My grep (GNU) has a -o option to only show the matched output, and it prints each matching part on its own line. I think this is a GNU extension, but if your grep has that option, you can do:

Code:

egrep -o '[\.?!]  |[\.?!]$' "$INPUT" |wc -l

e.g.

Code:

$ egrep -o '[\.?!]  |[\.?!]$' |wc -l
This is a sentence.
This is a second sentence.  This is a third.
This fourth sentence
spans more than
one line.  The fifth one
does the same.
5
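
If your grep lacks -o, one possible alternative (a swapped-in technique, not something from the posts above) is awk: gsub returns the number of matches it replaced on a line, and replacing each match with itself ("&") leaves the text unchanged. The function name count_sentence_ends is hypothetical.

```shell
# Hypothetical helper: counts every sentence end on stdin, including
# several on the same line.
count_sentence_ends() {
    awk '{ n += gsub(/[.?!]  |[.?!]$/, "&") } END { print n + 0 }'
}

printf '%s\n' \
    'This is a sentence.' \
    'This is a second sentence.  This is a third.' \
    'This fourth sentence' \
    'spans more than' \
    'one line.  The fifth one' \
    'does the same.' | count_sentence_ends   # prints 5
```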

Regarding the two-space convention: it's an old tradition, and only a few tools really care about it. Emacs uses it to fill text, fmt uses it to reformat text, and troff takes it into account in producing its output. Those are the only ones I know about.

Now... time to count paragraphs ;-)
