
tell me how many specific words there are in the file?

Status
Not open for further replies.

asd1234qwerty

Programmer
Mar 18, 2006
5
GB
I'm trying to get the user to enter a word; the script should then count how many times that word occurs in the file. For example, I want it to say, "There are 7 'and' words in the file", or "There are 5 'the' words in the file", etc.

I have the following which I used to find a specific character, but I need to change this round to find words instead.

tr -dc "$char" < text | wc -c

Thanks for your help.
 
use this to convert spaces and tabs to newlines
tr ' \t' '\n\n'

use this to count the number of occurrences of the word 'word'
grep -c "^word$"

just pipe these 2 together
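Piped together and wrapped in a function, the two steps might look like the sketch below. The function names count_word and report_word_count are made up for illustration, and grep's -x and -F options (whole-line, fixed-string match) stand in for the "^word$" pattern so that a word containing regex characters is still matched literally. Punctuation stuck to a word ("and,") is not stripped, so such words are not counted.

```shell
# count_word WORD FILE
# Hypothetical helper: put every space- or tab-separated word on its own
# line, then count the lines that are exactly WORD.
count_word() {
    tr ' \t' '\n\n' < "$2" | grep -c -x -F -- "$1"
}

# report_word_count WORD FILE
# Hypothetical helper: print the count in the wording asked for above.
report_word_count() {
    printf "There are %s '%s' words in the file\n" "$(count_word "$1" "$2")" "$1"
}

# Interactive use (assumes the file is named "text", as in the question):
#   printf 'Word to count: '; read -r word
#   report_word_count "$word" text
```

Matching is case-sensitive, so "And" at the start of a sentence is a different word from "and".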

HTH,

p5wizard
 
Thanks for your reply. Would this work in the same way to find the number of sentences in the text, by putting any text after a full stop onto a new line and then using grep to count the number of lines with a full stop? Thanks again.
 
Code:
ruby -e 'word=ARGV.shift
puts gets(nil).split.grep(word).size' and file.txt
 
Would this work in the same way to find the number of sentences in the text by separating any text after a full stop onto a new line and then using grep to count the number of lines with a full stop?

If you first convert newlines to spaces and then convert each period to a newline, you sort of count the number of sentences. I say "sort of" because a sentence might contain a URL, and that would count as 2 sentences. So you need to take into account that a period ending a sentence in normal English text is followed by a space character, as in this text. You might also want to count question marks. And what about exclamation points!

So I would use tr to turn '!? ' into '..\n' (all question marks and exclamation points become periods, and spaces become newlines), then grep for '\.$' to count the last words of all sentences, which gives you the number of sentences too. As '.' has a special meaning for grep (it matches any character), you need to escape it:

tr '!? ' '..\n'|grep -c '\.$'
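As a small sketch, the same pipeline can be wrapped in a function that takes the file as an argument. The name count_sentence_ends is hypothetical, and the result is only as good as the rule behind it: a word counts as a sentence end if it finishes with a period, question mark or exclamation point.

```shell
# count_sentence_ends FILE
# Hypothetical helper: turn '!' and '?' into '.', put each space-separated
# word on its own line, then count the words that end in a period.
count_sentence_ends() {
    tr '!? ' '..\n' < "$1" | grep -c '\.$'
}
```

A period inside a word, as in a host name, is not counted, because only a period at the end of a word matches '\.$'.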



HTH,

p5wizard
 
Don't forget about abbrev. like etc. pp.
And this is another rule: A colon terminates a sentence too.
And if the text contains direct speech, somebody will cry "What the **** did I get in? It's a hard job.", because here a sentence ends with a dot that isn't followed by a blank.
And the last sentence might end with a dot, followed by nothing at all.

seeking a job as java-programmer in Berlin:
 
Well, without a definition of "sentence", I just started freewheeling. And apparently a colon is a terminator for an independent clause of a sentence, but not for the sentence itself...


HTH,

p5wizard
 
Don't you start the first word after a colon with uppercase in the UK/US?

Why?

I thought I could transfer my knowledge of the German language to English in this case. I'm sorry if I got it wrong.

seeking a job as java-programmer in Berlin:
 
Don't you start the first word after a colon with uppercase in the UK/US?
Not unless you're making a mistake. I've seen many people make that mistake, though.


In the common Unix convention, a sentence ends with a period (or exclamation/question mark) that is followed by either two spaces or the end of a line. This allows you to differentiate between periods used in abbreviations and periods used to end sentences. Interestingly, this means you're not allowed to write "Mr." at the end of a line.
 
Stefanwagner said:
I thought I could transfer my knowledge of the German language to English in this case. I'm sorry if I got it wrong.

Well, if I'm not mistaken, you Germans start every Noun with an uppercase Letter also, and you can't transfer that Rule into the English Language either.


HTH,

p5wizard
 
chipperMDW said:
In the common Unix convention, a sentence ends with a period (or exclamation/question mark) that is followed by either two spaces or the end of a line. This allows you to differentiate between periods used in abbreviations and periods used to end sentences. Interestingly, this means you're not allowed to write "Mr." at the end of a line.

I wasn't aware of this 2-space rule, but adhering to this rule, consider this:

Code:
egrep -c '[\.\?\!]  |[\.\?\!]$' /path/to/textfile

This should count the number of sentence breaks in the middle of a text line plus the number of sentence breaks at the end of a text line. I'm not sure you need the backslash escape for egrep on all three punctuation characters (period, question mark and exclamation point), but it can't hurt.


HTH,

p5wizard
 
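I think you can get away with only backslashing the period. In fact, inside a bracket expression none of the three needs one: POSIX treats a backslash between [ and ] as a literal character, so [.?!] is enough.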

However, that command won't work when two or more sentences end on the same line; there will be at least one occurrence of a sentence end, so the line will match and just be counted once.
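The undercount is easy to reproduce. The sketch below wraps the line-counting command in a function (the name count_lines_with_sentence_end is made up for illustration; grep -E is the current spelling of egrep, and the brackets are written without backslashes):

```shell
# count_lines_with_sentence_end FILE
# Hypothetical helper: -c counts matching LINES, not matches, so a line
# holding several sentences still adds only 1 to the total.
count_lines_with_sentence_end() {
    grep -E -c '[.?!]  |[.?!]$' "$1"
}

# Example: three sentences on one line are reported as 1.
#   printf 'One.  Two.  Three.\n' > sample.txt
#   count_lines_with_sentence_end sample.txt
```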

My grep (GNU) has a -o option to only show the matched output, and it prints each matching part on its own line. I think this is a GNU extension, but if your grep has that option, you can do:
Code:
egrep -o '[\.?!]  |[\.?!]$' "$INPUT" |wc -l

e.g.
Code:
$ egrep -o '[\.?!]  |[\.?!]$' |wc -l
This is a sentence.
This is a second sentence.  This is a third.
This fourth sentence
spans more than
one line.  The fifth one
does the same.
5
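For a grep without -o, one portable way to count every match rather than every matching line is awk's gsub, which returns the number of substitutions it made. This is a minimal sketch under the same two-space convention, not a replacement for the command above; the function name count_sentences is made up for illustration.

```shell
# count_sentences [FILE...]
# Hypothetical helper: count sentence ends under the two-space convention.
# Reads standard input when no file is given.
count_sentences() {
    awk '
        {
            # gsub returns how many mid-line sentence breaks it found;
            # replacing each match with "&" (itself) leaves the text unchanged.
            n += gsub(/[.?!]  /, "&")
            # A terminator at the very end of the line is one more sentence.
            if ($0 ~ /[.?!]$/) n++
        }
        END { print n + 0 }
    ' "$@"
}
```

Fed the six sample lines from the example above, this also prints 5.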


Regarding the two-space convention: it's an old tradition, and only a few tools really care about it. Emacs uses it to fill text, fmt uses it to reformat text, and troff takes it into account in producing its output. Those are the only ones I know about.


Now... time to count paragraphs ;-)
 