Python Challenge: Word Frequency Analysis

Status
Not open for further replies.

soni21

Programmer
Apr 25, 2023
9
IN
Write a Python function that takes a string as input and returns a dictionary where the keys are unique words in the string, and the values are the frequencies of each word. The function should be case-insensitive and should ignore punctuation.

For example:

Python:
def word_frequency_analysis(text):
    # Your code goes here
    pass

# Test the function
sample_text = "Python is a powerful, versatile programming language. Python is widely used for web development, data analysis, artificial intelligence, and more."
result = word_frequency_analysis(sample_text)
print(result)

The expected output should be something like:

Python:
{
    'python': 2,
    'is': 2,
    'a': 1,
    'powerful': 1,
    'versatile': 1,
    'programming': 1,
    'language': 1,
    'widely': 1,
    'used': 1,
    'for': 1,
    'web': 1,
    'development': 1,
    'data': 1,
    'analysis': 1,
    'artificial': 1,
    'intelligence': 1,
    'and': 1,
    'more': 1
}

Provide a concise and efficient Python code solution along with any explanations or considerations. Thank you!
 
Sounds like a classroom assignment.

Skip,

[glasses]Just traded in my OLD subtlety...
for a NUance![tongue]

"The most incomprehensible thing about the universe is that it is comprehensible" A. Einstein

You Matter...
unless you multiply yourself by the speed of light squared, then...
You Energy!
 
It's certainly homework.
When I started learning Python 20 years ago, this was a demonstration example in an introductory book on "What Are Dictionaries Good For?"
If you want us to help you with this, show us the code you have tried so far and explain what doesn't work as you expected.
 
Because I was bored and haven't done anything with Python in a while.

Python:
def word_frequency_analysis(text):
    output = {}
    for word in text.lower().split():
        if word in output:
            output[word] = output[word] + 1 
        else:
            output[word] = 1

    return output

More compact, but perhaps less clear.

Python:
def word_frequency_analysis_2(text):
    output = {}
    for word in text.lower().split():  
        value = output[word] + 1 if word in output else 1
        output[word] = value
        
    return output
 
I realized that there is a punctuation problem with my original solutions. I cheated and found out about the 'string' module online.

Helper function to strip out punctuation.

Python:
import string

def strip_punctuation(input_string):
    return input_string.translate(str.maketrans('', '', string.punctuation))

And implementing it.

Python:
def word_frequency_analysis_1(text):
    output = {}
    for word in strip_punctuation(text).lower().split():
        if word in output:
            output[word] = output[word] + 1 
        else:
            output[word] = 1

    return output

Python:
def word_frequency_analysis_2(text):
    output = {}
    for word in strip_punctuation(text).lower().split():  
        value = output[word] + 1 if word in output else 1
        output[word] = value
        
    return output

result = word_frequency_analysis_2(sample_text)

For good measure, one more way.

Python:
def word_frequency_analysis_3(text):
    output = {}
    for word in strip_punctuation(text).lower().split():
        try:
            output[word] = output[word] + 1
        except KeyError:
            output[word] = 1
            
    return output
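
Two further variants that were not posted in the thread, offered only as a sketch: dict.get() with a default of 0 avoids the explicit membership test, and collections.Counter from the standard library does the counting itself. The names word_frequency_get and word_frequency_counter are made up for illustration; punctuation is deleted in the same way as in strip_punctuation above.

```python
import string
from collections import Counter

# Same punctuation handling as strip_punctuation above: delete the characters.
_DELETE_PUNCT = str.maketrans('', '', string.punctuation)

def word_frequency_get(text):
    # Hypothetical name. dict.get(word, 0) returns 0 for words not seen yet.
    output = {}
    for word in text.translate(_DELETE_PUNCT).lower().split():
        output[word] = output.get(word, 0) + 1
    return output

def word_frequency_counter(text):
    # Hypothetical name. Counter is a dict subclass; dict() converts it back.
    return dict(Counter(text.translate(_DELETE_PUNCT).lower().split()))

sample = "Python is powerful. Python is fun!"
print(word_frequency_get(sample))
print(word_frequency_counter(sample))
```

Both calls should print the same dictionary, with 'python' and 'is' counted twice.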
 
Lastly, for kicks, which is fastest on a longer string?

Declaration of Independence said:
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. --That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security. --Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.

1. word_frequency_analysis_1 took 4.990608100000827 seconds to run 100000 times
2. word_frequency_analysis_2 took 5.240722300004563 seconds to run 100000 times
3. word_frequency_analysis_3 took 9.201343699998688 seconds to run 100000 times
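
The timing code itself was not posted. A minimal sketch of how numbers like these could be collected with timeit follows; count_words, time_function, sample and runs are hypothetical names, and count_words is only a stand-in for the functions being compared. Absolute times will differ from machine to machine.

```python
import string
import timeit

def count_words(text):
    # Stand-in for any of the word_frequency_analysis_N functions above.
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    output = {}
    for word in cleaned.lower().split():
        output[word] = output.get(word, 0) + 1
    return output

def time_function(func, text, runs):
    # timeit.timeit calls the lambda `runs` times and returns total seconds.
    return timeit.timeit(lambda: func(text), number=runs)

sample = "We hold these truths to be self-evident, that all men are created equal."
runs = 1000  # the thread used 100000; kept small so the sketch finishes quickly
elapsed = time_function(count_words, sample, runs)
print(f"{count_words.__name__} took {elapsed} seconds to run {runs} times")
```

Passing each candidate function to time_function with the same text and run count keeps the comparison like for like.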
 
Bard's solution, after correcting some indentation mistakes.

Code:
import re

def word_frequencies(text):
  """
  Bard's solution
  Returns a dictionary of word frequencies in a text string.

  Args:
    text: A string containing the text to analyze.

  Returns:
    A dictionary where the keys are unique words in the text, and the values
    are the frequencies of each word.
  """
  # Lowercase the text and remove punctuation.
  text = text.lower()
  text = re.sub(r"[^\w\s]", "", text)

  # Split the text into words and count their frequencies.
  words = text.split()
  word_counts = {}
  for word in words:
    if word in word_counts:
      word_counts[word] += 1
    else:
      word_counts[word] = 1

  return word_counts

8.129551300000458 seconds
 
mintjulep,

But when you have the function
Code:
def strip_punctuation(input_string):
    return input_string.translate(str.maketrans('', '', string.punctuation))
then applying it to the string "foo,bar;baz:spam/eggs." gives
Code:
>>> strip_punctuation("foo,bar;baz:spam/eggs.")
'foobarbazspameggs'
which is not good, because the words are run together and there is nothing left for split() to separate.

IMO it would be better to use
Code:
def strip_punctuation(input_string):
    return input_string.translate(str.maketrans(string.punctuation, len(string.punctuation) * " "))
which, applied to the same string, gives
Code:
>>> strip_punctuation("foo,bar;baz:spam/eggs.")
'foo bar baz spam eggs '
and then you can split() it.
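
To make the difference concrete, here is a small side-by-side sketch of the two approaches. The names strip_by_deleting and strip_by_spacing are made up for this comparison and are not from the posts above.

```python
import string

def strip_by_deleting(input_string):
    # Original approach: punctuation characters are removed outright.
    return input_string.translate(str.maketrans('', '', string.punctuation))

def strip_by_spacing(input_string):
    # Suggested approach: each punctuation character becomes a space.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return input_string.translate(table)

text = "foo,bar;baz:spam/eggs."
print(strip_by_deleting(text).split())  # ['foobarbazspameggs']
print(strip_by_spacing(text).split())   # ['foo', 'bar', 'baz', 'spam', 'eggs']
```

One trade-off to keep in mind: with the spacing approach, a hyphenated word such as "self-evident" is counted as two words and "don't" becomes "don" and "t", so which variant is right depends on the text being analyzed.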
 
One more improvement: to return the dictionary in alphabetical order, change the return statement to

Python:
return dict(sorted(output.items()))

The sort imposes a pretty big performance hit.

This returns a list of tuples, which is faster but doesn't meet the problem statement.
Depending on the downstream use, that may be acceptable.

Python:
return sorted(output.items())
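
A short sketch of what the two return forms look like, using a small made-up counts dictionary. The final ordering by frequency is not something from the thread; it is included only as an example of a downstream use where the list form is convenient.

```python
counts = {'python': 2, 'is': 2, 'a': 1}

# dict() keeps the sorted order because dicts preserve insertion order (Python 3.7+).
as_dict = dict(sorted(counts.items()))
as_list = sorted(counts.items())

print(as_dict)  # {'a': 1, 'is': 2, 'python': 2}
print(as_list)  # [('a', 1), ('is', 2), ('python', 2)]

# Hypothetical extra: highest count first, ties broken alphabetically.
by_count = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(by_count)  # [('is', 2), ('python', 2), ('a', 1)]
```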

 
You did a nice job, but unfortunately there has been no feedback so far from soni21 who asked this question.
 
For my own interest.

Python:
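import re
import string
import timeit

# `declaration` (the long sample text) and `doit` (the number of runs)
# are assumed to be defined earlier in the script.

def strip_with_string(input_string):
    # Map every punctuation character to a space, then split on whitespace.
    return input_string.translate(str.maketrans(string.punctuation, len(string.punctuation) * " ")).split()

def strip_with_regex(input_string):
    # Replace anything that is not a word character or whitespace with a space.
    return re.sub(r"[^\w\s]", " ", input_string).split()

a = strip_with_string(declaration)
b = strip_with_regex(declaration)

# Confirm both approaches produce the same word list.
print(a == b)

T_string = timeit.timeit(lambda: strip_with_string(declaration), number=doit)
T_regex = timeit.timeit(lambda: strip_with_regex(declaration), number=doit)

print(f"Strip with string: {T_string}\nStrip with regex: {T_regex}")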

Code:
True
Strip with string: 1.7044300000416115
Strip with regex: 4.372240500000771

The string module's translate() approach is much faster than regex for stripping out punctuation: roughly 1.7 seconds versus 4.4 seconds in this test.
 