Python Challenge: Word Frequency Analysis

soni21 · Dec 14, 2023

Write a Python function that takes a string as input and returns a dictionary where the keys are unique words in the string, and the values are the frequencies of each word. The function should be case-insensitive and should ignore punctuation.

For example:

Python:

def word_frequency_analysis(text):
    # Your code goes here
    pass

# Test the function
sample_text = "Python is a powerful, versatile programming language. Python is widely used for web development, data analysis, artificial intelligence, and more."
result = word_frequency_analysis(sample_text)
print(result)

The expected output should be something like:

Python:

{
    'python': 2,
    'is': 2,
    'a': 1,
    'powerful': 1,
    'versatile': 1,
    'programming': 1,
    'language': 1,
    'widely': 1,
    'used': 1,
    'for': 1,
    'web': 1,
    'development': 1,
    'data': 1,
    'analysis': 1,
    'artificial': 1,
    'intelligence': 1,
    'and': 1,
    'more': 1
}

Provide a concise and efficient Python code solution along with any explanations or considerations. Thank you!

mintjulep · Dec 14, 2023

Bard and ChatGPT are pretty good at Python.

SkipVought · Dec 14, 2023

Sounds like a classroom assignment.

Skip,
Just traded in my OLD subtlety... for a NUance!
"The most incomprehensible thing about the universe is that it is comprehensible" A. Einstein

You Matter...
unless you multiply yourself by the speed of light squared, then...
You Energy!

mikrom · Dec 14, 2023

It's certainly homework.
When I started learning Python 20 years ago, this was a demonstration example in an introductory book on "What Are Dictionaries Good For?"
If you want us to help you with this, show us the code you have tried so far and explain what doesn't work as you expected.

mintjulep · Dec 15, 2023

Because I was bored and haven't done anything with Python in a while.

Python:

def word_frequency_analysis(text):
    output = {}
    for word in text.lower().split():
        if word in output:
            output[word] = output[word] + 1 
        else:
            output[word] = 1

    return output

More compact, but perhaps less clear.

Python:

def word_frequency_analysis_2(text):
    output = {}
    for word in text.lower().split():  
        value = output[word] + 1 if word in output else 1
        output[word] = value
        
    return output

mintjulep · Dec 15, 2023

I realized that there is a punctuation problem with my original solutions: split() leaves punctuation attached to the words. I cheated and found out about the 'string' library online.
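To make the punctuation problem concrete, here is a minimal sketch. The function name count_words_naive and the short test sentence are made up for illustration; the logic is the same lowercase-and-split counting as in the first solutions.

```python
def count_words_naive(text):
    # Lowercase, split on whitespace, and count, with no punctuation handling.
    output = {}
    for word in text.lower().split():
        output[word] = output.get(word, 0) + 1
    return output

naive = count_words_naive("Python is powerful. Python, again.")
print(naive)
# {'python': 1, 'is': 1, 'powerful.': 1, 'python,': 1, 'again.': 1}
```

Because the comma stays attached, 'python' and 'python,' end up as two different keys, each counted once.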

Helper function to strip out punctuation.

Python:

import string

def strip_punctuation(input_string):
    return input_string.translate(str.maketrans('', '', string.punctuation))

And implementing it.

Python:

def word_frequency_analysis_1(text):
    output = {}
    for word in strip_punctuation(text).lower().split():
        if word in output:
            output[word] = output[word] + 1 
        else:
            output[word] = 1

    return output

Python:

def word_frequency_analysis_2(text):
    output = {}
    for word in strip_punctuation(text).lower().split():  
        value = output[word] + 1 if word in output else 1
        output[word] = value
        
    return output

result = word_frequency_analysis_2(sample_text)

For good measure, one more way.

Python:

def word_frequency_analysis_3(text):
    output = {}
    for word in strip_punctuation(text).lower().split():
        try:
            output[word] = output[word] + 1
        except KeyError:
            output[word] = 1
            
    return output
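Not posted in the thread, but worth knowing: the standard library's collections.Counter does the tallying for you. The sketch below is one possible variant, not a replacement for the versions above; the name word_frequency_analysis_counter is made up, and the helper is the same punctuation-deleting one shown earlier.

```python
import string
from collections import Counter

def strip_punctuation(input_string):
    # Same helper as above: deletes every ASCII punctuation character.
    return input_string.translate(str.maketrans('', '', string.punctuation))

def word_frequency_analysis_counter(text):
    # Counter tallies the words; dict() converts the result to a plain dictionary.
    return dict(Counter(strip_punctuation(text).lower().split()))

counts = word_frequency_analysis_counter("Python is powerful. Python is versatile!")
print(counts)
# {'python': 2, 'is': 2, 'powerful': 1, 'versatile': 1}
```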

mintjulep · Dec 15, 2023

Lastly, for kicks, which is fastest on a longer string?

Declaration of Independence said:
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. --That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security. --Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world.

1. word_frequency_analysis_1 took 4.990608100000827 seconds to run 100000 times
2. word_frequency_analysis_2 took 5.240722300004563 seconds to run 100000 times
3. word_frequency_analysis_3 took 9.201343699998688 seconds to run 100000 times

mintjulep · Dec 15, 2023

Bard's solution, after correcting some indentation mistakes.

Code:

import re

def word_frequencies(text):
  """
  Bard's solution
  Returns a dictionary of word frequencies in a text string.

  Args:
    text: A string containing the text to analyze.

  Returns:
    A dictionary where the keys are unique words in the text, and the values
    are the frequencies of each word.
  """
  # Lowercase the text and remove punctuation.
  text = text.lower()
  text = re.sub(r"[^\w\s]", "", text)

  # Split the text into words and count their frequencies.
  words = text.split()
  word_counts = {}
  for word in words:
    if word in word_counts:
      word_counts[word] += 1
    else:
      word_counts[word] = 1

  return word_counts

8.129551300000458 seconds

mikrom · Dec 15, 2023

mintjulep,

But when you have the function

Code:

def strip_punctuation(input_string):
    return input_string.translate(str.maketrans('', '', string.punctuation))

then applying it on the string "foo,bar;baz:spam/eggs." delivers

Code:

>>> strip_punctuation("foo,bar;baz:spam/eggs.")
'foobarbazspameggs'

which is not good, because then there is nothing to split()

IMO it would be better to use

Code:

def strip_punctuation(input_string):
    return input_string.translate(str.maketrans(string.punctuation, len(string.punctuation) * " "))

which applied on the same string delivers

Code:

>>> strip_punctuation("foo,bar;baz:spam/eggs.")
'foo bar baz spam eggs '

and then you can split() it.
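Putting the pieces together, a complete version might look like the sketch below. Building the translation table once as a module-level constant (the name _PUNCT_TO_SPACE is made up) and using dict.get for the counting are my own choices, not something from the thread.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
# Built once here instead of on every call (an assumption, not from the thread).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punctuation(input_string):
    return input_string.translate(_PUNCT_TO_SPACE)

def word_frequency_analysis(text):
    output = {}
    for word in strip_punctuation(text).lower().split():
        # get() returns 0 for a word that has not been seen yet.
        output[word] = output.get(word, 0) + 1
    return output

joined = word_frequency_analysis("foo,bar;baz:spam/eggs.")
print(joined)
# {'foo': 1, 'bar': 1, 'baz': 1, 'spam': 1, 'eggs': 1}

sample_text = "Python is a powerful, versatile programming language. Python is widely used for web development, data analysis, artificial intelligence, and more."
result = word_frequency_analysis(sample_text)
print(result)
```

One consequence of replacing punctuation with spaces: a hyphenated word such as "self-evident" is counted as two words, where deleting the hyphen would give the single word "selfevident". Which is preferable depends on the text.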

mintjulep · Dec 15, 2023

@mikrom

Thanks for the improvement.

mintjulep · Dec 15, 2023

One more improvement to return the dictionary in alphabetical order. Change the return statement to

Python:

return dict(sorted(output.items()))

The sort imposes a pretty big performance hit.

Returning a list of tuples instead is faster, but it doesn't meet the problem statement, which asks for a dictionary. Depending on the downstream use, that may not matter.

Python:

return sorted(output.items())
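A small sketch of what the two return forms produce, using a made-up four-word dictionary. The third variant, sorting by descending count, is an extra that nobody in the thread asked for, but it is often what a frequency analysis is wanted for.

```python
output = {'python': 2, 'is': 2, 'a': 1, 'web': 1}

# Dictionary rebuilt in alphabetical key order
# (dictionaries keep insertion order in Python 3.7+).
as_dict = dict(sorted(output.items()))
print(as_dict)   # {'a': 1, 'is': 2, 'python': 2, 'web': 1}

# List of (word, count) tuples in alphabetical order.
as_list = sorted(output.items())
print(as_list)   # [('a', 1), ('is', 2), ('python', 2), ('web', 1)]

# Extra: most frequent words first, ties broken alphabetically.
by_count = sorted(output.items(), key=lambda item: (-item[1], item[0]))
print(by_count)  # [('is', 2), ('python', 2), ('a', 1), ('web', 1)]
```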

mikrom · Dec 17, 2023

You did a nice job, but unfortunately there has been no feedback so far from soni21 who asked this question.

mintjulep · Dec 18, 2023

For my own interest.

Python:

import re
import string
import timeit

# `declaration` is the Declaration of Independence excerpt quoted earlier and
# `doit` is the number of timing runs; both are defined elsewhere in the script.

def strip_with_string(input_string):
    return input_string.translate(str.maketrans(string.punctuation, len(string.punctuation) * " ")).split()

def strip_with_regex(input_string):
    return re.sub(r"[^\w\s]", " ", input_string).split()

a = strip_with_string(declaration)
b = strip_with_regex(declaration)

# Check that both approaches give the same word list.
print(a == b)

T_string = timeit.timeit(lambda: strip_with_string(declaration), number=doit)
T_regex = timeit.timeit(lambda: strip_with_regex(declaration), number=doit)

print(f"Strip with string: {T_string}\nStrip with regex: {T_regex}")

Code:

True
Strip with string: 1.7044300000416115
Strip with regex: 4.372240500000771

The string library is much faster than regex for stripping out punctuation: roughly 2.5 times faster in this test.
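The two approaches agree on the Declaration text, but they are not equivalent in general. The sketch below uses two made-up inputs to show where they differ: string.punctuation contains the underscore and covers ASCII only, while the regex class \w treats the underscore as a word character and [^\w\s] also catches non-ASCII punctuation such as curly quotes.

```python
import re
import string

def strip_with_string(input_string):
    return input_string.translate(str.maketrans(string.punctuation, len(string.punctuation) * " ")).split()

def strip_with_regex(input_string):
    return re.sub(r"[^\w\s]", " ", input_string).split()

underscore = "snake_case names"
curly = "\u201cquoted\u201d text"  # curly double quotes around "quoted"

print(strip_with_string(underscore))  # ['snake', 'case', 'names']
print(strip_with_regex(underscore))   # ['snake_case', 'names']

print(strip_with_string(curly))       # ['“quoted”', 'text']
print(strip_with_regex(curly))        # ['quoted', 'text']
```

So the speed advantage of the string approach comes with the caveat that it only knows about ASCII punctuation.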
