Using FreqDist and Writing to CSV

December 24, 2023

I'm trying to use NLTK and pandas to find the top 100 words in one CSV file and list them in a new CSV file. I am able to plot the words, but when I print to CSV I get word | count 52

Solution 1:

Here you go. The code is quite compressed, so feel free to expand if you like.

First, ensure the source file is actually a CSV file (i.e. comma separated). I copied/pasted the sample text from the question into a text file and added commas (as shown below).

Breaking the code down line by line:

Read the CSV into a DataFrame
Extract the text column and flatten into a string of words, and tokenise
Pull the 100 most common words
Write the results to a new CSV file

Code:

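import pandas as pd
# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded
from nltk import FreqDist, word_tokenize

# Read the source CSV into a DataFrame
df = pd.read_csv('./SECParse3.csv')
# Join every row of the 'text' column into one string, then tokenise it
words = word_tokenize(' '.join(df['text']))
# Count the tokens and keep the 100 most frequent as (word, count) pairs
common = FreqDist(words).most_common(100)
# Write the pairs to a new CSV, leaving out the DataFrame index
pd.DataFrame(common, columns=['word', 'count']).to_csv('words_out.csv', index=False)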

Sample Input:

filename,text
AAL_0000004515_10Q_20200331,generally industry may affected
AAL_0000004515_10Q_20200331,material decrease demand international air travel
AAPL_0000320193_10Q_2020032,february following initial outbreak virus china
AAP_0001158449_10Q_20200418,restructuring cost cost primarily relating early

Output:

word,count
cost,2
generally,1
industry,1
may,1
affected,1
material,1
decrease,1
...
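
For comparison, here is a minimal sketch of the same read, count, and write steps using only the standard library. It is not the method from the answer above: it swaps FreqDist for collections.Counter and replaces word_tokenize with a plain whitespace split, which is only reasonable if the text is already cleaned like the sample input. The function name top_words and its parameters are hypothetical, chosen for illustration.

```python
import csv
from collections import Counter


def top_words(in_path, out_path, n=100, text_column="text"):
    """Count whitespace-separated words in one CSV column and write the top n.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    counts = Counter()
    with open(in_path, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            # str.split() is a crude stand-in for nltk.word_tokenize;
            # a missing or empty cell contributes no words.
            counts.update((row.get(text_column) or "").split())

    # most_common returns (word, count) pairs, highest count first
    common = counts.most_common(n)

    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["word", "count"])  # header row
        writer.writerows(common)            # one row per (word, count) pair
    return common
```

Calling top_words('./SECParse3.csv', 'words_out.csv') would write a two-column file in the same word,count layout as the output above. Each pair is written as its own row, so the word and its count land in separate columns.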
