UbuntuHak: Count the Occurrence of All Words in a Number of Text Files

Saturday, November 1, 2014

Recently I was given the task of analysing a set of files (around 300) and counting the occurrences of every word in each file. The aim was to put together a piece of code that goes through all the files in a directory, reads each file, lists all the words occurring in it and counts how many times each word occurs.

I quickly found out that for a single file the process is rather simple; the following code does a fine job:

for w in $(cat FILE.txt); do echo "$w"; done | sort | uniq -c >> results.out

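This code reads in FILE.txt, prints each whitespace-separated word on its own line, sorts the words so that identical ones sit next to each other, and then uses uniq -c to collapse them into a list of counts, which is appended to results.out.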
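
Because the loop above splits on whitespace only, "Cat", "cat" and "cat." would be counted as three different words. The following is a minimal sketch of one way around that, assuming it is acceptable to treat anything that is not a letter as a separator. The function name count_words is my own invention for illustration, not part of any tool.

```shell
#!/bin/sh
# count_words: hypothetical helper; reads text on stdin, prints "count word" lines.
# Steps: fold to lower case, turn every run of non-letters into a single
# newline, drop any empty line left at the start, then count and sort by frequency.
count_words() {
    tr '[:upper:]' '[:lower:]' |
    tr -cs '[:alpha:]' '\n' |
    sed '/^$/d' |
    sort | uniq -c | sort -nr
}

# Example: "the" and "The" are now counted together
printf 'The cat saw the other cat. The end.\n' | count_words
```

For a real file this would be called as count_words < FILE.txt >> results.out. Note that this also splits words such as "don't" at the apostrophe, which may or may not be what you want.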

However, putting this into a script that loops over many files was a little more complicated. So I took another direction and found a piece of code that uses sed to do the same job on a single file. With this and some scripting knowledge I was able to put together just what I needed.

Additionally, I used the basename command to print the name of each file, so I know which counts belong to which file.
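
As a quick illustration of what basename does, here is a small sketch. The path is only a placeholder and does not need to exist, since basename works on the string alone.

```shell
# basename strips the leading directories from a path
basename /PATH/TO/DIRECTORY/FILE.txt        # prints FILE.txt

# an optional second argument also removes a trailing suffix
basename /PATH/TO/DIRECTORY/FILE.txt .txt   # prints FILE
```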

The final piece of code looks like this:

for file in /PATH/TO/DIRECTORY/*
do
# print the file name as a heading
basename "$file" >> results.out
# put each word on its own line (GNU sed), then count and sort by frequency
sed 's/ /\n/g' "$file" | sort | uniq -c | sort -nr >> results.out
echo "" >> results.out
done

This does the job and creates a single file, results.out, which contains the following for each input file:

File name of each file
Occurrence of each word in the file, sorted from high to low
Empty line to separate data from each other
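
If the files are spread over subdirectories, the same report can be produced with find instead of a plain glob. This is only a sketch under a few assumptions of mine: the files end in .txt, no file name contains a newline, and the report file itself does not match *.txt inside the scanned directory. The function name word_report and its two arguments are hypothetical.

```shell
#!/bin/sh
# word_report DIR OUT: hypothetical helper that writes one block per *.txt file
# found under DIR (including subdirectories) into the file OUT.
word_report() {
    dir=$1
    out=$2
    : > "$out"    # start with an empty report
    find "$dir" -type f -name '*.txt' | sort | while IFS= read -r file
    do
        {
            # file name as a heading
            basename "$file"
            # one word per line, drop empty lines, count, sort from high to low
            tr -s '[:space:]' '\n' < "$file" | sed '/^$/d' | sort | uniq -c | sort -nr
            # empty line to separate the blocks
            echo ""
        } >> "$out"
    done
}

# usage (both paths are placeholders):
# word_report /PATH/TO/DIRECTORY results.out
```

Unlike the loop above, this version truncates the report at the start instead of appending to it, so running it twice does not duplicate the data.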

If anyone has any other suggestion or comment I am happy to hear it!
