Recently I was given the task of analysing around 300 files and counting the occurrences of every word in each one. The aim was to put together a piece of code that goes through all the files in a directory, reads each file, lists all the words occurring in it and counts how many times each word occurs.
I quickly found out that for a single file the process is rather simple. The following code does a fine job:
for w in $(cat FILE.txt); do echo "$w"; done | sort | uniq -c >> results.out
This code reads FILE.txt, counts how often each word occurs and appends the resulting list to results.out.
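As a rough sketch of the same single-file idea, the word splitting can also be done with tr instead of a shell loop. The function name count_words is made up for illustration and is not part of the original one-liner. It assumes that a "word" is anything separated by whitespace, so punctuation stays attached to the words.

```shell
# count_words FILE: hypothetical helper name, used here for illustration only.
count_words() {
    # Turn every whitespace character into a newline and squeeze repeats,
    # so each word lands on its own line without shell word splitting.
    # sed drops the empty line that leading whitespace would leave behind.
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | sort | uniq -c
}
```

It could be used as `count_words FILE.txt >> results.out`. Because the file is never expanded by the shell, characters such as * in the text are not treated as file name patterns.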
However, putting this into a script that handles every file in a directory was a little more complicated. So I took another direction and found a piece of code that uses sed to do the same job on a single file. With this and some scripting knowledge I was able to put together just what I needed.
Additionally, I used the basename command to output the name of each file, so I know which results belong to which file.
The final piece of code looks like this:
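for file in /PATH/TO/DIRECTORY/*
do
    # Write the file name first so each block of counts can be identified
    basename "$file" >> results.out
    # Put each word on its own line (\n in the replacement needs GNU sed),
    # then count the words and sort by frequency, highest first
    sed 's/ /\n/g' "$file" | sort | uniq -c | sort -nr >> results.out
    # Empty line to separate this file's results from the next
    echo "" >> results.out
done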
This does the job and creates a single output file, results.out, containing:
- File name of each file
- Number of occurrences of each word in the file, sorted from high to low
- Empty line to separate data from each other
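The loop could also be wrapped in a small function. This is only a sketch under a few assumptions: the name count_dir and its two arguments are invented for illustration, words are lowercased, and anything that is not a letter (punctuation, digits, whitespace) is treated as a separator. That differs from the sed version, which splits on single spaces only.

```shell
# count_dir DIR OUTFILE: hypothetical helper, names are illustrative only.
count_dir() {
    dir=$1
    out=$2
    for file in "$dir"/*; do
        # Skip subdirectories and the literal pattern of an empty directory
        [ -f "$file" ] || continue
        # File name first, so each block of counts can be identified
        basename "$file" >> "$out"
        # Lowercase the text, turn every run of non-letters into one newline,
        # drop empty lines, then count and sort from high to low
        tr '[:upper:]' '[:lower:]' < "$file" \
            | tr -cs '[:alpha:]' '\n' \
            | sed '/^$/d' \
            | sort | uniq -c | sort -nr >> "$out"
        # Empty line to separate the results of one file from the next
        echo "" >> "$out"
    done
}
```

A call such as `count_dir /PATH/TO/DIRECTORY results.out` would append to results.out in the same layout as described above. The output file should live outside the directory being scanned, otherwise it would be counted as well.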
If anyone has any other suggestions or comments, I am happy to hear them!