Lab 2: Processing input files

Due Date: Saturday 04/23/2011 12:01 am

This lab is worth 20 points.

Your program should read in the contents of some text files, count the words in them, and produce a summary of which words appeared 10 or more times in all of the files combined.

The definition of a word for this lab is the one we mentioned in class: \w+. Words are separated by any non-word characters, so you can safely split the input on \W if you like. One way to accomplish this, given an input line $line containing words and non-words, is the following:

    my @words = split /\W+/, $line;

Notes:

1. All of the requirements for the lab can be completed by reading through each file only once.

2. The program should use the mechanism for reading from @ARGV as discussed in class and in chapter 6. A good example skeleton for this is:

       #!/usr/local/bin/perl -w
       use strict;

       # using <> in this manner will treat all of the files mentioned
       # on the command line as one big text file, and will read from them
       # in turn
       while (<>) {
           chomp;
           # do stuff with $_ such as the split command mentioned above
       }

       # do your printing of the appropriate words down here

3. Do not use any modules for this lab (except for strict).

4. Do not use any system calls to Unix utilities to do word counts or anything like that.

5. Simple error checking should be performed. For example, if no word appears 10 or more times in the input, I would expect you to detect that and print a message about it.

6. Your program should treat all input as one file. I might put 2 input text files on the command line, or I might put 20; as far as your program is concerned, it doesn't matter how many I put there. It will just read using the mechanism discussed in class.

Extra credit:

1. +3 Sort the output in increasing order by number of appearances for each word. The secondary sort criterion should be ASCII-betical order. (You will need to read ahead for this one.)

2. +4 If any of the files contains a line that starts with a # sign and contains ONLY one word, i.e., if the file has:

       #fire
       #water

   then fire and water are special words. After your output of the words that appear 10 or more times, also include a special note of how many times each of the special words shows up. Spaces between the # and the word are allowed, as is trailing space, but leading space is not allowed (so the # has to be the first character on the line). Special words DO count towards the count of words in the file. So, if you have the following file:

       a a a a
       #a
       a a a a a a

   then you should output that a appeared 11 times, and was a special word. If you are not doing extra credit, a still appeared 11 times (since the # is not part of \w).

3. +1 Calculate a simple percentage of the total words in the file that were on your list of words that appeared 10 or more times. Print this calculation out last, with some sort of explanatory text attached, for example:

       Percentage of frequently used words out of all words: 85.3%

   This means that if there were 100 total words in the file, and 6 distinct words each appeared 10 times, then the percentage would be 60% (60/100).

4. +1 Calculate a simple percentage of the total words in the file that were special words. For example, if there were 100 different words, and 8 of them were also marked as special words at some point, then you should output 8% for this item.
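
The counting, the threshold check, the sort from extra credit item 1, and the percentage from extra credit item 3 can be sketched roughly as follows. This is only one possible approach, not a reference solution. The array @lines is hypothetical sample data standing in for the while (<>) loop, and the names %count, $total, $threshold, and @frequent are illustrative. The lab does not say whether "The" and "the" are the same word, so this sketch assumes they are different. One detail worth knowing: split /\W+/ returns an empty leading field when a line starts with a non-word character (such as #), so the sketch filters out empty strings.

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical sample input; in the lab these lines would come from while (<>).
my @lines = (
    ("the cat saw the dog") x 5,
    "a a a a",
    "#a",
    "a a a a a a",
);

my $threshold = 10;    # "10 or more times", per the lab text
my %count;             # word => number of appearances
my $total = 0;         # total number of words seen

for my $line (@lines) {
    # split /\W+/ yields "" as the first field for a line like "#a",
    # so drop zero-length fields before counting.
    for my $word (grep { length } split /\W+/, $line) {
        $count{$word}++;
        $total++;
    }
}

# Keep words at or above the threshold, sorted by count (ascending),
# then by plain string comparison as the secondary criterion.
my @frequent = sort { $count{$a} <=> $count{$b} or $a cmp $b }
               grep { $count{$_} >= $threshold } keys %count;

if (@frequent) {
    print "$_ $count{$_}\n" for @frequent;

    # Sum the appearances of all frequent words for the percentage.
    my $frequent_total = 0;
    $frequent_total += $count{$_} for @frequent;
    printf "Percentage of frequently used words out of all words: %.1f%%\n",
        100 * $frequent_total / $total;
} else {
    print "No word appeared $threshold or more times.\n";
}
```

With the sample data above, "the" appears 10 times and "a" appears 11 times out of 36 words in total, so the sketch prints "the 10", then "a 11", then a percentage of 58.3%. Special-word detection (extra credit items 2 and 4) is left out here.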