Lab 5: Javadoc and IO


  1. Javadoc

    Augment your classes from Lab 4 (DenseMultiSet and DenseFrequencyLibrary) with Javadoc comments. These comments should follow the Sun guidelines "How to Write Doc Comments for the Javadoc Tool". (See the resources listed under the class web page for a link to this document.)

    You should submit both your Java code, augmented with the Javadoc comments, and the HTML files generated by the javadoc tool.
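
    As a rough illustration of the doc-comment style, here is a minimal sketch. The class ExampleMultiSet and its methods are hypothetical stand-ins, not the Lab 4 API; your own DenseMultiSet will have different members. Each comment opens with a one-sentence summary and then uses @param, @return, and @throws tags.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Counts how many times each element has been added.
 * <p>
 * This class is a hypothetical stand-in used only to illustrate
 * doc-comment style; it is not the Lab 4 <code>DenseMultiSet</code>.
 *
 * @param <E> the type of element being counted
 */
class ExampleMultiSet<E> {
    private final Map<E, Integer> counts = new HashMap<>();

    /**
     * Adds one occurrence of the specified element to this multiset.
     *
     * @param element the element to add; must not be <code>null</code>
     * @throws NullPointerException if <code>element</code> is <code>null</code>
     */
    void add(E element) {
        if (element == null) {
            throw new NullPointerException("element must not be null");
        }
        counts.merge(element, 1, Integer::sum);
    }

    /**
     * Returns the number of occurrences of the specified element.
     *
     * @param element the element whose count is requested
     * @return the number of times <code>element</code> has been added,
     *         or 0 if it has never been added
     */
    int count(Object element) {
        return counts.getOrDefault(element, 0);
    }
}
```

    The sketch is package-private only to keep it self-contained. The javadoc tool documents public and protected members by default, so your lab classes and methods would normally be declared public.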

  2. Distribution Plagiarist

    In this part of the lab, you will write a program that generates random phrases based on a source text. The algorithm for generating these phrases is explained below. Your solution should use the DenseFrequencyLibrary class you wrote for Lab 4.

    Basic Frequency Distribution of Characters

    Consider a lengthy piece of text, for example a novel. Each letter appears with a certain frequency. For example, 5% of the characters might be 'e's, while 0.3% of the characters might be 'z's. The distribution of frequencies is a characteristic property of the given piece of text. This property is represented by the function Freq (where c is a character):
     Freq(c) = # c / total # of characters

    Different pieces of text can have the same distribution of frequencies. In fact, given a distribution of individual character frequencies, one could generate a new random phrase with a similar distribution. One would repeatedly append a random character, where the probability of picking any one character is equal to its relative frequency in the original text. Such a procedure, in a sense, plagiarizes from the original text since it copies the original text's distribution of characters to form a phrase that is then passed off as original.
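
    The procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the class BasicFrequencySketch and its method names are made up for this example, and it uses a plain java.util map rather than your Lab 4 classes.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class BasicFrequencySketch {
    /** Counts the occurrences of each character in the text. */
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    /** Picks a character with probability (its count) / total. */
    static char pick(Map<Character, Integer> counts, int total, Random rng) {
        int target = rng.nextInt(total); // uniform in [0, total)
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            target -= e.getValue();
            if (target < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("total does not match the counts");
    }

    /** Builds a phrase by repeatedly appending a frequency-weighted character. */
    static String randomPhrase(String text, int length, Random rng) {
        if (text.isEmpty()) {
            throw new IllegalArgumentException("source text must not be empty");
        }
        Map<Character, Integer> counts = countChars(text);
        StringBuilder phrase = new StringBuilder();
        for (int i = 0; i < length; i++) {
            phrase.append(pick(counts, text.length(), rng));
        }
        return phrase.toString();
    }
}
```

    Here Freq(c) corresponds to counts.get(c) divided by text.length(); the pick method walks the counts until the random target falls inside a character's share of the total.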

    Of course, the random phrase resulting from such a procedure is likely to be gibberish. While 5% of the characters in the original text might be 'e's, they are not equally likely to appear at any point. For example, having seen a period followed by a space, it is highly unlikely that the next character is a lower-case 'e'. Conversely, having seen a space, followed by a 't' then an 'h', it is more likely that the next character is an 'e'. Hence, a characteristic frequency distribution of individual characters should reflect the context in which they appear. This observation leads to the generalization described in the next section.

    Generalized Frequency Distribution of Characters

    One can generalize the notion of character frequency to the frequency of characters following a given string, or key, of a fixed length (say k). This property is represented by the function KeyFreq (where str is a string of length k and c is a character):
     KeyFreq(str,c) = # concat(str,c) / total # concat(str,?)

    To illustrate the difference, consider a piece of text that is 2000 characters long and that consists entirely of 'a's and 'b's. The first half of the text is all 'a's, and the second half is all 'b's. The basic character frequency distribution is 50-50. But the key string based character frequency distribution, for substrings of length 2, has three different key strings: "aa", "ab", and "bb". The individual character probabilities for these key strings are given in the following table.

    Key String    Prob of 'a'    Prob of 'b'
    "aa"          .999           .001
    "ab"          0              1
    "bb"          0              1

    Note that this distribution is parameterized by the length of the key string. For a given length, k, all substrings of that length in the original text must be found and the character immediately following each instance identified. For example, for key strings of length 3, there might be 100 occurrences of the substring " th", 75 of which are followed by 'e' and 25 of which are followed by 'r'. In other words, the substring " th" is 3 times as likely to be followed by 'e' as by 'r'.
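
    The counting step described above might look like the following sketch. The names KeyFrequencySketch, followerCounts, and keyFreq are hypothetical, and nested java.util maps stand in for whatever interface your DenseFrequencyLibrary provides.

```java
import java.util.HashMap;
import java.util.Map;

class KeyFrequencySketch {
    /** Maps each key of length k to the counts of the characters that follow it. */
    static Map<String, Map<Character, Integer>> followerCounts(String text, int k) {
        Map<String, Map<Character, Integer>> table = new HashMap<>();
        // Stop while a following character still exists: i + k < text.length().
        for (int i = 0; i + k < text.length(); i++) {
            String key = text.substring(i, i + k);
            char next = text.charAt(i + k);
            table.computeIfAbsent(key, unused -> new HashMap<>())
                 .merge(next, 1, Integer::sum);
        }
        return table;
    }

    /** KeyFreq(str, c): occurrences of str followed by c, over all followed occurrences of str. */
    static double keyFreq(Map<String, Map<Character, Integer>> table, String key, char c) {
        Map<Character, Integer> followers = table.get(key);
        if (followers == null) {
            return 0.0; // key never appears with a following character
        }
        int total = 0;
        for (int n : followers.values()) {
            total += n;
        }
        int count = followers.getOrDefault(c, 0);
        return (double) count / total;
    }
}
```

    Applied to the 2000-character text of 'a's and 'b's with k = 2, this sketch finds the three keys from the table: "aa" is followed by 'a' 998 times and by 'b' once, which gives the .999 and .001 entries after rounding.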

    A random phrase can be generated using this frequency distribution in much the same way as before. First, a seed key is identified by choosing a random substring from the text of the correct length. Then, a random character is appended based on the key-based character distribution from the original document. Now a new key is formed by dropping the first character from the previous key (i.e., from the front) and appending the newly generated random character (i.e., to the end). This forms a new substring of the same length and the process is repeated until the desired length of phrase is generated.
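
    A minimal sketch of this generation loop follows. Several choices here are assumptions rather than requirements of the lab: the class and method names are hypothetical, the seed key itself is not included in the output, and a fresh seed is chosen if the current key occurs only at the very end of the text and so has no recorded follower.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class PhraseGeneratorSketch {
    /** Maps each key of length k to the list of characters that follow it (with repeats). */
    static Map<String, List<Character>> followers(String text, int k) {
        Map<String, List<Character>> table = new HashMap<>();
        for (int i = 0; i + k < text.length(); i++) {
            table.computeIfAbsent(text.substring(i, i + k), unused -> new ArrayList<>())
                 .add(text.charAt(i + k));
        }
        return table;
    }

    /** Picks a random substring of length k that is followed by at least one character. */
    static String randomSeed(String text, int k, Random rng) {
        int start = rng.nextInt(text.length() - k);
        return text.substring(start, start + k);
    }

    static String generate(String text, int k, int length, Random rng) {
        if (k < 0 || text.length() <= k) {
            throw new IllegalArgumentException("text must be longer than the key length");
        }
        Map<String, List<Character>> table = followers(text, k);
        String key = randomSeed(text, k, rng);
        StringBuilder phrase = new StringBuilder();
        while (phrase.length() < length) {
            List<Character> options = table.get(key);
            if (options == null) {
                // Assumption: on a dead end, restart from a new random seed.
                key = randomSeed(text, k, rng);
                options = table.get(key);
            }
            // A uniform pick from a list with repeats is a frequency-weighted pick.
            char next = options.get(rng.nextInt(options.size()));
            phrase.append(next);
            // Drop the first character of the key and append the new one.
            key = (key + next).substring(1);
        }
        return phrase.toString();
    }
}
```

    Storing every follower in a list, repeats included, keeps the random pick trivial at the cost of memory; a table of counts, such as a frequency library, holds the same information more compactly.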

    Examples

    Below are some examples generated in this manner. Each example gives the source used as original text along with the key length that was used.

    Source: Bush-Kerry Debate, 2004
    Key Length: 6
    Random Phrase: Now, you have a reply for OB/GYNs. The fact is vital, by the person who's accustomed to accept any mistakes. I am -- I have to get rid of sanctuary to make enriched uranium, while Iranians that simple that.

    Source: Bush-Kerry Debate, 2004
    Key Length: 6
    Random Phrase: We ought that when, but just yesterday that we had we joined the Internets (sic) and from over America, but let me tell us we don't think I'm rights. That's just say: Hey, we created by your right out of the great nexus. The only cause the National TV. (LAUGHTER)

    Source: Tom Sawyer
    Key Length: 7
    Random Phrase: Many men were on the old man to ask him. Tom fled on a centre of the time. The boy, he musing his high and by began to wipe his lantern that passes were soon of the Sheriff "was concerns with supper--at least. He had to see Huck shoulders

    Source: Alice in Wonderland
    Key Length: 7
    Random Phrase: `Seals, turtles, salmon, and she's the same age as himself as she could possibly hear you! You see, Miss, this he handed over to the Gryphon. `Do you think me at home,' though she knew the riddles.--I believe it,' said the Cat

    Requirements

    Your program should take 4 command-line arguments (in this order):

    1. The length of key to use.
    2. The length of phrase to generate.
    3. The number of phrases to generate.
    4. The name of the file containing the original source whose character distribution will be plagiarized.

    You may call your program anything you like. If you name your program DistributionPlagiarist, I would expect to run your program with a command line like:

      % java DistributionPlagiarist 6 300 1 littleprince.txt
    

    This command line should result in one phrase, 300 characters long, being generated based on key strings of length 6. Your program should check the command line arguments that it receives to make sure they conform to reasonable requirements (e.g., the length of phrase must be non-negative, the text file must exist and be readable). It is up to you to choose requirements that make sense. Your program should provide error messages or thorough usage information if these requirements are violated.
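
    One possible way to check the arguments is sketched below. The specific rules (exactly 4 arguments, non-negative integers, a readable regular file) and the names ArgumentCheckSketch, parseNonNegative, and checkArgs are assumptions chosen for illustration; you are free to pick different requirements.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ArgumentCheckSketch {
    static final String USAGE =
        "usage: java DistributionPlagiarist <key-length> <phrase-length> <num-phrases> <source-file>";

    /** Parses a non-negative integer or throws with a descriptive message. */
    static int parseNonNegative(String value, String name) {
        int n;
        try {
            n = Integer.parseInt(value);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(name + " must be an integer: " + value);
        }
        if (n < 0) {
            throw new IllegalArgumentException(name + " must be non-negative: " + value);
        }
        return n;
    }

    /** Validates all four arguments and returns the path of the source file. */
    static Path checkArgs(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException("expected 4 arguments, got " + args.length);
        }
        parseNonNegative(args[0], "key length");
        parseNonNegative(args[1], "phrase length");
        parseNonNegative(args[2], "number of phrases");
        Path source = Paths.get(args[3]);
        if (!Files.isRegularFile(source) || !Files.isReadable(source)) {
            throw new IllegalArgumentException("cannot read source file: " + args[3]);
        }
        return source;
    }
}
```

    A main method could call checkArgs inside a try block, print the exception message followed by USAGE to System.err when an IllegalArgumentException is caught, and exit with a non-zero status.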

    Observations

    The following observations and suggestions may be helpful in completing this lab.