Project 2

Due June 05

Intro:

This homework will attempt to answer the question: was Shakespeare really Sir Francis Bacon? To do this, we will write the following program.

  1. The program should ask for the two input files when it starts.
  2. The program should then read the text from the input files into some sort of data structure. The user should be given the option of using an AVL tree or a hash table.
  3. For each word that you read, you will want to do four things.
    1. Convert the word to lower case. You might find the StringBuffer class handy. It's like a String, but you can fiddle with the stuff inside of it. Some static functions from the Character class may also be useful.
    2. Only store the first 5 characters of any given word. We don't want the words caressed and caress to be thought of as two different words after all. This is called stemming.
    3. Remove all punctuation from a word.
    4. Finally, keep track of the number of times each word occurs in its given file.
  4. Once you have the occurrences of all the words, convert each count to a frequency, that is, a fraction of the total number of words in the file. So, if "and" appeared 10 times in a file with 1000 words, "and" has a frequency of 0.01 (1%).
  5. For each word that is found in both texts, subtract its frequency in one text from its frequency in the other. Then square this difference and add it to a running total. As mentioned yesterday, this is equivalent to the square of the Euclidean distance between the two frequency vectors.
  6. Print out the result. (As a hint: doing the calculations with a hash table and an AVL tree should not result in a different output. And believe me, I will test this.)
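
To make steps 3 through 5 concrete, here is a minimal sketch in Java. It is only one way to read the steps, not the required solution. The class and method names (WordDistance, normalize, frequencies, squaredDistance) are made up for illustration, and java.util.HashMap stands in for the AVL tree or hash table you are asked to provide. The sketch also assumes that punctuation means any character that is not a letter or digit, and that punctuation is removed before the word is cut to 5 characters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; names are illustrative, not part of the assignment.
final class WordDistance {

    // Steps 3.1-3.3: lower-case, drop anything that is not a letter or digit
    // (assumed meaning of "punctuation"), keep at most the first 5 characters.
    static String normalize(String raw) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < raw.length() && sb.length() < 5; i++) {
            char c = raw.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Steps 3.4 and 4: count each normalized word, then divide by the total word count.
    static Map<String, Double> frequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String token : text.split("\\s+")) {
            String word = normalize(token);
            if (word.isEmpty()) {
                continue; // token was blank or pure punctuation
            }
            counts.merge(word, 1, Integer::sum);
            total++;
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) total);
        }
        return freq;
    }

    // Step 5: sum of squared frequency differences over words found in both texts.
    static double squaredDistance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double d = e.getValue() - other;
                sum += d * d;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy inputs, not taken from the provided texts.
        Map<String, Double> a = frequencies("The cat, the DOG.");
        Map<String, Double> b = frequencies("the bird the the");
        // "the" is the only shared word: (0.5 - 0.75)^2
        System.out.println(squaredDistance(a, b)); // prints 0.0625
    }
}
```

One thing to watch when comparing your two data structures: a hash table and a tree will usually visit the words in different orders, and adding floating-point numbers in a different order can change the last few digits of the total.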

The question then arises: what value of the output means that they are different authors, and what value means that they are the same? That leads us to our experiment.

  1. Richard Johnson has produced a set of texts that will help you. In it, you will find texts from Shakespeare, Balzac, and others.
  2. You will first want to find out what result you typically get for texts by the same author. Compare Hamlet to Othello and Macbeth. Compare Othello to Macbeth. Average the three results. This is roughly what you would expect the difference between Shakespeare's texts to be.
  3. Then compare The New Atlantis to Macbeth, Othello, and Hamlet. Average those results.
  4. How does Bacon's average difference compare to Shakespeare's? What does this tell us about Bacon's authorship of the texts?
  5. The other texts are there just for fun. Let me know your thoughts in the email where you turn this program in.
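
The averaging in steps 2 and 3 of the experiment is plain arithmetic, sketched below under a few assumptions. The class name ExperimentAverage and the method average are made up for illustration, and the numbers passed in are arbitrary placeholders, not measured results. You would substitute the three values your own program prints for each group of comparisons.

```java
// Hypothetical helper class; names are illustrative, not part of the assignment.
final class ExperimentAverage {

    // Arithmetic mean of any number of pairwise distances.
    static double average(double... distances) {
        if (distances.length == 0) {
            throw new IllegalArgumentException("need at least one distance");
        }
        double sum = 0.0;
        for (double d : distances) {
            sum += d;
        }
        return sum / distances.length;
    }

    public static void main(String[] args) {
        // Placeholder values only; replace with the three outputs of your program.
        double groupAverage = average(0.25, 0.5, 0.75);
        System.out.println(groupAverage); // prints 0.5
    }
}
```

Run it once with the three Shakespeare-to-Shakespeare results and once with the three results for The New Atlantis, then compare the two averages.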