Javanomicon01 - Case Study 6 - Flesch Readability Index

From Monkeys @ Keyboards
Case Study 6 - Flesch Readability Index

Introduction

In this chapter we're going to pull together two of the main themes of our first File IO chapter and build an application that reads in a Stream-Based file and parses it into a format suitable for manipulation. The scenario is that we are building an application to compute the Flesch Readability Index of any given text document.

The Flesch Readability Index (henceforth referred to as FRI) is a numerical representation of how easy (or difficult) it is to read a particular piece of text. We are going to write an application with a simple interface that allows us to select a file to be analysed, and then reads it into our program, parses and computes the readability index, and then outputs that number along with a string representation of the text.

The FRI is calculated according to the following formula:

index = 206.835 - 84.6 * (number of syllables in document / number of words in document)
                - 1.015 * (number of words in document / number of sentences in document)

The higher the index, the more readable the document. The lower, the more difficult the document is to read.

For the purposes of this application, we will define the number of syllables as being based on the number of vowels within words that separate consonants. Pairs of vowels and diphthongs are treated as a single vowel, and instances of the letter e at the end of a word are ignored. We will also be treating the letter y as a vowel.

For example:

  • JAvA - Two syllables
  • Is - One syllable
  • GrEAt - One syllable (the vowel pair EA counts once)
  • And - One syllable
  • ExcItIng - Three syllables

So, that's our scenario. Let's get cracking!

The View

The interface for this case study is very simple. We require a label or a text area for when we output the string representation of the file we opened. We need a button that will display a JFileChooser dialog, and then another button that allows us to calculate the FRI. We will be using the BorderLayout manager for this, so let's have a look at our storyboard:

21.1: Flesch Readability Storyboard

From this diagram, it should be a simple task to construct our interface code. We need to use File IO in this application, so we will have to make use of the free standing application structure, which means extending a JFrame as opposed to a JApplet.

In previous case studies we have had to draw out complicated diagrams that represented exactly where each component was going to go and how big it was going to be. When we discussed layout managers in chapter 14, we freed ourselves from this complicated routine. This allows us to concentrate on where components should go rather than how they actually get there.

Our storyboard above shows a relationship between the main container, any panels, and the components. This alone is enough to allow us to construct the specified interface in code, as such:

import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
 
public class FleschIndexInterface extends JFrame {
 
    JTextArea myText;
    JButton open, analyse;
    JPanel myButtons;
 
    public static void main(String args[]) {
        FleschIndexInterface mainWindow = new FleschIndexInterface();
        mainWindow.setSize(300, 300);
        mainWindow.setTitle("Flesch Readability Index");
        mainWindow.setVisible(true);
    }
 
    public FleschIndexInterface() {
        myText = new JTextArea(10, 10);
        open = new JButton("Open File");
        analyse = new JButton("Analyse File");
        myButtons = new JPanel();
        myButtons.setLayout(new FlowLayout());
        add(myText, BorderLayout.CENTER);
        myButtons.add(open);
        myButtons.add(analyse);
        add(myButtons, BorderLayout.SOUTH);
    }
}

Every day, in every way, we're getting better and better!

When we compile and run this application, we can see what our interface is going to look like:

21.2: Application Screenshot

It's not fancy, but it will do exactly what we need. In effect, we've completed the view part of our application. What is left now is the model and the controller.

The Controller

Our application looks the part, but it doesn't actually do anything yet. We don't even have any instances of ActionListener registered for our buttons. Step one is updating what we've written so far a little so that we're implementing ActionListener, providing an empty actionPerformed method, and registering ActionListener objects for each of the buttons:

import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
 
public class FleschIndexInterface extends JFrame implements ActionListener {
 
    JTextArea myText;
    JButton open, analyse;
    JPanel myButtons;
 
    public static void main(String args[]) {
        FleschIndexInterface mainWindow = new FleschIndexInterface();
        mainWindow.setSize(300, 300);
        mainWindow.setTitle("Flesch Readability Index");
        mainWindow.setVisible(true);
    }
 
    public FleschIndexInterface() {
        myText = new JTextArea(10, 10);
        open = new JButton("Open File");
        open.addActionListener(this);
        analyse = new JButton("Analyse File");
        analyse.addActionListener(this);
        myButtons = new JPanel();
        myButtons.setLayout(new FlowLayout());
        add(myText, BorderLayout.CENTER);
        myButtons.add(open);
        myButtons.add(analyse);
        add(myButtons, BorderLayout.SOUTH);
    }
 
    public void actionPerformed(ActionEvent e) {
    }
}

Now that we have this basic framework, we can concentrate on the controller logic, which will all go into the actionPerformed method.

We don't have anything to compute the index yet, but we do know that when the open button is pressed we should flash up a JFileChooser dialog:

    public void actionPerformed(ActionEvent e) {
        JFileChooser myChooser;
        if (e.getSource() == open) {
            myChooser = new JFileChooser();
            myChooser.showOpenDialog(this);
        }
    }

We need to refine this slightly so that the showOpenDialog call is triggered within an if structure that makes sure the user clicked the APPROVE_OPTION button. If they did, we'll put their selected file into a File object that represents the user's choice. We need to import java.io for this part. The File variable declaration should be placed at the top of the file with the definitions for the components:


import javax.swing.*;
import java.awt.*;
import java.awt.event.*;
import java.io.*;
 
public class FleschIndexInterface extends JFrame implements ActionListener {
 
    JTextArea myText;
    JButton open, analyse;
    JPanel myButtons;
    File selectedFile;
    ...
 
    public void actionPerformed(ActionEvent e) {
        JFileChooser myChooser;
        if (e.getSource() == open) {
            myChooser = new JFileChooser();
            if (myChooser.showOpenDialog(this)
                    == JFileChooser.APPROVE_OPTION) {
                selectedFile = myChooser.getSelectedFile();
                System.out.println("The user selected file "
                        + selectedFile.getAbsolutePath());
            }
        }
    }

That's given us the framework we need to open a file chooser and select a file... although as yet we're not doing anything with it. The code for opening a file, reading its contents and parsing it is not really part of the controller component of this application - all that we should be providing at this point is the means for the user to interact with our model. We've allowed them to select their file, but we can't let them analyse it until we've written the code to do that.

With that in mind, onto the model!

The Model - File I/O

We're going to add a new object to our application - one that will take a File object as a constructor parameter and then analyse it according to the FRI. We'll call this class FleschComputer, because I think it sounds funny:

import java.io.*;
 
public class FleschComputer {
 
    File workingFile;
 
    public FleschComputer(File myFile) {
        workingFile = myFile;
    }
}

Within this class, we need to parse the provided file into a suitable string format, so we'll give ourselves another method called openFile that does just that.

We already have a file reference, so we can use that as the basis for our FileReader object:

FileReader myReader = new FileReader (workingFile);

And then this myReader object becomes the basis for our BufferedReader:

BufferedReader in = new BufferedReader (myReader);

And then we adopt the loop structure we discussed in our first chapter on File IO. We need a String variable that holds all of our input to date, and one that holds the current line. The line variable will be a local variable, and the allInput variable will be an instance variable, declared along with workingFile:

import java.io.*;
 
public class FleschComputer {
 
    File workingFile;
    String allInput = "";
 
    public FleschComputer(File myFile) {
        workingFile = myFile;
    }
 
    public void openFile() {
        String line = "";
        try {
            FileReader myReader = new FileReader(workingFile);
            BufferedReader in = new BufferedReader(myReader);
            line = in.readLine();
            while (line != null) {
                allInput = allInput + line + "\n";
                line = in.readLine();
            }
            in.close();
        } catch (IOException ex) {
            System.out.println("A horrible exception has occurred!");
        }
 
    }
}
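As an aside, newer versions of Java let us write the same read loop so that the reader is closed for us even when something goes wrong. What follows is only a sketch of that idea, not a replacement for the class above: the ReadFileSketch class, the readAll method and the temporary file in main are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class ReadFileSketch {

    // Reads every line of the file into one string, ending each line with "\n"
    // (the same shape of result that openFile builds up in allInput).
    static String readAll(File source) throws IOException {
        StringBuilder all = new StringBuilder();
        // try-with-resources closes the reader even if readLine throws
        try (BufferedReader in = new BufferedReader(new FileReader(source))) {
            String line = in.readLine();
            while (line != null) {
                all.append(line).append("\n");
                line = in.readLine();
            }
        }
        return all.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo: write a small temporary file, then read it back
        File temp = File.createTempFile("flesch", ".txt");
        temp.deleteOnExit();
        try (FileWriter out = new FileWriter(temp)) {
            out.write("Java is great.\nAnd exciting!");
        }
        System.out.print(readAll(temp));
    }
}
```

The StringBuilder is a design choice rather than a requirement: it avoids creating a new String for every line, which starts to matter for long documents.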

Now we have our data in a string format ready for manipulation, which means we no longer need our File object. Hrm... we really shouldn't have that as an instance variable, since we only need it when we're opening a file.

Instead, we'll pass it as a parameter to our method, so the object doesn't hold on to a reference it no longer needs. We'll then call openFile directly from the constructor method:

    public FleschComputer(File myFile) {
        openFile(myFile);
    }
 
    public void openFile(File workingFile) {
        ...
    }

Now comes the difficult bit - parsing this string into a readability index.

The Model - Parsing

There are a number of things we need to work out in order to get our FRI value. Remember our calculation:

index = 206.835 - 84.6 * (number of syllables in document / number of words in document)
                - 1.015 * (number of words in document / number of sentences in document)

So we need to find:

  • The number of syllables in the document
  • The number of words in the document
  • The number of sentences in the document

We'll provide a method that calculates each of these and returns the correct value as an integer:

  • countSyllables
  • countWords
  • countSentences

Both the countWords and countSentences methods should be quite easy to code - a sentence ends with a full stop, an exclamation mark or a question mark. Actually, that's not quite right since it will mean ellipses (three full stops in a row) will be counted as three sentences. So the calculation is actually a little bit more complicated than that. First, let's deal with the easy bit:

    int countSentences() {
        char temp;
        int counter = 0;
        for (int i = 0; i < allInput.length(); i++) {
            temp = allInput.charAt(i);
            if (temp == '?' || temp == '!' || temp == '.') {
                counter += 1;
            }
        }
        return counter;
    }

Once we have the easy bit, we can worry about the ellipses. We know we have an ellipsis (or some other non-standard sentence ending) when we have found a full stop and the character after that one is also a full stop... so we simply don't count any of these towards the number of sentences. First we make sure that there is a next character:

        if (i < allInput.length() - 1) {
        }

And then we store the next character if there is one in a variable called next:

next = allInput.charAt (i + 1);

This gives us the following structure:

        if (temp == '.') {
            if (i < allInput.length() - 1) {
                next = allInput.charAt(i + 1);
            }
            if (next == '.') {
                continue;
            }
        }

But that won't work quite right. Consider an ellipsis located in the middle of a string (starting at position 11):

21.3: Ellipses in a String

We start at position 11, and find a full stop. Then we check the next character (at position 12), which is also a full stop. So we don't count it.

We move to position 12, and find a full stop. We check the next character (at position 13), which is also a full stop. We don't count that one either.

We move to position 13, and find a full stop. But the next character (at position 14) here isn't a full stop and so this character gets counted towards the number of sentences. Alas, this is not correct since it is still part of the same ellipsis.

If there is no full stop as the next character, we also need to check to make sure there was no full stop as the previous character - only then can we be sure that ellipses are not counted as the end of a sentence:

    int countSentences() {
        char temp;
        char next = 0;
        char prev = 0;
        int counter = 0;
        for (int i = 0; i < allInput.length(); i++) {
            temp = allInput.charAt(i);
            if (temp == '?' || temp == '!') {
                counter += 1;
            }
            if (temp == '.') {
                if (i < allInput.length() - 1) {
                    next = allInput.charAt(i + 1);
                }
                if (next == '.') {
                    continue;
                }
                if (i > 0) {
                    prev = allInput.charAt(i - 1);
                }
                if (prev == '.') {
                    continue;
                }
                counter += 1;
            }
        }
        return counter;
    }
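If you want to try the ellipsis logic without building the whole application, here is a minimal stand-alone sketch. It assumes the text is passed in as a parameter rather than read from allInput, and the SentenceCountSketch class name is invented for the example. The two boolean checks do the same job as the next and prev variables above.

```java
class SentenceCountSketch {

    // Same rules as countSentences, but taking the text as a parameter
    static int countSentences(String text) {
        int counter = 0;
        for (int i = 0; i < text.length(); i++) {
            char temp = text.charAt(i);
            if (temp == '?' || temp == '!') {
                counter += 1;
            } else if (temp == '.') {
                // A full stop with another full stop on either side
                // is part of an ellipsis, so it is not counted
                boolean dotAfter = i < text.length() - 1
                        && text.charAt(i + 1) == '.';
                boolean dotBefore = i > 0 && text.charAt(i - 1) == '.';
                if (!dotAfter && !dotBefore) {
                    counter += 1;
                }
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Wait... what? Yes.")); // prints 2
        System.out.println(countSentences("One. Two! Three?"));   // prints 3
    }
}
```

Note that under these rules a sentence that ends in an ellipsis is not counted at all, which is the behaviour the chapter has chosen.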

So, that's the code we need to ensure the number of sentences is calculated. Now let's think about the number of words. This is a similar procedure, except that we count spaces as being the indicator between words. We won't worry about non-standard documents where there are no spaces between punctuation and letters, for example:

And so,it seems,that this is quite difficult to read.I certainly wouldn't want to read this!

Our countWords method is going to be pretty simple:

    private int countWords() {
        int counter = 0;
        char temp;
        for (int i = 0; i < allInput.length(); i++) {
            temp = allInput.charAt(i);
            if (temp == ' ') {
                counter += 1;
            }
        }
        return counter;
    }
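One thing to be aware of: counting spaces gives one fewer than the number of words on a line, and it misses the line breaks that openFile puts between lines. A possible alternative, sketched here under the assumption that we are happy to treat any run of whitespace as a word boundary, is to count tokens instead. The WordCountSketch class and its method names are made up for this illustration.

```java
import java.util.StringTokenizer;

class WordCountSketch {

    // Counts words by counting tokens. With no delimiter argument the
    // tokenizer splits on spaces, tabs and line breaks.
    static int countWords(String text) {
        return new StringTokenizer(text).countTokens();
    }

    // The space-counting approach from the chapter, for comparison
    static int countSpaces(String text) {
        int counter = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ' ') {
                counter += 1;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        String text = "Java is great.\nAnd exciting!";
        System.out.println(countWords(text));  // prints 5
        System.out.println(countSpaces(text)); // prints 3
    }
}
```

For a long document the difference is small next to the total, but for short samples it can move the index noticeably.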

All this leaves us to do is calculate the number of syllables. This is a bit trickier.

We can do this in the same way we have done above, by simply crunching through every letter... but the logic for doing this is quite complicated when it starts relating to the letter e at the end of a word. Instead, we'll use a StringTokenizer, as we discussed way back when at the dawn of time (in chapter eight). We'll tokenize the input based on a space, and then calculate the number of syllables in each word.

First we need a method that tells us whether a letter is a vowel. This is a simple method:

    private boolean isVowel(char temp) {
        if (temp == 'a' || temp == 'e' || temp == 'i'
                || temp == 'o' || temp == 'u'
                || temp == 'y') {
            return true;
        }
        return false;
    }

And we would also benefit greatly from having a method that returned the number of syllables in a particular word. Let's start with a simple implementation of this that simply counts the number of vowels without any other considerations.

    private int countSyllablesInWord(String word) {
        int counter = 0;
        char temp;
        for (int i = 0; i < word.length(); i++) {
            temp = word.charAt(i);
            if (isVowel(temp)) {
                counter += 1;
            }
        }
        return counter;
    }

Then we can worry about discounting the letter e at the end of a word:

    private int countSyllablesInWord(String word) {
        int counter = 0;
        char temp;
        for (int i = 0; i < word.length(); i++) {
            temp = word.charAt(i);
            if (temp == 'e' && i == word.length() - 1) {
                continue;
            } else if (isVowel(temp)) {
                counter += 1;
            }
        }
        return counter;
    }

The next part is counting diphthongs as a single vowel... we use a similar technique for this as the one we used to avoid ellipses. All we need in this case though is to check the next letter - we want them to be counted at least once:

    private int countSyllablesInWord(String word) {
        int counter = 0;
        char temp;
        char next;
        for (int i = 0; i < word.length(); i++) {
            temp = word.charAt(i);
            if (i < word.length() - 1) {
                next = word.charAt(i + 1);
            } else {
                next = 0;
            }
            if (temp == 'e' && i == word.length() - 1) {
                continue;
            } else if (isVowel(temp) && !isVowel(next)) {
                counter += 1;
            }
        }
        return counter;
    }

Finally, we need a little bit at the end to make sure every word has at least one syllable. Just before we return from the method:

if (counter == 0) {
  counter = 1 ;
}
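Putting those pieces together, the following stand-alone sketch lets us check the syllable rules against a few words. It follows the method above with the minimum-of-one check added; the SyllableSketch class name and the compact isVowel are just for this example.

```java
class SyllableSketch {

    // Treats a, e, i, o, u and y as vowels (lower case only)
    static boolean isVowel(char c) {
        return "aeiouy".indexOf(c) >= 0;
    }

    static int countSyllablesInWord(String word) {
        int counter = 0;
        for (int i = 0; i < word.length(); i++) {
            char temp = word.charAt(i);
            // Use 0 as a "no next letter" marker at the end of the word
            char next = (i < word.length() - 1) ? word.charAt(i + 1) : 0;
            if (temp == 'e' && i == word.length() - 1) {
                continue; // a trailing e is ignored
            } else if (isVowel(temp) && !isVowel(next)) {
                counter += 1; // only the last vowel of a run is counted
            }
        }
        // Every word has at least one syllable
        if (counter == 0) {
            counter = 1;
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(countSyllablesInWord("java"));     // prints 2
        System.out.println(countSyllablesInWord("read"));     // prints 1
        System.out.println(countSyllablesInWord("exciting")); // prints 3
        System.out.println(countSyllablesInWord("the"));      // prints 1
    }
}
```

The last case shows why the minimum matters: the only vowel in "the" is a trailing e, so without the check it would score zero.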

And then we need a method that calculates the number of syllables for each word and totals it all up. Here is where our StringTokenizer comes in. We need to import java.util to get the tokenizer, but once we have it we are simply brimming over with Cosmic Power:

    private int countSyllables() {
        int counter = 0;
        String token;
        StringTokenizer myToken = new StringTokenizer(allInput, " ");
        while (myToken.hasMoreTokens()) {
            token = myToken.nextToken();
            counter += countSyllablesInWord(token.toLowerCase());
        }
        return counter;
    }

And then, we put it all together!

We have our formula, we have the methods that compute each element of it, so we'll have a final method: computeIndex that will return the reading index for this piece of text. This is going to be the interface to our class - the developer will create an instance passing in a File object, and then call computeIndex to get the FRI out of it. There is no need for the developer to ensure that the methods we coded above are chained together in an appropriate way:

    public double computeIndex() {
        int sentences = countSentences();
        int syllables = countSyllables();
        int words = countWords();
        double index;
        index = 206.835 - 84.6 * (syllables / words) - 1.015 * (words / sentences);
        return index;
    }

However, this won't work quite correctly due to the way Java converts between ints and doubles... we need to give Java some guidance to help it come up with the correct calculation by casting all our ints to doubles:

    public double computeIndex() {
        int sentences = countSentences();
        int syllables = countSyllables();
        int words = countWords();
        double index;
        index = 206.835 - (84.6 * ((double) syllables
                / (double) words))
                - (1.015
                * ((double) words / (double) sentences));
        return index;
    }
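To see why the casts matter, it helps to look at the division on its own. In this small sketch the counts (140 syllables in 100 words) are invented, and the DivisionSketch class and its two methods exist only for the demonstration:

```java
class DivisionSketch {

    // Both operands are ints, so Java divides as ints (dropping the
    // fraction) and only then widens the result to a double
    static double intRatio(int a, int b) {
        return a / b;
    }

    // Casting one operand first forces a floating point division
    static double realRatio(int a, int b) {
        return (double) a / b;
    }

    public static void main(String[] args) {
        int syllables = 140; // hypothetical count
        int words = 100;     // hypothetical count
        System.out.println(intRatio(syllables, words));  // prints 1.0
        System.out.println(realRatio(syllables, words)); // prints 1.4
    }
}
```

Since that ratio is multiplied by 84.6, losing the 0.4 would shift the index by more than thirty points.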

And having done this, we then want to hook it back into our application. When we press the analyse button, we want to create an instance of our FleschComputer class and then call computeIndex:

    public void actionPerformed(ActionEvent e) {
        JFileChooser myChooser;
        FleschComputer myCompute;
        if (e.getSource() == open) {
            myChooser = new JFileChooser();
            if (myChooser.showOpenDialog(this)
                    == JFileChooser.APPROVE_OPTION) {
                selectedFile = myChooser.getSelectedFile();
            }
        } else {
            if (selectedFile == null) {
                JOptionPane.showMessageDialog(null,
                        "You need to choose a file first!");
            } else {
                myCompute = new FleschComputer(selectedFile);
                JOptionPane.showMessageDialog(null,
                        "The Flesch index is "
                        + myCompute.computeIndex());
            }
        }
    }

All that's left to do is display the text we parsed... we have no method for getting that text as yet, but we simply add a getText method to our FleschComputer:

public String getText() {
  return allInput ;
}

And then when we're done computing, we call that method and put the returned value into our text area:

        myCompute = new FleschComputer(selectedFile);
        myText.setText(myCompute.getText());
        JOptionPane.showMessageDialog(null, "The Flesch index is " + myCompute.computeIndex());

And that's us finished the core application!

Revisions and Tweaks

There are a number of things we could do with our FleschComputer to improve it from an object oriented standpoint. For one thing, our application will only read a file from disk... but we have a nice text area there that we might want to enter things into for a quick analysis of their readability. We can handle that easily by adding an overloaded constructor method:

    private String allInput = "";
 
    public FleschComputer(File myFile) {
        openFile(myFile);
    }
 
    public FleschComputer(String myText) {
        allInput = myText;
    }

It's a good idea to make our class as general as possible to aid in reusability. Requiring file IO means that we restrict its use to applications - it's possible though that we might want to also provide a facility for applets to make use of the FleschComputer, and so a range of constructor methods is preferable.

We could also benefit from providing a range of utility methods that provide information in a range of formats. Being able to compute the index itself is fine, but that requires that we understand what the numbers represent. Maybe we could also provide a method that computes the index but returns a meaningful string instead of a cold, impersonal number:

    public String getTextDescription() {
        int index = (int) computeIndex();
        if (index >= 100) {
            return "The text is very easy to read.";
        } else if (index >= 80) {
            return "The text is quite easy to read.";
        } else if (index >= 60) {
            return "The text is moderately easy to read.";
        } else if (index >= 40) {
            return "The text is not particularly easy to read.";
        } else if (index >= 20) {
            return "The text is of university level difficulty";
        } else {
            return "The document is almost impenetrable.";
        }
    }

Returning to the idea of more constructors being better, we could also provide a constructor method that allowed the user to analyse a web-page for readability - we discussed how to do that in our chapter on file IO.

The difficulty with web-pages is the large number of special control tags that are used - these are invisible to the reader, but would still be counted in our readability score. That's not an ideal situation.

However, we can apply what we learned about regular expressions in chapter 8.9 and get rid of them before we analyse. An HTML tag is contained between a pair of angle brackets, so we can use the following regular expression to match them:

<.*>

We can feed that into our replace method which will remove all instances of the tags:

        allInput = allInput.replaceAll("<.*>", "");

It'll even get rid of closing tags! In fact, running it will show a problem - it will get rid of everything in the document! Alas!

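This is because the star metacharacter is greedy... it's not satisfied when it finds the first matching closing bracket... it continues until it finds the last closing bracket it can reach. (By default the period in a Java regular expression does not match a line break, so the damage is done one line at a time; when the whole document is held in a single string with no line breaks, the match runs from the first bracket to the very last.) Consider a simple example: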

  <html>
	<head>
		<title>Example</title>
	</head>
	<body>
		This is an example HTML file.
	</body>
</html>

It finds the first matching element of the regular expression:

  It finds a match here <html>
	<head>
		<title>Example</title>
	</head>
	<body>
		This is an example HTML file.
	</body>
</html>

And then continues through the document, with the period metacharacter matching every character (including closing brackets). It's only when it finds the last bracket that it will interpret it as a match:

It finds a match here <html>
	<head>
		<title>Example</title>
	</head>
	<body>
		This is an example HTML file.
	</body>
</html>It finds the closing match here.

To deal with this, we need to mark the star character as lazy - we do this with a ? symbol. A lazy expression will be happy with the first match it finds:


        allInput = allInput.replaceAll("<.*?>", "");
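The difference is easy to see on a single line of HTML. This sketch runs both expressions over the same string; the TagStripSketch class, its method names and the sample string are invented for the example.

```java
class TagStripSketch {

    // Greedy: .* grabs as much as it can before the final >
    static String stripGreedy(String html) {
        return html.replaceAll("<.*>", "");
    }

    // Lazy: .*? stops at the first > it can find
    static String stripLazy(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        String html = "<title>Example</title> This is an example.";
        System.out.println(stripGreedy(html)); // prints " This is an example."
        System.out.println(stripLazy(html));   // prints "Example This is an example."
    }
}
```

The greedy version throws away the word Example along with the tags, because everything from the first opening bracket to the last closing bracket counts as one match.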

Our replace method call won't get rid of all the stray characters, but it'll get rid of enough of them. The rest are left as an exercise for the reader.

We can now provide a third constructor that allows for our computer to work on remote HTML resources:

    // Requires import java.net.*; alongside java.io.*
    public FleschComputer(URL myURL) {
        String line;
        try {
            URLConnection myConn = myURL.openConnection();
            InputStream is = myConn.getInputStream();
            InputStreamReader myRead = new InputStreamReader(is);
            BufferedReader in = new BufferedReader(myRead);
            line = in.readLine();
            while (line != null) {
                // Keep a space between lines so that the last word of
                // one line doesn't run into the first word of the next
                allInput = allInput + line + " ";
                line = in.readLine();
            }
            in.close();
        } catch (IOException ex) {
            // If the connection fails we simply have no text to analyse
            System.out.println("Could not read from " + myURL);
        }
        allInput = allInput.replaceAll("<.*?>", "");
    }

This harkens back to the idea of file parsing that we discussed briefly in chapter 12. There is no standard mechanism for parsing files... it's an application-dependent process. A good grounding in string parsing is required to open up the doors of functionality that are otherwise closed to us.

Conclusion

As you can see from this particular application, the file access part is only one small aspect. We read the file into a string, and then we have to do all the Back-Breaking Labour of actually parsing it into useful information. This is true for most applications involving file IO - it's really just a slightly more complex variable.

This particular example reintroduced us to the StringTokenizer we discussed way back in chapter 8. The way we use this tokenizer is identical regardless of whether the original string came from a file or a JTextArea.

Spending time practising string parsing will have an immediate effect on how easily you are able to parse stream based IO files into a useful format.