2.17.2007

Experimenting with words

Jeff Clark has an interesting post at Neoformix on graphing word frequencies. Though the technique is admittedly a first attempt, it does offer a fairly good gestalt of a document's overall content.
The process:
  1. Convert text into a stream of words
  2. Throw away 'stop words' [aka "function words"] (e.g., a, the, and, of, by)
  3. Count the frequency of remaining words
  4. Draw ovals for each word, scaled to reflect its frequency
  5. Connect words appearing consecutively in the text (not counting 'stop words')
  6. Discard ovals of words appearing fewer than 9 times
  7. Position ovals with a spring-embedded algorithm
What I find most interesting is that, with the possible exception of Step 5, each step is easily accomplished using common or freely available software (e.g., MS Word and Excel for text processing, Pajek for network visualization).
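For anyone who would rather script the process than work through Word and Excel, here is a minimal sketch in Python, assuming the networkx and matplotlib libraries. The stop-word list is a toy sample, "document.txt" is a hypothetical input file, and the nodes are drawn as circles rather than true scaled ovals:

```python
import re
from collections import Counter

import matplotlib.pyplot as plt
import networkx as nx

STOP_WORDS = {"a", "the", "and", "of", "by", "to", "in", "is", "it"}  # toy sample
MIN_COUNT = 9  # Step 6's threshold

def word_graph(text):
    # Steps 1-2: convert to a stream of words, throwing away stop words
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)                      # Step 3: word frequencies
    g = nx.Graph()
    for w, n in counts.items():                  # Steps 4 and 6 combined:
        if n >= MIN_COUNT:                       # keep only frequent words
            g.add_node(w, weight=n)
    for a, b in zip(words, words[1:]):           # Step 5: consecutive words
        if a in g and b in g and a != b:
            g.add_edge(a, b)
    return g

g = word_graph(open("document.txt").read())      # hypothetical input file
pos = nx.spring_layout(g, seed=1)                # Step 7: spring embedding
nx.draw(g, pos, with_labels=True,
        node_size=[200 * g.nodes[w]["weight"] for w in g])
plt.show()
```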

Addendum: Actually, the entire process is easily replicated; most steps can be partly automated using "Search and Replace" and COUNTIF functions in MS Word and Excel, respectively. After isolating high-frequency words, bold them in the original document and visually scan for adjacency, deleting "extraneous" (i.e., low-frequency and non-adjacent) terms. I found the graphing easier in UCInet, an inexpensive SNA tool that is both well suited to smaller networks and easier to use than Pajek.

In discussing this with a colleague, we noted that it might be useful to define "adjacency" differently (e.g., at the sentence or paragraph level). This would be easy to accomplish with visual scans, though that approach clearly has scaling limits. However, if you convert the original document so that each "adjacency unit" becomes a separate row, it is possible to combine Excel's COUNTIF and SEARCH functions to produce word frequencies. If the first column holds the original text, you can count up to 255 separate strings across the remaining columns. With a bit of creativity, you can automatically identify specific pairings, as sketched below.
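Here is a rough sketch of that row-per-unit idea, with sentences as the "adjacency units." The sentence splitter is deliberately naive, and the term list and file name are hypothetical; each sentence plays the role of an Excel row, and the pair counts stand in for the COUNTIF/SEARCH combination:

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(text, terms):
    # Naive split on sentence-ending punctuation; each sentence is one "row"
    sentences = re.split(r"[.!?]+", text.lower())
    pairs = Counter()
    for s in sentences:
        present = {t for t in terms if t in s}   # which terms appear in this row
        for a, b in combinations(sorted(present), 2):
            pairs[(a, b)] += 1                   # count the specific pairing
    return pairs

# Which high-frequency terms share a sentence, and how often?
print(cooccurrence(open("document.txt").read(), ["word", "frequency", "graph"]))
```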

While the network graphics are interesting, I suspect it would be more useful to compare measures such as centrality and geodesic distance between specific terms across multiple documents.
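As a sketch of what that comparison might look like, assuming one word graph per document built as above; networkx supplies both measures, with geodesic distance as shortest path length:

```python
import networkx as nx

def compare_terms(graphs, term_a, term_b):
    # graphs: {document name: word graph}, built as in the earlier sketch
    for name, g in graphs.items():
        centrality = nx.degree_centrality(g)
        try:
            geodesic = nx.shortest_path_length(g, term_a, term_b)
        except (nx.NodeNotFound, nx.NetworkXNoPath):
            geodesic = None  # term missing, or no connecting path
        print(name, centrality.get(term_a), centrality.get(term_b), geodesic)
```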
