The process (a code sketch follows the list):
- Convert text into a stream of words
- Throw away 'stop words' (a.k.a. 'function words'), e.g., a, the, and, of, by
- Count the frequency of remaining words
- Draw an oval for each word, scaled to reflect its frequency
- Connect words appearing consecutively in the text (not counting 'stop words')
- Discard ovals of words appearing fewer than 9 times
- Position ovals with a spring-embedded algorithm
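As a rough sketch, the whole pipeline fits in a few lines of Python using the networkx and matplotlib libraries; the stop-word list and regex tokenizer here are placeholder assumptions, and min_count mirrors the threshold of 9 above:

```python
import re
from collections import Counter

import matplotlib.pyplot as plt
import networkx as nx

# Placeholder stop-word list; the post names only a few examples.
STOP_WORDS = {"a", "the", "and", "of", "by", "to", "in", "is", "it", "for"}

def word_network(text, min_count=9):
    # Convert text into a stream of words, dropping stop words.
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    freq = Counter(words)

    # Connect words appearing consecutively (stop words already removed).
    edges = Counter(zip(words, words[1:]))

    # Discard words appearing fewer than min_count times.
    keep = {w for w, n in freq.items() if n >= min_count}
    g = nx.Graph()
    for (a, b), n in edges.items():
        if a in keep and b in keep and a != b:
            g.add_edge(a, b, weight=n)
    return g, freq

def draw(g, freq):
    # Spring-embedded layout (Fruchterman-Reingold, as implemented in
    # networkx), with node size scaled to word frequency.
    pos = nx.spring_layout(g, seed=42)
    nx.draw(g, pos, with_labels=True,
            node_size=[300 * freq[w] for w in g.nodes])
    plt.show()
```

Running g, freq = word_network(open("doc.txt").read()) and then draw(g, freq) on a long document should reproduce the ovals-and-springs picture, give or take the exact stop-word list.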
Addendum: Actually, the entire process is easily replicated; most steps can be partly automated using "Search and Replace" in MS Word and COUNTIF functions in Excel. After isolating high-frequency words, bold them in the original document and visually scan for adjacency, deleting "extraneous" (i.e., low-frequency and non-adjacent) terms. I found the graphics easier to produce in UCInet, an inexpensive SNA tool that is well suited to smaller networks and easy to use.
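If you would rather hand the drawing off to an SNA tool as described, a hedged sketch (reusing the graph g built above, with the file name as a placeholder):

```python
import csv

def export_edge_list(g, path="edges.csv"):
    # Weighted edge list built from the graph above; most SNA tools can
    # import tabular data like this (for UCInet, paste it into the matrix
    # spreadsheet editor or convert it to UCInet's DL format).
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target", "weight"])
        for a, b, data in g.edges(data=True):
            writer.writerow([a, b, data.get("weight", 1)])
```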
In discussing this with a colleague, we noted that it might be useful to define "adjacency" differently (e.g., by sentence or paragraph rather than by consecutive words). This would be easy to accomplish with visual scans, though it clearly has scaling limits. However, if you convert the original document so that each "adjacency unit" becomes a separate row, you can combine Excel's COUNTIF and SEARCH functions to produce word frequencies. If the first column holds the original text, you can count up to 255 separate strings across the remaining columns. With a bit of creativity, you can automatically identify specific pairings.
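The same idea is straightforward outside Excel as well; here is a sketch in Python, where the sentence and paragraph splitting rules are simplifying assumptions:

```python
import re
from collections import Counter
from itertools import combinations

def unit_cooccurrence(text, unit="sentence", stop_words=frozenset()):
    # Each sentence (or paragraph) is one "adjacency unit", the code
    # analog of putting units on separate rows in Excel. The splitting
    # rules below are simplifications.
    if unit == "paragraph":
        units = re.split(r"\n\s*\n", text)
    else:
        units = re.split(r"[.!?]+", text)

    pairs = Counter()
    for u in units:
        words = {w for w in re.findall(r"[a-z']+", u.lower())
                 if w not in stop_words}
        # Count each unordered pair once per unit, like COUNTIF over
        # rows but for pairings rather than single strings.
        pairs.update(combinations(sorted(words), 2))
    return pairs
```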
While the network graphics are interesting, I suspect it would be more useful to compare measures such as centrality and geodesic distance between specific terms across multiple documents.
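networkx provides both measures directly; a sketch, where compare_terms and its arguments are illustrative names:

```python
import networkx as nx

def compare_terms(g, term_a, term_b):
    # Degree centrality for each term plus the geodesic (shortest-path)
    # distance between them; run per document and compare the results.
    centrality = nx.degree_centrality(g)
    try:
        distance = nx.shortest_path_length(g, term_a, term_b)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        distance = None  # terms absent or in disconnected components
    return centrality.get(term_a), centrality.get(term_b), distance
```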