Stata/Natural Language Processing

Reading a text file
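If each line is short (fewer than 244 characters, the maximum length of a string variable in Stata 12 and earlier), one can use insheet, which reads the text file into Stata's memory. From Stata 13 onward, import delimited supersedes insheet and string variables can be much longer.
. insheet using toto.txt, clear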

String functions
First, have a look at the list of string functions already included in Stata.
. h string functions
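As a minimal sketch of what these functions do, the example below applies a few of them to two made-up sentences. The variable names (sentence, nwords, first, pos_o, under) and the data are assumptions chosen for illustration only.

```stata
clear
input str40 sentence
"The quick brown fox"
"jumps over the lazy dog"
end

* Number of space-separated words in each line
generate nwords = wordcount(sentence)

* First word of each line
generate first = word(sentence, 1)

* Position of the first "o" in the line (0 if absent)
generate pos_o = strpos(sentence, "o")

* Replace every space with an underscore ("." means all occurrences)
generate under = subinstr(sentence, " ", "_", .)

list
```

These functions work observation by observation, so the same expressions apply unchanged to a text file that has been read with one line per observation.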

Regular Expressions
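Stata includes three functions for regular expressions: regexm(s, re) tests whether the string s matches the pattern re, regexr(s1, re, s2) replaces the first match of re in s1 with s2, and regexs(n) returns the n-th parenthesized subexpression from the most recent regexm() match.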
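The three functions can be sketched on a small made-up dataset. The addresses and the variable names (address, haszip, zip, nozip) are hypothetical, and the pattern assumes a five-digit code at the end of the string.

```stata
clear
input str40 address
"4905 Lakeway Drive, TX 77845"
"673 Jasmine Street, CA 90024"
"no zip code here"
end

* regexm() returns 1 when the pattern matches and 0 otherwise
generate haszip = regexm(address, "[0-9][0-9][0-9][0-9][0-9]$")

* regexs(1) returns the first parenthesized subexpression of the last regexm() match
generate zip = regexs(1) if regexm(address, "([0-9][0-9][0-9][0-9][0-9])$")

* regexr() replaces the first match of the pattern with the replacement string
generate nozip = regexr(address, " [0-9][0-9][0-9][0-9][0-9]$", "")

list
```

Because regexs() only has a value after a successful regexm(), the extraction is written with regexm() in the if qualifier; observations that do not match are left empty.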


 * Regular Expressions in Stata on the UCLA Stata website
 * Regular Expressions on the Stata website

Wordscores
Ken Benoit, Michael Laver and Will Lowe have developed Wordscores, a set of Stata commands that read text files, count the frequency of each word, and compute an index of similarity between texts.
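The word-counting step can be illustrated with built-in commands alone. This is only a sketch of the idea, not the Wordscores commands themselves (their syntax is not shown here); the two sentences and the variable names (line, word, id, position) are assumptions for illustration.

```stata
clear
input str60 line
"the cat sat on the mat"
"the dog sat"
end

* Split each line on spaces into word1, word2, ...
split line, generate(word)

* Reshape to one observation per (line, word position)
generate id = _n
reshape long word, i(id) j(position)
drop if word == ""

* Collapse to one observation per distinct word; _freq holds its frequency
contract word
gsort -_freq word
list
```

The result is a frequency table with the most common words first, which is the kind of input a text-comparison method would start from.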


 * Wordscores page