Software Engineering

About the shaney program

In 1998 and 1999, students of Software Engineering were asked to write a program called shaney, which generates random text using probabilities derived from existing text. The program uses Markov chains (random sequences in which each element depends on the elements which have gone before) to determine the probability that a given phrase is followed by a certain word. Here is how it works:

The input text is a sequence of words, for example,

Fear leads to anger, anger leads to hate, hate leads to suffering.

Shaney reads this input and builds a data structure showing that, for example, the two words "leads to" are followed 1/3 of the time by "anger," 1/3 of the time by "hate," and 1/3 of the time by "suffering." (For the purposes of this program, words are separated by whitespace, so punctuation is part of a word, and case is preserved: "anger," is not the same word as "anger", and "Fear" is not the same word as "fear".)

For every sequence of two words in the input, the data structure remembers which single words follow those two, and how likely each of those combinations is.
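To make the counting concrete, here is a minimal sketch which splits the example text into words and counts how often a given word follows a given pair. It scans an array rather than using the data structure in shaney.c, and the names split_words, count_follower, MAX_WORDS and MAX_LEN are invented for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 64   /* arbitrary limits chosen for this sketch */
#define MAX_LEN   32

/* Split text on whitespace into words[]; returns the number of words stored.
   Punctuation and case are kept, so "anger," and "anger" stay distinct. */
static int split_words(const char *text, char words[][MAX_LEN], int max_words)
{
    int n = 0;
    int used = 0;

    assert(text != NULL);
    /* %31s skips whitespace, then reads one word of at most 31 characters;
       %n stores how many characters of text this call consumed. */
    while (n < max_words && sscanf(text, "%31s%n", words[n], &used) == 1) {
        text += used;
        n++;
    }
    return n;
}

/* Count how often next follows the pair (first, second) in words[0..n-1].
   *total receives how often the pair is followed by any word at all. */
static int count_follower(char words[][MAX_LEN], int n,
                          const char *first, const char *second,
                          const char *next, int *total)
{
    int hits = 0;
    int i;

    *total = 0;
    for (i = 0; i + 2 < n; i++) {
        if (strcmp(words[i], first) == 0 && strcmp(words[i + 1], second) == 0) {
            (*total)++;
            if (strcmp(words[i + 2], next) == 0)
                hits++;
        }
    }
    return hits;
}
```

For the example input, counting "hate," after the pair "leads" and "to" gives 1 hit out of a total of 3, which is the 1/3 described above.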

The first two words output by shaney will be the first two words of input, in this case "Fear leads". From there, the program uses random numbers to choose the next word. In this case, the only possible word which can follow is "to", since that is the only word in the input which occurs after the two words "Fear leads".

Shaney then advances by one word, and considers what might follow "leads to", for which there are 3 possibilities. The program uses a random number to choose one word, and continues. After some arbitrary number of steps it will output "leads to suffering.", and since there is no word following the sequence "to suffering." the program stops. The output for this input might therefore be:

Fear leads to hate, hate leads to anger, anger leads to anger, anger leads to hate, hate leads to suffering.
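The walk described above can be sketched as follows. This is an illustration under simplifying assumptions, not the code in shaney.c: it rescans the input for each step instead of consulting a prebuilt table, and the names generate and pair_at are invented here. Picking uniformly among the occurrences of a pair gives each following word a probability equal to its frequency.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Does the pair (a, b) occur at words[i], words[i + 1]? */
static int pair_at(const char *words[], int i, const char *a, const char *b)
{
    return strcmp(words[i], a) == 0 && strcmp(words[i + 1], b) == 0;
}

/* Fill out[] with at most max_out generated words (pointers into words[]).
   Returns the number of words produced. */
static int generate(const char *words[], int n, const char *out[], int max_out)
{
    int len = 2;

    assert(words != NULL && out != NULL);
    if (n < 2 || max_out < 2)
        return 0;
    out[0] = words[0];              /* output starts with the first two */
    out[1] = words[1];              /* words of the input               */

    while (len < max_out) {
        const char *a = out[len - 2];
        const char *b = out[len - 1];
        int candidates = 0;
        int pick;
        int i;

        /* Pass 1: count the places where the pair has a following word. */
        for (i = 0; i + 2 < n; i++)
            if (pair_at(words, i, a, b))
                candidates++;
        if (candidates == 0)
            break;                  /* nothing follows this pair: stop */

        /* Pass 2: take the follower at a randomly chosen occurrence. */
        pick = rand() % candidates;
        for (i = 0; i + 2 < n; i++) {
            if (pair_at(words, i, a, b) && pick-- == 0) {
                out[len++] = words[i + 2];
                break;
            }
        }
    }
    return len;
}
```

With the twelve words of the example as input, the first three output words are always "Fear", "leads" and "to"; after that the result varies from run to run.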

By this algorithm, every three-word sequence in the output must have appeared in the input, but four-word and longer sequences will not necessarily have appeared in the input.

Supplied Code

In the directory ~cs3/se/style/ you will find a Makefile and a number of source code files. The files zutil.c and zutil.h are a set of utilities which check memory allocation, and the file shaney.c contains the actual program. The textwrap.c file contains a program which places words on to a single line, wrapping at 80 characters.

The program compiles by typing make shaney, which produces an executable called shaney. This program reads one or more input files specified as command-line parameters and prints to standard output. A numerical parameter can be specified to tell the program the maximum number of words to print. The program can be used in this way:

	shaney [-words 1000] filename1 [filename2] [...] 

The square brackets mean that a parameter is optional. If the -words option is specified, it must be followed by a positive integer giving the maximum number of words to output. One or more filenames are then specified, from which words are read to build the probability data structure.
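One way such a command line could be handled is sketched below. The function name parse_args and its return convention are assumptions made for this illustration, and the supplied shaney.c may parse its parameters differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the index in argv of the first filename,
   or -1 if the command line is malformed. If "-words N" is present,
   *max_words is set to N; otherwise it is left unchanged. */
static int parse_args(int argc, char *argv[], long *max_words)
{
    int i = 1;                      /* argv[0] is the program name */

    assert(argv != NULL && max_words != NULL);
    if (i < argc && strcmp(argv[i], "-words") == 0) {
        char *end;
        long value;

        if (i + 1 >= argc)
            return -1;              /* -words with no number after it */
        errno = 0;
        value = strtol(argv[i + 1], &end, 10);
        /* Reject overflow, empty input, trailing junk and non-positive values. */
        if (errno != 0 || end == argv[i + 1] || *end != '\0' || value <= 0)
            return -1;
        *max_words = value;
        i += 2;
    }
    if (i >= argc)
        return -1;                  /* at least one filename is required */
    return i;
}
```

For the arguments shaney -words 1000 a.txt b.txt this sketch returns 3 and sets the limit to 1000; the filenames are then argv[3] up to argv[argc - 1].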

The output of the program can be piped through the supplied textwrap program to make it human-readable.

Integer random numbers are used to choose the next word at each stage in the algorithm. In stdlib.h are the functions rand and srand, and by including time.h we can write srand(time(NULL)) at the start of the program so that rand produces a different sequence of numbers each time the program is run. (How would you test a program which produces random output?)
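As an illustration of turning an integer from rand into a choice of word, the sketch below maps a number onto a list of counts, so that a word seen twice is twice as likely to be chosen as a word seen once. The names pick_weighted and pick_random are invented here, and rand() % total is slightly biased for large totals, which is usually acceptable for a program like this.

```c
#include <assert.h>
#include <stdlib.h>

/* Map r, assumed to lie in [0, sum of counts), to an index into counts[].
   Index i "owns" counts[i] consecutive values of r. */
static int pick_weighted(const int counts[], int n, int r)
{
    int i;

    for (i = 0; i < n; i++) {
        if (r < counts[i])
            return i;
        r -= counts[i];
    }
    return n - 1;   /* only reached if r was out of range */
}

/* Choose an index at random, weighted by counts[], using rand(). */
static int pick_random(const int counts[], int n)
{
    int total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += counts[i];
    assert(total > 0);
    return pick_weighted(counts, n, rand() % total);
}
```

Because pick_weighted takes the number as a parameter, it can be called with chosen values of r and always gives the same answer for the same input; only pick_random depends on the seed given to srand.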

The design of the data structure shapes the program. It should be easy to build incrementally as new words are read from the input, and easy and fast to use when generating the output. For this reason a hash table was used.
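To give a feel for a hash table keyed on a pair of words, here is a small sketch using chained buckets. It is not the structure in shaney.c, which you should study for yourself: the names struct entry, hash_pair, find_or_add and NBUCKETS are invented for this illustration, it calls malloc directly rather than going through the zutil utilities, and it stores only a count where a real table would also record the following words.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 101            /* arbitrary table size for this sketch */

struct entry {
    char *first, *second;       /* the two-word key */
    int count;                  /* how often the pair has been seen */
    struct entry *next;         /* next entry in the same bucket */
};

/* Hash the two words as if they were one string joined by a space.
   Words never contain whitespace, so the joined string is unambiguous. */
static unsigned long hash_pair(const char *first, const char *second)
{
    unsigned long h = 5381;

    for (; *first != '\0'; first++)
        h = h * 33 + (unsigned char)*first;
    h = h * 33 + ' ';
    for (; *second != '\0'; second++)
        h = h * 33 + (unsigned char)*second;
    return h;
}

static char *copy_string(const char *s)
{
    char *p = malloc(strlen(s) + 1);

    if (p != NULL)
        strcpy(p, s);
    return p;
}

/* Return the entry for (first, second), creating it with count 0 if it is
   not yet in the table. Returns NULL only if memory runs out. */
static struct entry *find_or_add(struct entry *table[],
                                 const char *first, const char *second)
{
    unsigned long b;
    struct entry *e;

    assert(first != NULL && second != NULL);
    b = hash_pair(first, second) % NBUCKETS;
    for (e = table[b]; e != NULL; e = e->next)
        if (strcmp(e->first, first) == 0 && strcmp(e->second, second) == 0)
            return e;

    e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->first = copy_string(first);
    e->second = copy_string(second);
    if (e->first == NULL || e->second == NULL) {
        free(e->first);
        free(e->second);
        free(e);
        return NULL;
    }
    e->count = 0;
    e->next = table[b];         /* push onto the front of the bucket */
    table[b] = e;
    return e;
}
```

The table can be built incrementally: for each pair read from the input, call find_or_add and update the entry it returns. Looking up a pair while generating output uses the same function, and on average examines only the few entries which share a bucket.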

Study the source code and draw a diagram of how the data structure works. Understanding this program will help you in designing the other data structures you will need in later assignments.

The name shaney comes from the term Markov chain. The program has been used to post fake messages to Usenet newsgroups under the name "Mark V. Shaney", and if you search the internet for that name you'll probably find a few instances of this fictional person.