This is topic Creativity = structured randomness? in forum Open Discussions About Writing at Hatrack River Writers Workshop.


To visit this topic, use this URL:
http://www.hatrack.com/ubb/writers/ultimatebb.php?ubb=get_topic;f=1;t=003387

Posted by trousercuit (Member # 3235) on :
 
So I'm taking a "natural language processing" course this semester, and we've just gone over n-gram language models. Because I'm seriously twisted, I decided to turn it into a right monster...

I've always had a hard time coming up with names. It occurred to me that I could have my computer do it using n-gram models. Here's some output from a program I hacked out in the last couple of hours:

quote:
$ ./ngram.py 5 25 firstnames.txt surnames.txt
Americk
Gisell
Destina
Marlo
Piercedes
Zanderson
Fernard
Mollin
Madelinda
Bryana
Lelani
Brenne
Felica
Gianca
Arace
Brenderson
Cynthias
Malakaila
Maximillan
Khalia
Hillar
Charley
Jamiro
Melandon
Belena
181 duplicates

"firstnames.txt" and "surnames.txt" are the training set - just files full of names. This is a 5-gram model. The output is 25 names where every 5-letter sequence is guaranteed to have the same probability as every 5-letter sequence in the training set. It does generate names in the training set (the 181 duplicates), but it doesn't display those.

"Piercedes" cracks me up. I'm going to use that one.
 


Posted by trousercuit (Member # 3235) on :
 
If anybody else wants to play around with this, I've finally got it solid. Go here:

http://axon.cs.byu.edu/~neil/ngram/

Download all the files. It's a command-line program, so there's no friendly GUI. You'll need Python installed, which comes standard on Linux and is easy to install on Windows:

http://www.python.org/download/
 


Posted by Kolona (Member # 1438) on :
 
I don't know what you're talking about, but I really like some of those. I'm not sure I'd want to call a character "181 duplicates," though. ( Just kidding) I am kind of partial to "Lelani."
 
Posted by trousercuit (Member # 3235) on :
 
My daughter liked that one, too.

In more intuitive terms, it generates random words that sound kind of like the words you feed it. On top of that, there are many subtleties of its behavior that you'd only understand or even care about if you'd studied temporal probabilistic models.

Ahem. My inner nerd is showing through. Excuse me while I cover it up. Don't peek, now.

I fed it Alice in Wonderland and it churned out these delightful words:

quote:
aroundrence
Serpill
seatered
bream
forgetter
Turt
sudded
existerpill
flamping
musion
expent
onely
greathinkling
tillion
hisk
witnestillarm
sobster

And this was fun:

quote:
./ngram.py 5 10 slavic_surnames.txt japanese_surnames.txt
Nakanishida
Goncharoff
Shimaru
Umenov
Tashi
Hayashimamoto
Vernacki
Asakurai
Shige
Solovkin

I'll never hurt for a character name again. Yay computers!
 


Posted by MarkJCherry (Member # 3510) on :
 
Very interesting.
 
Posted by mikemunsil (Member # 2109) on :
 
that is very cool
 
Posted by franc li (Member # 3850) on :
 
If I ever were hurting for character names, I think I'd go with the list of "Charter Members of the official Lord of the Rings Fan Club" at the end of the ROTK extended DVD.

So have you done swear words yet?

Doesn't Luke whine to Uncle Owen about going into Tashi station to pick up some power converters?

[This message has been edited by franc li (edited October 18, 2006).]
 


Posted by Lynda (Member # 3574) on :
 
"friendly GUI"??? I'm lost, but I like the names you generated, and the words, too! So I'm gonna download what you said and hope a non-techie like me can figure it out. . . . Can you give us instructions in plain English if I need them??? Thanks for sharing!

Lynda
 


Posted by EricJamesStone (Member # 1681) on :
 
Fascinating tool. I can tell I'm going to be using this.

A suggestion: make it eliminate duplicates from the list it generates. When working with a limited data set, such as a list of the countries of the world, it generates quite a few duplicates:

Finlands
Island
Britania
Ukrain
Britania
Island
Swazil
Braziland
Armenistan
Senegro
Vatic
Colomon
Turkmenia
Angolia
Island
Turkmenia
Angolia
Turkmenia
Myanmark
Austral
Senegro
Armenistan
Nicaraguay
Myanmark
Belarussalam


 


Posted by EricJamesStone (Member # 1681) on :
 
If anyone wants to try it out, I've created a rudimentary web interface: http://randomplots.com/cgi-bin/ngram/ngram.pl

(Edited to put URL of new version.)

[This message has been edited by EricJamesStone (edited October 19, 2006).]
 


Posted by trousercuit (Member # 3235) on :
 
Dang, Eric. That's cool.

One problem with removing duplicates from the list is that it may run indefinitely on some inputs. If you make it generate character names with a 7-gram model and suppress duplicates against its own generated list, it'll generally keep trying the same names over and over. I'll probably just have it uniquify (that's a proper computer-sciencey term, I swear) the list before it spits it out.
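The uniquifying step itself is trivial - something like this sketch (hypothetical names, not the actual variables in ngram.py): generate the list, then drop repeats while keeping the original order before printing.

quote:
# Sketch of "uniquify before printing" -- names here are hypothetical,
# not the actual variables in ngram.py.
def uniquify(words):
    """Drop repeats, keeping the order the words were generated in."""
    seen = set()
    unique = []
    for w in words:
        if w not in seen:
            seen.add(w)
            unique.append(w)
    return unique

generated = ["Island", "Britania", "Island", "Turkmenia", "Britania"]
for name in uniquify(generated):
    print(name)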

I'll also add weights. Right now, if you feed it a list of 100 Slavic names and 1000 French names, it'll generate mostly French-sounding names. I'll have it pre-normalize the weights too, so it'll do what you expect without having to adjust the weights on your own. I've wanted to do that for a while now.
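The pre-normalization amounts to scaling each file's weight by how many names it contains, so a 100-name file pulls as hard as a 1000-name one unless you say otherwise. Roughly (just a sketch, not the actual code):

quote:
# Sketch of pre-normalizing per-file weights -- not the actual ngram.py code.
def normalize_weights(files):
    """files: list of (names, user_weight) pairs. Returns per-name weights
    so each file contributes equally overall, scaled by user_weight."""
    result = []
    for names, user_weight in files:
        per_name = user_weight / len(names)  # small files count as much as big ones
        result.append((names, per_name))
    return result

slavic = ["Ivanov"] * 100    # stand-in for 100 Slavic names
french = ["Dubois"] * 1000   # stand-in for 1000 French names
for names, w in normalize_weights([(slavic, 1.0), (french, 1.0)]):
    print(len(names), "names, per-name weight:", round(w, 4))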

I'll post here again when the changes are up.

On your web page, it might make sense to call the first number the "gibberish factor," and limit it from 1 to 6. A value of 1 gives a unigram model, which just selects characters without regard to what came before. At higher values of n, it looks at the preceding n-1 characters to figure out what to output next. The higher the number, the more the words resemble actual words.

Heh. I might also do this with sentences. Imagine feeding it a James Joyce novel and having it spit out genuine James Joyce gems!
 


Posted by EricJamesStone (Member # 1681) on :
 
Those sound like some great ideas.

What I'm doing with the Perl script is basically just passing the parameters to your Python script and printing the results it gets back. So it's your script that's doing all the heavy lifting; I'm just putting a front end on it.

Unfortunately, I think the current masculine and feminine name files I added are too broad, because the web page I got them from was global in scope. I'd like to offer a choice of name files split up by language of origin. That would then allow for some cool combinations.
 


Posted by trousercuit (Member # 3235) on :
 
An idea I just had: It might help to "salt" the training algorithm with text from the language you're getting the names from. A list of names won't necessarily contain every valid phonetic sequence, especially when using 5-grams. So, for instance, you'd use a 1.0-weighted list of French given names and a 0.2-weighted Les Mis as training data.

I haven't tested this yet since I haven't got weights implemented (working on it!), but it might help produce more variety when using higher-order models.
 


Posted by oliverhouse (Member # 3432) on :
 
Okay, that's just really, really cool.

Eric, any chance you can modify the CGI to accept a list of words or names and use those instead of yours?

[This message has been edited by oliverhouse (edited October 19, 2006).]
 


Posted by EricJamesStone (Member # 1681) on :
 
> Eric, any chance you can modify the CGI to accept a list of
> words or names and use those instead of yours?

Hmm. The Python script only accepts filenames as parameters for word lists.

But it might be possible for me to create a temporary file with the submitted word list, pass the filename to the Python script, and then delete the file after the results are returned (so as to not clutter the server with custom word list files).
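Sketched in Python rather than Perl, just to show the idea (the filenames and arguments here are made up), it would look something like:

quote:
# Sketch of the temp-file idea (in Python, though the actual CGI is Perl).
# Paths and arguments here are assumptions, not the real setup.
import os
import subprocess
import tempfile

def run_ngram(word_list_text, n=5, count=25):
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(word_list_text)
        result = subprocess.run(
            ["./ngram.py", str(n), str(count), path],
            capture_output=True, text=True,
        )
        return result.stdout
    finally:
        os.remove(path)  # don't clutter the server with custom word lists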
 


Posted by trousercuit (Member # 3235) on :
 
I've been playing with generating words from a list of synonyms for a word I'd like to invent a new synonym for, so a text box like that could be very useful. (I'm also salting the training set with about 0.1 of Alice in Wonderland, which seems to work well. For "stupid": "lackward," "backwitted," "ho-humpish," "grosaic," "scatty.") It's been ages since I did shell scripting, but I think it wouldn't be too hard to do that with a here-document. Probably want to put a character limit on it based on the maximum command-line length if there is one.

The new version is coming! I just need to add a command-line switch for words vs. sentences... but I've got some homework to do first.
 


Posted by EricJamesStone (Member # 1681) on :
 
OK, a slightly less rudimentary version is now available here: http://randomplots.com/cgi-bin/ngram/ngram.pl

It allows custom word lists.
 


Posted by trousercuit (Member # 3235) on :
 
I've updated the ngram model and added a bunch of text files:

http://axon.cs.byu.edu/~neil/ngram/

It'll do sentences with the "-s" switch, but it's really for entertainment value more than anything. Because of data sparsity (that whole "almost all sentences are unique" thing), it doesn't make much sense to go above a 3-gram with sentences. It also doesn't parse sentences very well in the first place.

Alice in Wonderland, 3-gram model:

quote:
`There's certainly too much pepper in that case I can find them.' As she said this, she came upon a heap of sticks and dry leaves, and the arm that was trickling down his cheeks, he went on, `I must be a great interest in questions of eating and drinking.

So if you've got an insane character and you need some dialogue, this'll fix you right up.

You can now put an integer or decimal weight after each input file. It pre-normalizes the weights, too: if you give it "scottish_surnames.txt japanese_surnames.txt" both Scottish and Japanese names will be well-represented in the output regardless of how many names each file contains. I've been doing things like this:

quote:
./ngram.py 5 10 synonyms_ugly.txt alice.txt 0.1
cantageous
mephitical
blemistic
fractionable
fraughter
debase
unwelcomely
snappetizing
blotten
dangersometimes


 
Posted by oliverhouse (Member # 3432) on :
 
I _love_ it.

Mix Scottish names with physics words:

aninstry
Dunderven
Cossic
Scotinum

Use them all in a sentence:

"Craig Dunderven, the Cossic expert in aninstry, discovered Scotinum -- in his closet, where his physicist wife had stored it."


 


Posted by starsin (Member # 4081) on :
 
Call me late and technologically retarded, but I can't figure out where and how to download this thing...>_<
 
Posted by starsin (Member # 4081) on :
 
Sorry for double post...but I figured out how to download, now I can't figure out how to use...dangit.
 
Posted by oliverhouse (Member # 3432) on :
 
You can quickly use the Web interface if you want. Pretty rudimentary, but it works.

http://randomplots.com/cgi-bin/ngram/ngram.pl
 


Posted by wbriggs (Member # 2267) on :
 
I know (somewhat) what an ngram is in stats, but... what is this program? How does it do what it does? I'm an AI type myself, but not NLP specifically.
 
Posted by trousercuit (Member # 3235) on :
 
I'll see if I can remember... I ended up dropping the course because of time constraints, so none of this is fresh.

We'll call a list of sequences of tokens a "corpus." (In this case, the tokens are letters.) An n-gram model is nothing more than an (n-1)th-order Markov model that you induce ("train") from the corpus.

Let's say you're doing trigrams. That means you need a probability distribution P(token[i] | token[i-1], token[i-2]). To "train," you initialize a counter for each 3-token combination, loop over each sequence a token at a time, and increment the counter for each three-token window you see. (You pad the beginning of each sequence with two start tokens, '<S>', and append a stop token, '</S>'.) Now that you've got the counts, you can normalize them to get P(token[i] | token[i-1], token[i-2]).

The function SimpleNGramModel.train() does this part. It builds all the lower-order n-gram counts as well.
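In code, the counting-and-normalizing step looks roughly like this (a sketch, not the actual SimpleNGramModel.train(); names and details are approximations):

quote:
# Sketch of trigram training -- not the actual SimpleNGramModel code.
from collections import defaultdict

START, STOP = "<S>", "</S>"

def train_trigrams(corpus):
    """corpus: a list of token sequences (here, words; each letter is a token).
    Returns P(token | two preceding tokens) as nested dicts."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in corpus:
        padded = [START, START] + list(seq) + [STOP]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            counts[context][padded[i]] += 1
    # Normalize the counts into probabilities.
    probs = {}
    for context, token_counts in counts.items():
        total = sum(token_counts.values())
        probs[context] = {t: c / total for t, c in token_counts.items()}
    return probs

model = train_trigrams(["anna", "ann", "anne"])
print(model[("a", "n")])  # {'n': 1.0} for this toy corpus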

So you've got probability distributions over tokens. Now what? Well, you can generate random token sequences and calculate how probable they are (the product of the probabilities of each token given the two before), but that could take a while. Instead, you do this:

1. Start with <S><S>, let i = 2.

2. Pick a token from P(Token | token[i-1], token[i-2]). If this is zero for all tokens, pick from P(Token | token[i-1]). If this is zero for all tokens, pick from P(Token). Assign the pick to token[i].

3. If token[i] == </S>, stop. Otherwise, i++, goto 2.

The generate() function does this part. The rest of the script just parses files and makes calls to generate().
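In code, that loop comes out to something like this (again a sketch, not the actual generate(); it assumes trigram, bigram, and unigram tables built the way the training sketch above builds its trigram table):

quote:
# Sketch of generation with backoff -- not the actual generate().
# Assumes: trigrams[(t2, t1)] -> {token: prob}, bigrams[t1] -> {token: prob},
# and unigrams -> {token: prob}, all built like the training sketch above.
import random

START, STOP = "<S>", "</S>"

def pick(dist):
    """Draw one token from a {token: probability} dict."""
    r = random.random()
    cumulative = 0.0
    for token, p in dist.items():
        cumulative += p
        if r <= cumulative:
            return token
    return token  # guard against floating-point round-off

def generate(trigrams, bigrams, unigrams):
    tokens = [START, START]
    while True:
        # Step 2: back off to lower-order models if the context is unseen.
        dist = (trigrams.get((tokens[-2], tokens[-1]))
                or bigrams.get(tokens[-1])
                or unigrams)
        tok = pick(dist)
        if tok == STOP:  # step 3: the stop token ends the word
            break
        tokens.append(tok)
    return "".join(tokens[2:])  # drop the two start tokens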

[This message has been edited by trousercuit (edited December 15, 2006).]
 




Copyright © 2008 Hatrack River Enterprises Inc. All rights reserved.
Reproduction in whole or in part without permission is prohibited.


Powered by Infopop Corporation
UBB.classic™ 6.7.2