Hatrack River Writers Workshop » Open Discussions About Writing

Topic: Creativity = structured randomness?
trousercuit posted:
So I'm taking a "natural language processing" course this semester, and we've just gone over n-gram language models. Because I'm seriously twisted, I decided to turn it into a right monster...

I've always had a hard time coming up with names. It occurred to me that I could have my computer do it using n-gram models. Here's some output from a program I hacked out in the last couple of hours:

quote:
$ ./ngram.py 5 25 firstnames.txt surnames.txt
Americk
Gisell
Destina
Marlo
Piercedes
Zanderson
Fernard
Mollin
Madelinda
Bryana
Lelani
Brenne
Felica
Gianca
Arace
Brenderson
Cynthias
Malakaila
Maximillan
Khalia
Hillar
Charley
Jamiro
Melandon
Belena
181 duplicates

"firstnames.txt" and "surnames.txt" are the training set - just files full of names. This is a 5-gram model. The output is 25 names where every 5-letter sequence is guaranteed to have the same probability as every 5-letter sequence in the training set. It does generate names in the training set (the 181 duplicates), but it doesn't display those.

"Piercedes" cracks me up. I'm going to use that one.


trousercuit posted:
If anybody else wants to play around with this, I've finally got it solid. Go here:

http://axon.cs.byu.edu/~neil/ngram/

Download all the files. It's a command-line program, so there's no friendly GUI. You'll need Python installed; it comes standard on Linux and is easy to install on Windows:

http://www.python.org/download/


Kolona posted:
I don't know what you're talking about, but I really like some of those. I'm not sure I'd want to call a character "181 duplicates," though. (Just kidding.) I am kind of partial to "Lelani."
trousercuit posted:
My daughter liked that one, too.

In more intuitive terms, it generates random words that sound kind of like the words you feed it. On top of that, there are many subtleties of its behavior that you'd only understand or even care about if you'd studied temporal probabilistic models.

Ahem. My inner nerd is showing through. Excuse me while I cover it up. Don't peek, now.

I fed it Alice in Wonderland and it churned out these delightful words:

quote:
aroundrence
Serpill
seatered
bream
forgetter
Turt
sudded
existerpill
flamping
musion
expent
onely
greathinkling
tillion
hisk
witnestillarm
sobster

And this was fun:

quote:
./ngram.py 5 10 slavic_surnames.txt japanese_surnames.txt
Nakanishida
Goncharoff
Shimaru
Umenov
Tashi
Hayashimamoto
Vernacki
Asakurai
Shige
Solovkin

I'll never hurt for a character name again. Yay computers!


MarkJCherry posted:
Very interesting.
mikemunsil posted:
that is very cool
franc li posted:
If I ever were hurting for character names, I think I'd go with the list of "Charter Members of the official Lord of the Rings Fan Club" at the end of the ROTK extended DVD.

So have you done swear words yet?

Doesn't Luke whine to Uncle Owen about going into Tashi station to pick up some power converters?



Lynda posted:
"friendly GUI"??? I'm lost, but I like the names you generated, and the words, too! So I'm gonna download what you said and hope a non-techie like me can figure it out. . . . Can you give us instructions in plain English if I need them??? Thanks for sharing!

Lynda


EricJamesStone posted:
Fascinating tool. I can tell I'm going to be using this.

A suggestion: make it eliminate duplicates from the list it generates. When working with a limited data set, such as a list of the countries of the world, it generates quite a few duplicates:

Finlands
Island
Britania
Ukrain
Britania
Island
Swazil
Braziland
Armenistan
Senegro
Vatic
Colomon
Turkmenia
Angolia
Island
Turkmenia
Angolia
Turkmenia
Myanmark
Austral
Senegro
Armenistan
Nicaraguay
Myanmark
Belarussalam


EricJamesStone posted:
If anyone wants to try it out, I've created a rudimentary web interface: http://randomplots.com/cgi-bin/ngram/ngram.pl

(Edited to put URL of new version.)



trousercuit posted:
Dang, Eric. That's cool.

One problem with removing duplicates as they're generated is that it may run indefinitely on some inputs: with a 7-gram model, suppressing duplicates within its own output generally makes it keep trying the same names over and over. I'll probably just have it uniquify (that's a proper computer-sciencey term, I swear) the list before it spits it out.
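In Python that's a tiny helper - a sketch of the idea, not necessarily what'll end up in ngram.py:

code:
from typing import Iterable, List

def uniquify(items: Iterable[str]) -> List[str]:
    # Drop duplicates while preserving first-seen order.
    # (set.add() returns None, so the "or" clause records
    # the item without affecting the filter condition.)
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

print(uniquify(["Island", "Britania", "Island"]))  # ['Island', 'Britania']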

I'll also add weights. Right now, if you feed it a list of 100 Slavic names and 1000 French names, it'll generate mostly French-sounding names. I'll have it pre-normalize the weights too, so it'll do what you expect without having to adjust the weights on your own. I've wanted to do that for a while now.

I'll post here again when the changes are up.

On your web page, it might make sense to call the first number the "gibberish factor" and limit it to the range 1 to 6. 1 gives a unigram model, which just selects characters without regard to what came before; at higher n, the model looks at the preceding n-1 characters to figure out what to output next. The higher the number, the more the generated words resemble real ones.

Heh. I might also do this with sentences. Imagine feeding it a James Joyce novel and having it spit out genuine James Joyce gems!


EricJamesStone posted:
Those sound like some great ideas.

What I'm doing with the Perl script is basically just passing the parameters to your Python script and printing the results it gets back. So it's your script that's doing all the heavy lifting; I'm just putting a front end on it.

Unfortunately, I think the current masculine and feminine name files I added are too broad, because the web page I got them from was global in scope. I'd like to offer a choice of name files split up by language of origin. That would then allow for some cool combinations.


trousercuit posted:
An idea I just had: It might help to "salt" the training algorithm with text from the language you're getting the names from. A list of names won't necessarily contain every valid phonetic sequence, especially when using 5-grams. So, for instance, you'd use a 1.0-weighted list of French given names and a 0.2-weighted Les Mis as training data.

I haven't tested this yet since I haven't got weights implemented (working on it!), but it might help produce more variety when using higher-order models.


oliverhouse posted:
Okay, that's just really, really cool.

Eric, any chance you can modify the CGI to accept a list of words or names and use those instead of yours?



EricJamesStone posted:
> Eric, any chance you can modify the CGI to accept a list of
> words or names and use those instead of yours?

Hmm. The Python script only accepts filenames as parameters for word lists.

But it might be possible for me to create a temporary file with the submitted word list, pass the filename to the Python script, and then delete the file after the results are returned (so as to not clutter the server with custom word list files).
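In Python terms, the plumbing would look something like this. (My actual front end is Perl; this is just a sketch of the idea, and run_ngram is a hypothetical helper, written for modern Python.)

code:
import os
import subprocess
import tempfile

def run_ngram(words, n=5, count=25):
    # Write the submitted word list to a temporary file...
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(words))
        # ...hand the filename to the generator script...
        result = subprocess.run(
            ["./ngram.py", str(n), str(count), path],
            capture_output=True, text=True, check=True)
        return result.stdout.splitlines()
    finally:
        # ...and delete the file so custom word lists
        # don't pile up on the server.
        os.remove(path)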


trousercuit posted:
I've been playing with generating words from a list of synonyms for a word I'd like to invent a new synonym for, so a text box like that could be very useful. (I'm also salting the training set with about 0.1 of Alice in Wonderland, which seems to work well. For "stupid": "lackward," "backwitted," "ho-humpish," "grosaic," "scatty.")

It's been ages since I did shell scripting, but I think it wouldn't be too hard to do that with a here-document. You'd probably want to put a character limit on it based on the maximum command-line length, if there is one.

The new version is coming! I just need to add a command-line switch for words vs. sentences... but I've got some homework to do first.


EricJamesStone posted:
OK, a slightly less rudimentary version is now available here: http://randomplots.com/cgi-bin/ngram/ngram.pl

It allows custom word lists.


trousercuit posted:
I've updated the ngram model and added a bunch of text files:

http://axon.cs.byu.edu/~neil/ngram/

It'll do sentences with the "-s" switch, but it's really for entertainment value more than anything. Because of data sparsity (that whole "almost all sentences are unique" thing), it doesn't make much sense to go above a 3-gram with sentences. It also doesn't parse sentences very well in the first place.

Alice in Wonderland, 3-gram model:

quote:
`There's certainly too much pepper in that case I can find them.' As she said this, she came upon a heap of sticks and dry leaves, and the arm that was trickling down his cheeks, he went on, `I must be a great interest in questions of eating and drinking.

So if you've got an insane character and you need some dialogue, this'll fix you right up.

You can now put an integer or decimal weight after each input file. It pre-normalizes the weights, too: if you give it "scottish_surnames.txt japanese_surnames.txt" both Scottish and Japanese names will be well-represented in the output regardless of how many names each file contains. I've been doing things like this:

quote:
./ngram.py 5 10 synonyms_ugly.txt alice.txt 0.1
cantageous
mephitical
blemistic
fractionable
fraughter
debase
unwelcomely
snappetizing
blotten
dangersometimes
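The pre-normalization itself boils down to scaling each file's counts by weight divided by corpus size, so a big list can't drown out a small one. A rough sketch of the idea (not the literal code in ngram.py; weighted_ngram_counts is a made-up name):

code:
from collections import defaultdict

def weighted_ngram_counts(files_and_weights, n=5):
    # Accumulate fractional counts: each file contributes in
    # proportion to its weight, not to how many names it contains.
    counts = defaultdict(float)
    for path, weight in files_and_weights:
        with open(path) as f:
            names = f.read().split()
        per_name = weight / len(names)
        for name in names:
            tokens = ["<S>"] * (n - 1) + list(name) + ["</S>"]
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += per_name
    return counts

# Both lists end up equally represented, whatever their sizes:
# weighted_ngram_counts([("scottish_surnames.txt", 1.0),
#                        ("japanese_surnames.txt", 1.0)])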


oliverhouse posted:
I _love_ it.

Mix Scottish names with physics words:

aninstry
Dunderven
Cossic
Scotinum

Use them all in a sentence:

"Craig Dunderven, the Cossic expert in aninstry, discovered Scotinum -- in his closet, where his physicist wife had stored it."


starsin posted:
Call me late to the party and technologically challenged, but I can't figure out where and how to download this thing... >_<
starsin posted:
Sorry for the double post... I figured out how to download it; now I can't figure out how to use it... dangit.
oliverhouse posted:
You can quickly use the Web interface if you want. Pretty rudimentary, but it works.

http://randomplots.com/cgi-bin/ngram/ngram.pl


wbriggs posted:
I know (somewhat) what an ngram is in stats, but... what is this program? How does it do what it does? I'm an AI type myself, but not NLP specifically.
trousercuit posted:
I'll see if I can remember... I ended up dropping the course because of time constraints, so none of this is fresh.

We'll call a list of sequences of tokens a "corpus." (In this case, the tokens are letters.) An n-gram model is nothing more than an (n-1)th-order Markov model that you induce ("train") from the corpus.

Let's say you're doing trigrams. That means you need a probability distribution P(token[i] | token[i-1], token[i-2]). To "train," you initialize a counter for each 3-token combination, then loop over each sequence a token at a time, incrementing the counter for every 3-token window you see. (You pad the beginning of each sequence with two start tokens, '<S>', and append a stop token, '</S>'.) Once you've got the counts, you normalize them to get P(token[i] | token[i-1], token[i-2]).

The function SimpleNGramModel.train() does this part. It also does all the lower-order n-grams as well.
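From memory, the trigram-counting part looks roughly like this in Python (a sketch, not the actual SimpleNGramModel.train()):

code:
from collections import defaultdict

def train_trigrams(corpus):
    counts = defaultdict(lambda: defaultdict(int))
    for word in corpus:
        # Two start tokens in front, one stop token at the end.
        tokens = ["<S>", "<S>"] + list(word) + ["</S>"]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            counts[(a, b)][c] += 1
    # Normalize each 2-token context into P(token[i] | token[i-1], token[i-2]).
    return {ctx: {tok: k / sum(nxt.values()) for tok, k in nxt.items()}
            for ctx, nxt in counts.items()}

model = train_trigrams(["alice", "alina"])
# model[("a", "l")] -> {'i': 1.0}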

So you've got probability distributions over tokens. Now what? Well, you can generate random token sequences and calculate how probable they are (the product of the probabilities of each token given the two before), but that could take a while. Instead, you do this:

1. Start with <S><S>, let i = 2.

2. Pick a token from P(Token | token[i-1], token[i-2]). If this is zero for all tokens, pick from P(Token | token[i-1]). If this is zero for all tokens, pick from P(Token). Assign the pick to token[i].

3. If token[i] == </S>, stop. Otherwise, i++, goto 2.

The generate() function does this part. The rest of the script just parses files and makes calls to generate().
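In sketch form, with tri, bi, and uni standing for the normalized trigram, bigram, and unigram distributions from training (contexts as tuples, uni as a plain {token: probability} dict - again, not the literal generate()):

code:
import random

def generate(tri, bi, uni):
    out = ["<S>", "<S>"]
    while True:
        # Back off: trigram context first, then bigram, then unigram.
        dist = tri.get((out[-2], out[-1])) or bi.get((out[-1],)) or uni
        tok = random.choices(list(dist), weights=list(dist.values()))[0]
        if tok == "</S>":
            return "".join(out[2:])  # strip the start padding
        out.append(tok)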


