Hatrack River Writers Workshop » Open Discussions About Writing

Topic: Creativity = structured randomness?
trousercuit posted:
So I'm taking a "natural language processing" course this semester, and we've just gone over n-gram language models. Because I'm seriously twisted, I decided to turn it into a right monster...

I've always had a hard time coming up with names. It occurred to me that I could have my computer do it using n-gram models. Here's some output from a program I hacked out in the last couple of hours:

quote:
$ ./ngram.py 5 25 firstnames.txt surnames.txt
Americk
Gisell
Destina
Marlo
Piercedes
Zanderson
Fernard
Mollin
Madelinda
Bryana
Lelani
Brenne
Felica
Gianca
Arace
Brenderson
Cynthias
Malakaila
Maximillan
Khalia
Hillar
Charley
Jamiro
Melandon
Belena
181 duplicates

"firstnames.txt" and "surnames.txt" are the training set - just files full of names. This is a 5-gram model. The output is 25 names where every 5-letter sequence is guaranteed to have the same probability as every 5-letter sequence in the training set. It does generate names in the training set (the 181 duplicates), but it doesn't display those.

"Piercedes" cracks me up. I'm going to use that one.


trousercuit posted:
If anybody else wants to play around with this, I've finally got it solid. Go here:

http://axon.cs.byu.edu/~neil/ngram/

Download all the files. It's a command-line program, so there's no friendly GUI. You'll need Python installed; it comes standard on Linux and is easy to install on Windows:

http://www.python.org/download/


Kolona posted:
I don't know what you're talking about, but I really like some of those. I'm not sure I'd want to call a character "181 duplicates," though. (Just kidding.) I am kind of partial to "Lelani."
trousercuit posted:
My daughter liked that one, too.

In more intuitive terms, it generates random words that sound kind of like the words you feed it. On top of that, there are many subtleties of its behavior that you'd only understand or even care about if you'd studied temporal probabilistic models.

Ahem. My inner nerd is showing through. Excuse me while I cover it up. Don't peek, now.

I fed it Alice in Wonderland and it churned out these delightful words:

quote:
aroundrence
Serpill
seatered
bream
forgetter
Turt
sudded
existerpill
flamping
musion
expent
onely
greathinkling
tillion
hisk
witnestillarm
sobster

And this was fun:

quote:
./ngram.py 5 10 slavic_surnames.txt japanese_surnames.txt
Nakanishida
Goncharoff
Shimaru
Umenov
Tashi
Hayashimamoto
Vernacki
Asakurai
Shige
Solovkin

I'll never hurt for a character name again. Yay computers!


MarkJCherry posted:
Very interesting.
mikemunsil posted:
that is very cool
franc li posted:
If I ever were hurting for character names, I think I'd go with the list of "Charter Members of the official Lord of the Rings Fan Club" at the end of the ROTK extended DVD.

So have you done swear words yet?

Doesn't Luke whine to Uncle Owen about going into Tashi station to pick up some power converters?



Lynda posted:
"friendly GUI"??? I'm lost, but I like the names you generated, and the words, too! So I'm gonna download what you said and hope a non-techie like me can figure it out. . . . Can you give us instructions in plain English if I need them??? Thanks for sharing!

Lynda


EricJamesStone posted:
Fascinating tool. I can tell I'm going to be using this.

A suggestion: make it eliminate duplicates from the list it generates. When working with a limited data set, such as a list of the countries of the world, it generates quite a few duplicates:

Finlands
Island
Britania
Ukrain
Britania
Island
Swazil
Braziland
Armenistan
Senegro
Vatic
Colomon
Turkmenia
Angolia
Island
Turkmenia
Angolia
Turkmenia
Myanmark
Austral
Senegro
Armenistan
Nicaraguay
Myanmark
Belarussalam


EricJamesStone posted:
If anyone wants to try it out, I've created a rudimentary web interface: http://randomplots.com/cgi-bin/ngram/ngram.pl

(Edited to put URL of new version.)



trousercuit posted:
Dang, Eric. That's cool.

One problem with removing duplicates as they're generated is that it may run indefinitely on some inputs: with a 7-gram model, suppressing duplicates within its own output generally makes it keep trying the same names over and over. I'll probably just have it uniquify (that's a proper computer-sciencey term, I swear) the list before it spits it out.
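In Python that's a tiny helper - a sketch of the idea, not necessarily what'll end up in ngram.py:

code:
from typing import Iterable, List

def uniquify(items: Iterable[str]) -> List[str]:
    # Drop duplicates while preserving first-seen order.
    # (set.add() returns None, so the "or" clause records
    # the item without affecting the filter condition.)
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

print(uniquify(["Island", "Britania", "Island"]))  # ['Island', 'Britania']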

I'll also add weights. Right now, if you feed it a list of 100 Slavic names and 1000 French names, it'll generate mostly French-sounding names. I'll have it pre-normalize the weights too, so it'll do what you expect without having to adjust the weights on your own. I've wanted to do that for a while now.

I'll post here again when the changes are up.

On your web page, it might make sense to call the first number the "gibberish factor" and limit it to the range 1 to 6. 1 gives a unigram model, which just selects characters without regard to what came before; at higher n, the model looks at the preceding n-1 characters to figure out what to output next. The higher the number, the more the generated words resemble real ones.

Heh. I might also do this with sentences. Imagine feeding it a James Joyce novel and having it spit out genuine James Joyce gems!


EricJamesStone posted:
Those sound like some great ideas.

What I'm doing with the Perl script is basically just passing the parameters to your Python script and printing the results it gets back. So it's your script that's doing all the heavy lifting; I'm just putting a front end on it.

Unfortunately, I think the current masculine and feminine name files I added are too broad, because the web page I got them from was global in scope. I'd like to offer a choice of name files split up by language of origin. That would then allow for some cool combinations.


trousercuit posted:
An idea I just had: It might help to "salt" the training algorithm with text from the language you're getting the names from. A list of names won't necessarily contain every valid phonetic sequence, especially when using 5-grams. So, for instance, you'd use a 1.0-weighted list of French given names and a 0.2-weighted Les Mis as training data.

I haven't tested this yet since I haven't got weights implemented (working on it!), but it might help produce more variety when using higher-order models.


oliverhouse posted:
Okay, that's just really, really cool.

Eric, any chance you can modify the CGI to accept a list of words or names and use those instead of yours?



EricJamesStone posted:
> Eric, any chance you can modify the CGI to accept a list of
> words or names and use those instead of yours?

Hmm. The Python script only accepts filenames as parameters for word lists.

But it might be possible for me to create a temporary file with the submitted word list, pass the filename to the Python script, and then delete the file after the results are returned (so as to not clutter the server with custom word list files).
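In Python terms, the plumbing would look something like this. (My actual front end is Perl; this is just a sketch of the idea, and run_ngram is a hypothetical helper, written for modern Python.)

code:
import os
import subprocess
import tempfile

def run_ngram(words, n=5, count=25):
    # Write the submitted word list to a temporary file...
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(words))
        # ...hand the filename to the generator script...
        result = subprocess.run(
            ["./ngram.py", str(n), str(count), path],
            capture_output=True, text=True, check=True)
        return result.stdout.splitlines()
    finally:
        # ...and delete the file so custom word lists
        # don't pile up on the server.
        os.remove(path)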


trousercuit posted:
I've been playing with generating words from a list of synonyms for a word I'd like to invent a new synonym for, so a text box like that could be very useful. (I'm also salting the training set with about 0.1 of Alice in Wonderland, which seems to work well. For "stupid": "lackward," "backwitted," "ho-humpish," "grosaic," "scatty.")

It's been ages since I did shell scripting, but I think it wouldn't be too hard to do that with a here-document. You'd probably want to put a character limit on it based on the maximum command-line length, if there is one.

The new version is coming! I just need to add a command-line switch for words vs. sentences... but I've got some homework to do first.


EricJamesStone posted:
OK, a slightly less rudimentary version is now available here: http://randomplots.com/cgi-bin/ngram/ngram.pl

It allows custom word lists.


trousercuit posted:
I've updated the ngram model and added a bunch of text files:

http://axon.cs.byu.edu/~neil/ngram/

It'll do sentences with the "-s" switch, but it's really for entertainment value more than anything. Because of data sparsity (that whole "almost all sentences are unique" thing), it doesn't make much sense to go above a 3-gram with sentences. It also doesn't parse sentences very well in the first place.

Alice in Wonderland, 3-gram model:

quote:
`There's certainly too much pepper in that case I can find them.' As she said this, she came upon a heap of sticks and dry leaves, and the arm that was trickling down his cheeks, he went on, `I must be a great interest in questions of eating and drinking.

So if you've got an insane character and you need some dialogue, this'll fix you right up.

You can now put an integer or decimal weight after each input file. It pre-normalizes the weights, too: if you give it "scottish_surnames.txt japanese_surnames.txt" both Scottish and Japanese names will be well-represented in the output regardless of how many names each file contains. I've been doing things like this:

quote:
./ngram.py 5 10 synonyms_ugly.txt alice.txt 0.1
cantageous
mephitical
blemistic
fractionable
fraughter
debase
unwelcomely
snappetizing
blotten
dangersometimes
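The pre-normalization itself boils down to scaling each file's counts by weight divided by corpus size, so a big list can't drown out a small one. A rough sketch of the idea (not the literal code in ngram.py; weighted_ngram_counts is a made-up name):

code:
from collections import defaultdict

def weighted_ngram_counts(files_and_weights, n=5):
    # Accumulate fractional counts: each file contributes in
    # proportion to its weight, not to how many names it contains.
    counts = defaultdict(float)
    for path, weight in files_and_weights:
        with open(path) as f:
            names = f.read().split()
        per_name = weight / len(names)
        for name in names:
            tokens = ["<S>"] * (n - 1) + list(name) + ["</S>"]
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += per_name
    return counts

# Both lists end up equally represented, whatever their sizes:
# weighted_ngram_counts([("scottish_surnames.txt", 1.0),
#                        ("japanese_surnames.txt", 1.0)])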


oliverhouse posted:
I _love_ it.

Mix Scottish names with physics words:

aninstry
Dunderven
Cossic
Scotinum

Use them all in a sentence:

"Craig Dunderven, the Cossic expert in aninstry, discovered Scotinum -- in his closet, where his physicist wife had stored it."


starsin posted:
Call me late to the party and technologically challenged, but I can't figure out where and how to download this thing... >_<
starsin posted:
Sorry for the double post... I figured out how to download it; now I can't figure out how to use it... dangit.
oliverhouse posted:
You can quickly use the Web interface if you want. Pretty rudimentary, but it works.

http://randomplots.com/cgi-bin/ngram/ngram.pl


wbriggs posted:
I know (somewhat) what an ngram is in stats, but... what is this program? How does it do what it does? I'm an AI type myself, but not NLP specifically.
trousercuit posted:
I'll see if I can remember... I ended up dropping the course because of time constraints, so none of this is fresh.

We'll call a list of sequences of tokens a "corpus." (In this case, the tokens are letters.) An n-gram model is nothing more than an (n-1)th-order Markov model that you induce ("train") from the corpus.

Let's say you're doing trigrams. That means you need a probability distribution P(token[i] | token[i-1], token[i-2]). To "train," you initialize a counter for each 3-token combination, then loop over each sequence a token at a time, incrementing the counter for every 3-token window you see. (You pad the beginning of each sequence with two start tokens, '<S>', and append a stop token, '</S>'.) Once you've got the counts, you normalize them to get P(token[i] | token[i-1], token[i-2]).

The function SimpleNGramModel.train() does this part. It also does all the lower-order n-grams as well.
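From memory, the trigram-counting part looks roughly like this in Python (a sketch, not the actual SimpleNGramModel.train()):

code:
from collections import defaultdict

def train_trigrams(corpus):
    counts = defaultdict(lambda: defaultdict(int))
    for word in corpus:
        # Two start tokens in front, one stop token at the end.
        tokens = ["<S>", "<S>"] + list(word) + ["</S>"]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            counts[(a, b)][c] += 1
    # Normalize each 2-token context into P(token[i] | token[i-1], token[i-2]).
    return {ctx: {tok: k / sum(nxt.values()) for tok, k in nxt.items()}
            for ctx, nxt in counts.items()}

model = train_trigrams(["alice", "alina"])
# model[("a", "l")] -> {'i': 1.0}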

So you've got probability distributions over tokens. Now what? Well, you can generate random token sequences and calculate how probable they are (the product of the probabilities of each token given the two before), but that could take a while. Instead, you do this:

1. Start with <S><S>, let i = 2.

2. Pick a token from P(Token | token[i-1], token[i-2]). If this is zero for all tokens, pick from P(Token | token[i-1]). If this is zero for all tokens, pick from P(Token). Assign the pick to token[i].

3. If token[i] == </S>, stop. Otherwise, i++, goto 2.

The generate() function does this part. The rest of the script just parses files and makes calls to generate().
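In sketch form, with tri, bi, and uni standing for the normalized trigram, bigram, and unigram distributions from training (contexts as tuples, uni as a plain {token: probability} dict - again, not the literal generate()):

code:
import random

def generate(tri, bi, uni):
    out = ["<S>", "<S>"]
    while True:
        # Back off: trigram context first, then bigram, then unigram.
        dist = tri.get((out[-2], out[-1])) or bi.get((out[-1],)) or uni
        tok = random.choices(list(dist), weights=list(dist.values()))[0]
        if tok == "</S>":
            return "".join(out[2:])  # strip the start padding
        out.append(tok)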


