Markov Model for English Text (a.k.a. automatic article/essay/blog writer)

in a nutshell: This experiment uses a Markov model to analyze text. Sounds complicated, but it really isn’t, and it has some pretty awesome results. Basically, the model records the probability of each character following some fragment of characters, then uses those probabilities to generate new text.

penny scale: The penny scale denotes the difficulty and the time commitment needed for the project. The closer the rating is to 100 cents, the more difficult or time-consuming the project is.

Difficulty: 27 cents
Time: 10 cents

results:  I made a quick Python script to read in all the articles on the New York Times politics homepage and store their content in a file. I then ran the Markov model on all that content, and here’s a snippet of what it spat out:

“To do all the things that the state hoped that the United Farm Workers: “Sí, Se Puede!”
“Who the hell is Ken Dahlberg?” President Obama’s Cairo speech, and see what the administration. “None of those things in most places comes from local control.”Ms. Cortes did not respond to the threat of Pakistan’s nuclear weapons falling into the Democratic senator and former inmates also sharply shrink government that all the Republican candidates have been most effectiveness in the crowd groaned heartily.In what they do when they have hardly discussed his own faith. Mr. Jeffress said he emphatically believed that Mr. Obama has been the historical position of perceived important end in itself. It is also the first consumer products to use transistors. He sold the company in 1994 for $139 million.Mr. Perry participated in a news conference by Mr. Perry and had no idea what he thinks.
Counties across the state has not given her consent to run.
He had been a dirty trick.

And here’s a small portion of the output using the technology page of the New York Times:

Mr. Jobs was a far more complex figure. As executive only weeks earlier.

On Tuesday, will fix 64 flaws (Computerworld) Sony explains the new features Intel® Solid-State Drive 320 Series enhanced power-loss data protection Center blog Symantec Cybercrime Frontline blog Sophos Naked Security Managers” ( Two Mexicans deny terrorism, face 30 years getting those who already feel overwhelmed by the merger of AT&T and Verizon Wireless Voice Mail help AT&T: How do I use the Problem Steps Recorder (Microsoft) Mac 101: How to tell if an older app will run on OS X Lion Mac 101: Time Machine (Apple) Radio becomes a premium feature on mobile technology exists, easy-to-use smart-home systems run from a few thousand to tens or even hundreds of thousands of people who see themselves were ads.

But Sprint also failed to offer until now.
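My actual script was nothing fancy. As a rough sketch of the fetch-and-extract step (and it is only a sketch: the assumption that article text lives in `<p>` tags is mine, and any HTML library would do), it looks something like this:

```python
# A rough sketch of the "grab the article text" step, using only the
# standard library. Assumption (mine): the article body lives in <p> tags.
# Fetch the page first with something like:
#   html = urllib.request.urlopen(url).read().decode("utf-8")
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def article_text(html):
    """Return the paragraph text of a page, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())
```

Dump the result of `article_text` to a file for each page you grab and you have your training data.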

how to do it: Alright, so how does this thing work? I’ll break it down into a few tasks.

  1. Get some text – This is pretty easy. Grab some text files of different lengths: preferably one really short one to do testing on, while the others can be as long as you want (you could even read in an entire book).
  2. Process the text – Start with the first 10 characters of the text and record the following character. Then, starting with the 2nd character, take 10 more characters and record the following character. Repeat until you are done with the data. This is best demonstrated with an example:

    Stars are very bright and the earth is where people live and is not very bright.

    Take the first 10 characters: ‘Stars are ’
    Record the following character: ‘v’

    On the second iteration:
    10 characters: ‘tars are v’ following character: ‘e’

    In this example ‘very brigh’ occurs twice, and ‘t’ has a 100% chance of following that sequence. However, ‘ery bright’ also occurs twice and has a 50% chance of being followed by a ‘.’ and a 50% chance of being followed by a ‘ ’. This is all the information you need to keep track of, so it’s probably best to store each following character in a tuple like this: (following_character, occurrences).

  3. Create new text – Now that you have all that data stored up in a nice map, it’s time to create your new text. First, find the most frequently occurring 10-character sequence (in the example above it was ‘very brigh’; you can break ties arbitrarily). Make that sequence the start of your new text, then create a new 10-character fragment by choosing the next character according to the probabilities you recorded. For example, the fragment ‘very brigh’ is always followed by a ‘t’, so your new fragment becomes ‘ery bright’. Repeat until the fragment isn’t found in your map or, for very large texts, until you hit a maximum number of iterations.
  4. Have fun with it – Try it on a whole bunch of stuff. It’s pretty neat.
Feel free to ask me questions if you’re stuck.
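Putting steps 2 and 3 together, here’s a minimal sketch in Python (the weighted random choice and the 200-character cap are my own choices, so tweak them as you like):

```python
import random
from collections import Counter, defaultdict


def build_model(text, order=10):
    """Step 2: map every `order`-character fragment to a count of the
    characters that follow it in the text."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model


def generate(model, order=10, max_chars=200):
    """Step 3: start from the most frequent fragment, then repeatedly
    pick the next character with probability proportional to how often
    it followed the current fragment."""
    # Most frequently occurring fragment; ties are broken arbitrarily.
    fragment = max(model, key=lambda f: sum(model[f].values()))
    out = fragment
    while len(out) < max_chars and fragment in model:
        counts = model[fragment]
        out += random.choices(list(counts), weights=counts.values())[0]
        fragment = out[-order:]  # slide the window forward one character
    return out
```

On the example sentence above, `build_model` maps ‘very brigh’ to {‘t’: 2}, and `generate` walks the chain until it reaches a fragment it has never seen (or hits the character cap).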

what does this mean: Probably nothing, but it is pretty cool. Who knows, though: use it to write those godforsaken English essays. If you’re anything like me, the output will probably make just as much sense.
