Can Science Tell Us How To Write A Bestseller?

That’s a bullshit Buzzfeed headline.  I’m sorry I led you on.  Of course science can tell us how to write a bestseller… or it will.  I’m pretty sure I’ll see a written-literally-by-the-numbers bestselling book arrive within my lifetime as we throw increasingly-sophisticated analyses at plot structure and prose.

Writers think that writing is special, that no machine could ever do it as well.  The same could be said of wine experts and sommeliers; computers don’t have noses or tongues.  Except as it turns out, if you do analyses of weather patterns and soil samples, computers can predict which years will be the “good” years with increasing – and terrifying – accuracy.

Assuming computational processes continue at their current rate, humans may well find that they have become inferior at every task they do.  (And yes, there’s a novel in there about what happens when humans have built a world where they are both the focus of, and simultaneously the least talented thing within, all creation.)

Writing will take a while to crack, but there’s been what’s claimed to be the first computational study of “What makes a bestseller?”  The paper can be found here. They claim to be able to predict the bestseller-like nature of a book with as much as 84% accuracy from prose stylings alone.

Which makes sense.  Bestselling books would have certain characteristics in common, and prose would be among them.  Prose is, also, the easiest thing for computers to hash out, as prose is merely an assemblage of word choices and sentence lengths.
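
To make "word choices and sentence lengths" concrete, here is a minimal Python sketch of the kind of measurements a computer can pull out of a passage. To be clear, this is not the paper's feature set: the function name `prose_features`, the naive sentence splitter, and the "vocabulary richness" ratio are all assumptions made up for illustration.

```python
import re

def prose_features(text):
    """Return a few crude style measurements for a passage (illustrative only)."""
    # Naive sentence split on ., ! and ? -- it will trip over abbreviations like "Mr."
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased runs of letters and straight apostrophes count as words.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "sentences": len(sentences),
        "words": len(words),
        # Average number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Distinct words divided by total words: a rough proxy for word choice.
        "vocabulary_richness": len(set(words)) / len(words) if words else 0.0,
    }

sample = "The man ran. The man ran fast. He did not stop!"
print(prose_features(sample))
```

Run that over a whole novel instead of three toy sentences and you get a handful of numbers describing its style, which is roughly the raw material any prediction would have to start from.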

I’m just not convinced this study did it.

The study took about 2,000 works from Project Gutenberg – i.e., books in the public domain – and broke them down.  (I wish I could see the list of books, but clicking on the link in the footnote gives me an error, and copying and pasting the URL merely does a Google search for me in both Firefox and Chrome.)  Then they used “the formula” against a bunch of random Gutenberg books and a handful of public-domain books, matching the prediction against the actual bestsellerhood.
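
For anyone wondering what "matching the prediction against the actual bestsellerhood" boils down to, it is basically counting hits. A minimal Python sketch, assuming each book gets a yes/no prediction; the `accuracy` helper and the five labels below are invented for illustration and are not data from the paper.

```python
def accuracy(predicted, actual):
    """Fraction of books where the predicted label matches the real outcome."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if not actual:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Made-up labels: True means "bestseller", False means "not a bestseller".
predicted = [True, True, False, False, True]
actual = [True, False, False, False, True]
print(accuracy(predicted, actual))  # 4 of 5 match, so 0.8
```

The number only means something if the books being scored were not among the ones used to build the formula in the first place.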

But I dunno.  It’s a really good start (and some of the discussions on readability vs. success are fascinating – seriously, the paper’s pretty readable when you screen out the mathiness) – but I’m not convinced Gutenberg is the place to start, or that even 2,000 books is a good enough sample size.

For me, the definition of “bestseller” prose has changed in my lifetime – Ernest Hemingway used to be a crazy outlier, where even pulpish escape books tended to have denser prose, whereas now authors like James Patterson with his infamously-short 800-word chapters clutter the lists.  Which is not to say that there aren’t dense tomes charting to this day – some love the thicket of escapism – but I do think that using Project Gutenberg tends to skew analysis towards “What made a bestseller in 1940?” as opposed to today’s trends.

And I also think there’s a subtle bias in Gutenberg, in that while some dead-selling books get entered out of the thankful diligence of its volunteers, the books that make it in generally have some distinction – which is to say, a stickiness that marks them in some way memorable.  We’re generally not seeing the pulped failures lumped in.  So it’s not quite a fair contest.

No, what I’d like to see – and I know the writers of the analysis were handicapped by not having access to this data – would be, say, an analysis of 5,000 random manuscripts submitted to various publishers, showing what prose-trends were useful in a) getting published, and b) becoming a bestseller.  Or an analysis of 2,000 books taken randomly from the “thriller” section of Amazon published in the last five years, so we can see what prose styles appeal to modern readers.  Or an analysis of the distinctions in prose styles and how they work – as in clearly James Patterson and Tom Wolfe appeal to different audiences, but how can we break down those styles to see the hallmark elements of each?

That day is coming when science can really do it.  I wouldn’t be surprised to have some slush-reading done by computers in the future.  And have that slush-reading be done well.  But I’m not convinced this study is more than a good first step.

So.  How would you analyze it, techie people?


  1. Emerentia
    Jan 14, 2014

    I’ve only briefly skimmed the linked paper, and I know next to nothing on natural language processing, so I’m wildly out of my depth here. But it’s an interesting paper, so thanks for the pointer.

    I think it would be interesting to correlate sentence length with success.
    Also, it kind of reminded me of the Language Log’s infamous article on Dan Brown, so I’d be curious to see how many traits those books share with other bestsellers.
    There seem to be an infinite number of things one could do.

    It also seems to me that the question one ought to be asking is not “can science teach us to write a bestseller?”, but rather “Can we predict success of a book based on the prose?” I think the latter is much easier to answer positively, and my guess is certain publishers would probably pay a lot of money for it … 😉

    • TheFerrett
      Jan 15, 2014

      It also seems to me that the question one ought to be asking is not “can science teach us to write a bestseller?”, but rather “Can we predict success of a book based on the prose?”

      Yes, but I want to know how to write a bestseller, and they want to know how to find one. 🙂

      • Emerentia
        Jan 15, 2014

        Well, I *think* there may be a way to get google books in plaintext, which would provide a much more up-to-date dataset.
        My fingers are itching to play around with it, if I find the time. If you have specific questions to ask of the data, let me know.
        If I figure out how to write a bestseller, you’ll be the first one I’ll tell. 🙂

        • TheFerrett
          Jan 17, 2014

          Please do. I’d be really curious to see what it’d be like with more modern data.
