Dueling Data Where, Again, the Point is Missed

[Edited to add: Please carefully read the comments to this post. There are remarks from people with expertise in data analysis. I would also urge everyone to read this post at Dear Author. Note as well that my expertise is in building databases. On a daily basis, I see how bad data architecture renders data untrustworthy. This is related to, but not the same as, expertise in conducting a study and analyzing the results.

Basically, we have three flawed “studies” and my argument here is that publishers and authors alike may be missing the point.

Here’s another post to take a look at, from Courtney Milan, who also has data analysis expertise.]

So, Digital Book World did this study of authors and income from writing.

Then Beverly Kendall did a study …

Then Hugh Howey sponsored a study.

I would like to observe that Beverly Kendall’s study was closer in type to the DBW study but a girl did it so nobody cares about the results — except the mostly women who understand the point very well, thank you.

The DBW study polled authors, anywhere from 30-60% of whom were unpublished.

Beverly Kendall’s study polled self-published authors (some of whom still traditionally publish), 100% of whom had at least one book on sale.

The Howey study grabbed 24 hours of Amazon sales ranking data, so it’s not really the same as either of the other two studies. With the Howey data, there are several weaknesses: 24 hours of data is not a basis for extrapolating future performance. You’d have to gather the data over a period of time before you could say much about trends, for example. From what I could see, the data analysis did not account for the fact that a price could, theoretically, change during the 24 hours polled. (A book could go on sale at 10 AM PST such that from 00:00 to 10:00 PST the book sold at price x and from 10:01 PST to 23:59 PST the book sold at price y.)
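
To make the price-change problem concrete, here is a minimal sketch. Every number in it is made up for illustration (the prices, the unit counts, and the 10:00 switch), and the function names are my own, not anything from the Howey report. It compares a revenue estimate that applies one observed price to the whole day against one that uses the price in effect at the time of each sale.

```python
from datetime import time

def revenue_single_price(units_sold, price_cents):
    """Naive estimate: assume every unit that day sold at one observed price."""
    return units_sold * price_cents

def revenue_with_price_change(sales, price_before, price_after, change_time):
    """Add up revenue sale by sale, using the price in effect when each sale happened."""
    total = 0
    for sold_at, units in sales:
        price = price_before if sold_at < change_time else price_after
        total += units * price
    return total

# Hypothetical day: the book drops from $4.99 to $0.99 at 10:00.
# Prices are in cents to avoid floating-point rounding.
change_time = time(10, 0)
sales = [
    (time(8, 0), 10),   # 10 units sold before the price change
    (time(14, 0), 90),  # 90 units sold after the price change
]

naive_cents = revenue_single_price(100, 499)
actual_cents = revenue_with_price_change(sales, 499, 99, change_time)

print(f"one-price estimate: ${naive_cents / 100:.2f}")
print(f"price-aware total:  ${actual_cents / 100:.2f}")
```

With these invented numbers, the one-price estimate is more than three times the price-aware total. Real data might show a much smaller gap, or none, but you can't tell which without knowing when the price changed.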

What’s clear from the Howey data is that Indie books are a significant presence in the top 7000 books.


And now the DBW and Howey camps are all arguing and missing the point, which I will make for everyone in just a bit. For once, a DBW data analysis post was reasoned — because it was written by the data guy. His points about the flaws in his own data and the flaws in the Howey data are well taken. NOTE: I am NOT a statistician.


DBW insists: authors as a whole just don’t make very much money. (DON’T LOOK AT BELLA ANDRE!!!)

This is true.

DBW suggests that the authors who are making money are the elite. The authors of the Howey 7000 (titles) are the elite, trad pubbed or self-pubbed. (Nice. Let’s just define authors who make money out of the analysis. Because that leaves you with the ones who aren’t.)

The DBW/Trad pubbed camp continually harps on the fact that most authors (where you define “author” to include “anyone who wants to write even if they have no books on sale”) don’t make very much money.
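
Here is a rough sketch of how much that definition can move the numbers. The incomes are invented, and the sketch assumes unpublished respondents are counted with zero writing income, which may or may not match how the DBW survey handled them. The 50% unpublished share is simply a value inside the 30-60% range mentioned above.

```python
from statistics import median

# Hypothetical annual writing incomes for five authors with books on sale.
published_incomes = [500, 2000, 8000, 15000, 60000]

# Assumption: unpublished respondents are recorded as earning 0 from writing.
unpublished_incomes = [0] * 5  # 50% of the combined sample

median_published = median(published_incomes)
median_all = median(published_incomes + unpublished_incomes)

print("median, authors with a book on sale:", median_published)
print("median, everyone who answered:", median_all)
```

Nothing about the published authors changed between the two lines of output. The only thing that changed is who got counted as an author.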

The not-so-subtle subtext behind an observation framed in this way is this: why self-publish when you can trad-publish and have all the hard work of covers, editing, and marketing done for you! LOOK AT NORA!!! — And STILL make not very much money, but whatever.

Allow me to make the point

The point is NOT that as an aggregate, authors don’t make much money.

The point is that if you define author as “someone who has at least one book on sale” AND it is true that the author writes well enough that a traditional publisher would pay them to write for their house, the data from the Howey 7000 AND the Kendall 100% points to a very different conclusion.

The conclusion is that such an author has compelling reasons to choose self-publishing over traditional publishing.

Beverly Kendall’s data shows quite clearly the set of conditions that lead to making money as a writer, but that’s the girl talking and as usual, the boys can’t hear her.



15 Responses to “Dueling Data Where, Again, the Point is Missed”

  1. Sunita says:

    This is less a statistics question than an issue of research design and inference questions. Getting a day’s worth of data on Amazon ebook bestsellers is not, in and of itself, a bad idea. But the inferences the HH study draws from the available data are (as I said on Twitter) worthy of a What-Not-To-Do segment in a Quantitative Methods 101 class. You’ve identified the main one, which is that you cannot estimate an over-time trend from a cross-sectional analysis.

    Multiplying a book’s position by 365 to get the yearly output is equivalent to going to one baseball game and then claiming you can describe what happened in every other game that season. No.

    The other problem is more of an ecological inference problem. He’s making predictions (and recommendations) about author *behavior* based on book placement on a ranking. Quite apart from the question of how that ranking is calculated (an algorithm we don’t have the equation for), you can’t analyze the author by the book. There are all kinds of author choices that feed into that final ranking result (subject of the book beyond crude genre category, writing quality, production quality, promotion effort, promotion type, author name recognition, etc. etc.). The only data we might have about author characteristics is the backlist; I’m assuming the # of books observation is a measure of backlist available, but I can’t know for sure because there is no comprehensive description of the data that I can find.

    Howey may be right; it may be the case that more authors are better off self-publishing than are better off going the traditional publisher route. But his data aren’t capable of showing that.

    • Yes! And thank you for your comment.

      My understanding is Howey intends ongoing data collection. The data over time should have some interesting information.

      My first thought was that the spreadsheet is not actually the raw data. If the analysis is being run from Excel, that’s an issue. I would expect that the data is going into at minimum a MySQL db, normalized, and then analyzed and exported to Excel. But maybe not, because there are formulas in the spreadsheet, which you would not need if the data is in an actual database.

      There was no mention of the use of any of the statistical applications that are standard for anyone who knows what they’re doing when assembling statistical data sets — side note: The DBW data person DID name one of those applications.

      • Sunita says:

        Lots of people in my field enter data in an Excel sheet and then convert it to be used in Stata, R, or whatever. I agree that relational databases like MySQL are superior in a lot of ways, but great analyses have been produced from data that started in a flat file. So I don’t have a problem w/the Excel file per se, except that in order for *me* to run any kind of reasonable statistical estimations I’d have to convert quite a few of the observations. I can do that fairly simply, but I find it annoying that we are told “here’s the data” as if it’s in an analyzable form at the moment. It’s not.

        I don’t see any evidence of anything beyond descriptive statistics being employed, and not nearly enough of those. We don’t even get the standard descriptive statistics that tell us the shape of the data; the DBW table does that for us. No correlation matrix, so we don’t know whether the bivariate relationships he’s reporting in those colorful charts would hold up in a multivariate analysis. I could go on, but you get the idea.

  2. Sunita says:

    Oh, one more point: The person who wrote the DBW column, who I presume is the one who conducted the earlier study, is a woman, not a man. I remember talking about that study and agree that it had major flaws.

    But right now she’s getting hammered as elitist for saying: “Not everyone has the kind of training and expertise I bring to this type of research with my doctorate and years of research and teaching.” Maybe she could have put that better, but that is absolutely true. She’s a sociologist with extensive statistical training and she directs a Master’s level program in Data Analytics and Applied Social Research, which basically means she’s good enough at social-science statistics to run a graduate program. I wonder, apropos your other point, if she’d be getting the same kind of criticism for flaunting her professional qualifications if she were a man.

    • How did I miss that! I think you’re right. And the problem is people who don’t do the data analysis part have no idea of the nuances. She is totally qualified to be talking about the data. And yes, she did say she did the DBW study.

      I have to wonder the same thing about any flak she’s getting. I’ll have to go look at the post again. Because she has the chops.

  3. Ros says:

    The other thing which is frustrating is of course that the DBW data and analysis is only available if you pay $300. If not, you’re dependent on their press release, which may be completely accurate but is still of limited use. I take Sunita’s point that Howey’s data isn’t immediately useable, but it is there and we can see exactly where the flaws in his analysis lie.

    But I also wonder if it really matters all that much. When I was doing some analysis of Beverley Kendall’s results, I made the point that writing isn’t a career that works on the basis of averages. Knowing what an average teacher’s salary is, or an average doctor’s salary is a relevant factor when considering embarking on those careers because your salary is quite likely to end up similar to that. Knowing an average writer’s income (even if you break that down by genre, or route to publication) isn’t a good predictor of any individual writer’s income.

    Part of the problem with the Howey analysis is that he doesn’t really know what questions he’s asking, I think. And so he hasn’t thought about what data-gathering is going to work best. Someone’s just given him a whole load of numbers and he’s made whoopee with them.

  4. Jami Gold says:

    I ignored all the future-telling of the Howey report. My takeaway from his article was some anecdotal (meaning, one day) numbers on the sales percentages of ebooks vs. print for those genres.

    That’s useful information to be sure, but I wouldn’t try to extrapolate all the other things people are talking about.

    I was much more impressed with Beverley Kendall’s survey and did a whole post about how her results offer not only information, but also lessons we can apply to improve our chances of success.

    As she posted on FB, what’s the point of hand-wringing over the majority of self-published authors doing poorly if we’re going to emulate the 20% who are doing it “right”? Her results allow us to separate out that 20% and learn what approaches to emulate.

    • Sunita says:

      There are a few problems with the Kendall survey:

      (1) It’s a non-random sample, it’s self-selected, and the data are self-reported. That’s three ways in which it has the potential to be skewed. You can assume it’s not representative.

      (2) It’s overwhelmingly romance authors, which is OK if you want some (probably) non-representative data on that group (the fact that it’s hundreds of authors is very good, of course). But the comparisons she draws with non-romance authors are untenable (not enough respondents and not representative).

      (3) The sub-group conclusions she draws become less and less likely to be useful as the samples get smaller. 20 non-representative, self-reporting people are just not going to give you data you should rely on.

      (4) There are plenty of authors who write in more than one subgenre. How did she treat those data? She doesn’t tell us that.

      I’d be very, very wary of taking “lessons” from that data.

      I think it’s great that Kendall went to all this trouble, and the fact that so many people were willing to provide information speaks well to *eventually* getting reliable and valid studies, from which we might well be able to take lessons. But we’re not there yet.

      • Sunita:

        Yes, the Kendall study has all the issues of self-selection and a fairly small sample size. It pretty much only went to people on a couple of the Romance Self-pub loops.

        Not to mention, issues with wording of the questions and answer selections.

        What I took away from the Kendall data was the (apparent) link between professional editing, professional covers, and sales.

        But there’s nothing to say whether the authors doing those things are doing better BECAUSE of those things or because they are, in general, better, more savvy writers.

        I think there’s significant overlap. Purely anecdotally, it’s my observation that the authors who get professional editing understand that it makes their book stronger and they are willing to put the right money toward the right editor. The writers I see who resist paying for editing do not, quite often, have the experience with the kind of editing that makes a book better, AND they tend to be weaker writers.

        So, to put it bluntly, less talented authors forgo editing because they can’t see how badly they need an editor.

        What we need is someone who understands how to put together a study that avoids unintended bias in the questions and in the way answers are accepted, and also knows how to deal with the data collected.

        • Ros says:

          Right. When I was analysing the Kendall report, I suggested that what we can conclude from it are ‘the habits of successful self-published authors’. It’s very hard to differentiate correlation from causation on these kinds of things, other than anecdotally.

  5. Ros:

    Yes, as to the DBW report.

    The real issue, in my opinion, is that contrary to public statements, publishers ARE losing the midlist and it DOES matter. Because, and again contrary to some public statements, the midlist is not a money losing drag on publishing. As Isobel Carr pointed out, if the Midlist is such a drag on the bottom line, why don’t trad publishers gleefully revert rights to those supposedly money losing books? — Answer — because they are not money losing.

    What we are seeing is the early signs of a reaction from trad publishers, which is why we hear so much about how little SP authors make, in the aggregate. That deflects the argument from the actual issue, which is, if you have a traditional contract offer, how does the SP route compare to that?

    The answer, for the romance midlist, and with the current trad contract clauses, not so well at all, for people who would fall into the midlist.

    • As Isobel Carr pointed out, if the Midlist is such a drag on the bottom line, why don’t trad publishers gleefully revert rights to those supposedly money losing books? — Answer — because they are not money losing.

      Well, that’s not quite right.

      First, all costs associated with midlist books are sunk costs, and so the books can still be money losing, while holding onto the rights can be a money maker.

      Second, I think most publishers are holding onto the rights they do on the off chance the author hits it big and they discover they’re not holding a midlist author, but a huge hit.

      Just think how much Nora Roberts’s old Harlequin books are worth to Harlequin. So yes, they’ll hold onto hundreds of books that make almost no money on the off chance that they don’t relinquish the rights to the next possible Nora Roberts.

      • Well, dang. WordPress ate my brilliant reply. Starting over.

        Thanks, Courtney, for the reply and clarification.

        Yes, publishers are not at all motivated to return rights, whether by contract or buy out. I understand the position of not wanting to let an author buy back rights– they have a business to run, and they licensed the rights for favorable terms. I have a bigger problem with what seems an awful lot like ignoring the provisions of contracts to which they are also a party when an author requests a reversion and gets crickets in response. But that’s a digression.
