

About the Charts

filed 13 June 2009, 15:36; byline: Matt Blind

The way some others do it:

Rankings reflect sales of graphic novels, for the week ending May 31, at many thousands of venues where a wide range of books are sold nationwide. These include hundreds of independent book retailers (statistically weighted to represent all such outlets); national, regional and local chains; online and multimedia entertainment retailers; university, gift, supermarket, discount department stores and newsstands. In addition, these rankings also include unit sales reported by retailers nationwide that specialize in graphic novels and comic books. An asterisk (*) indicates that a book’s sales are barely distinguishable from those of the book above. A dagger (†) indicates that some bookstores report receiving bulk orders.

This is from the Arts Blog of the Gray Lady, the venerable New York Times. They’ve been posting a Graphic Books Bestseller Chart for three months now; the pull-quote above is from the footnote to a recent list, after the jump and all the way at the bottom, and then offset by formatting the paragraph in italics so your eye will just glide over it on your way down to the links and the comments appended to the end.

“It’s just a technical note, don’t worry your pretty little head over our methodologies.”

The problem, of course, is that a “New York Times Bestseller List” is like the “Dow Jones Industrial Average”. Sure, each is just a listing of top performers compiled for the benefit of a newspaper using a decades-old secret formula; but the impact of the whole is so much more than just a top 10, or top 25, or the estimated value of an imaginary, arbitrary portfolio of 30 near-randomly selected stocks, the components of which are swapped out at whim. Each is designed from the get-go to seem authoritative while cherry-picking what its compilers care to track. I’m not sure why the Dow continues to get press, other than the fact that there is no good alternative, and tradition and inertia lend the Dow a gravitas that no new index (or mere average with obvious, independently verifiable inclusion standards and a larger data pool) can match.

And ‘Bestsellers’… For an author and publisher, the New York Times Bestseller imprimatur is money in the bank. They proudly emblazon said status on the cover of the book, and the lucky wordsmith will forever bear the sobriquet of “New York Times Bestselling Author”.

In the publishing world, this is a big deal.

Other papers-of-record (The USA Today list, for example, which is not only longer but more inclusive and, on its face at least, much more democratic) and even major retailers also maintain bestseller lists, but they’ll never be able to conjure the same magic as the New York Times. Something about old New York’s status as a publishing centre, and the close to 70 years that the NYT has published their charts, are what make their bestsellers ‘the’ bestsellers, but even Wikipedia can point you to older charts, and the controversy surrounding the term, and the different ways the term ‘bestselling’ is used depending on context, region, and even things like the format of the book and the venue in which it is sold.

It’s all hokum and snake oil. Hell, any wonk with a blog and too much free time on his hands can compile a chart. [*ahem*]

In this case, I can one-up the Times — I am proud to present: Transparency.

Here’s the method and methodologies, sources and scores, how I weight the data and why, and in way more detail than anyone really wants.

I don’t care if you want it or not; it’s not so much that my inclusion of this information makes my chart better or more accurate than Nielsen BookScan, or USA Today, or the New York Times, or ICv2, or a slate of retailers. (Retailers, for example, know exactly how many copies they’ve sold, and similarly Publishers know exactly how many books they get paid for, and none of that data is forthcoming.) My numbers are still just estimates; my sources are online retailers and who knows? They could be lying to us. It’s not about ‘proving’ my numbers are better; this is a good faith effort to share with you [and the rest of the uncaring internets] exactly what it is that I do. You are invited to make your own value judgements as to what it’s worth.

In a way, everything is also verifiable, though the sticky wicket is that my chart relies on ephemeral data that posts to the internet once before being replaced by a more current version. So short of exactly duplicating my data collection method, there may not be a way to call me on it, but sources are clearly identified, both here and in each post. Go look at the same websites, wallow in the same data set. Get a feel for the overall geist of online sales in the same way I have. Instead of closing off my sources, and hiding my process in a footnote, here is a great-gobsmacking-big invitation to share with me. Follow along with the home game version. A couple times a year I even post my entire spreadsheet, with 3 or 6 months’ worth of data. Dive In, Math is Fun!

##

Two months ago (at the time of posting: April ’09) I made the decision to change from just a manga chart to a bestseller list for all graphic novels. And I’m still trying to cope.

I’ve hit the limit of what one dedicated person can do on a part-time basis. In fact, we’re past that limit as I’m far less than ‘dedicated’ and will occasionally take a couple of days off to watch a set of newly acquired anime DVDs, or read through a half-dozen manga of a given series, because I am a loser fan boy first before I am a blogger or math nerd.

So. The charts post sporadically. As I get more details for this new listing nailed down, my weekly time commitment will also decrease, slightly, and so I should be able to settle back into a regular posting schedule, but this one minor (on the face of it) adjustment has thrown my overall progress back a year. Maybe more.

Enough editorial…

Hi, my name is Matt. This is RocketBomber.com, and this is where I post a bestseller chart for Graphic Novels.

The Core of the Charts is made up of data from three sites: Amazon, Barnes & Noble, and Borders.

Once a week, I visit each site to check their Graphic Novel categories, and I sort the search results by ‘bestselling’. The links above will pull up exactly that.

I then click through, page after page, and type the titles into a spreadsheet in the order that they are ranked on the sales site. [this is the hard part]

And once I have a full list, I assign points to the books depending on how highly they rank. Add up the points each title earns (and add on similar data from a half-dozen second-tier sales sites) to get a composite score, and there’s your ranking.

In concept, it’s that simple.

##

In practice, because the sites themselves can update as often as once an hour, after I load up a website & sort the search results, I then click open each new page in a new tab until I have 20-100 tabs open, representing a snapshot of the full sales (top 900-1200 titles) of this particular sales site over a relatively short time-frame (10-15 minutes). And then I start the data entry.

For Borders, which handily allows 50 titles to a page, I only have to open 20 or so tabs. For B&N, which can support 100 titles to a page but maddeningly restricts some searches to a mere 10 titles a page with no option for more, I load up 95 tabs. And even though Amazon defaults to 12 a page (a default that no one can change and, currently, one that must be navigated by clicking ‘next’ on each and every page), I also load up 95 tabs (1140 titles): even though I only want a top 900, Amazon search results include so much ‘noise’ that I know in advance I’ll need to skip between 175 and 200 titles because they aren’t graphic novels.

Dear Amazon, Newsflash: Just because Gaiman wrote it, doesn’t make it a comic.

Let’s go back one half step: the top 900 titles.

To compile my charts, for the top three sites (op cit. Amazon, B&N, Borders) I look up the First 900 graphic novels listed. Yes, I skip a few; as noted, not everything coming up on a search is a graphic novel. I intentionally skip some others.

[currently: I skip most kids’ ‘picture books’ and adaptations of classics and material in the public domain. — I ♥ the classics, and also love the comic adaptations as much as the next guy (or more), but with up to 5 different versions of a book, all from different publishers, my spreadsheet (and my poor brain) can’t track them all — does one consider the source (i.e. Huck Finn) or the imprint (i.e. Papercutz Classics Illustrated) as the ‘series’ in this case? On the one hand, all versions of the source book should be the basis for a title ranking — on the other hand, the consumer presumably would be looking for all adaptations under a particular imprint. There is a third case of course: it’s my chart and this makes my brain hurt so I just skip ‘em]

Matters of inclusion aside…

That was the top 900 titles. Here’s how I score them:

#1 gets 100 points.

Perfectly straightforward. And that’s my benchmark: #1 on Amazon, or B&N or Borders, is worth 100 points and everything else (lower ranked on one of these sites or appearing on a different site) is worth less; some fraction of 100. I only belabour this point because from here it gets messy:

#2 gets 99.7 points, and we proceed down the charts by increments of three-tenths through 234 titles (#234 scores 30.1 points) and then shift to increments of one-tenth (#235 scores 30 points, #236 29.9, and so on) through the next 200 titles and then we switch to increments of five-hundredths of a point…

…yeah, I know. Here, look at this:

#1 gets 100 points. Everything below that only scores some fraction of 100. By the time we’re ⅔ of the way through the source chart, that fraction is the nominal one tenth of a point (not quite zero, but close) and I keep scoring titles until I get sick of looking at the website, or I hit 1000 titles, or both. In practice (and in the chart above, and for all my GN rankings as posted to date) I’m going to push until I hit 900 titles and then (gratefully) stop, but one tenth of a point isn’t going to change anything and so long as the data looks good I reserve the right to keep on going. A lot of the “long tail” in my posted manga charts (5000+ total titles at the end of ’08, the last 300 or so only appearing once or twice in sources all year) came from this kind of extended data entry.
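For the spreadsheet-averse, the core curve described above can be sketched in a few lines of Python. This is only an illustration, not the actual spreadsheet. The function name `core_score` is made up, and two details are assumptions read into the description rather than stated in it: that the five-hundredths steps begin at #435, and that they continue until the score bottoms out at one tenth of a point (around #634, in the neighbourhood of the ⅔ mark on a 900-title chart).

```python
def core_score(rank: int) -> float:
    """Points for a title at `rank` (1-based) on a core chart.

    Works in integer hundredths of a point so the small steps stay exact.
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    if rank <= 234:
        # #1 = 100.0, dropping 0.3 per place; #234 lands on 30.1
        hundredths = 10000 - 30 * (rank - 1)
    elif rank <= 434:
        # #235 = 30.0, dropping 0.1 per place for the next 200 titles
        hundredths = 3000 - 10 * (rank - 235)
    else:
        # ASSUMPTION: 0.05 steps start at #435 and run down to a 0.1 floor
        hundredths = max(1010 - 5 * (rank - 434), 10)
    return hundredths / 100


for rank in (1, 2, 234, 235, 236, 434, 435, 634, 900):
    print(rank, core_score(rank))
```

The loop prints the score at each point where the description gives a figure or the step size changes, which makes it easy to compare against the numbers quoted in the text.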

Why all the decimals? I discovered early on (back in ’07) that if I posted nice round numbers it didn’t matter how I introduced, qualified, or explained the chart someone would mistake my score for an actual unit sales number. The simple solution (at the time) of dividing by ten — inserting a decimal point — instantly changed that. Since then I’ve modded the spreadsheet to incorporate the fractions.

So, that’s three charts that form my core, and the ‘fancy’ math I use to approximate sales.

[Note: scoring methods changed slightly starting with the charts dated 19 July; there is an explanation here. The chief upshot is a ‘fatter’ curve that reflects a greater emphasis on midlist and backlist titles, but the top of the chart does not change — #1 still equals 100 points — and everything else is still just scored at a fraction of that. A full accounting will be presented in the next update to the FAQ; until then please remember that while some of the arbitrarily assigned scores may be different, the Theory and Reasons behind the chart as presented below are still valid]

I also check one other bookstore’s site: Books-a-Million, but given their lower sales volume I discount their results slightly, and also delve less deeply: I check a top 300 (with #1 scored at 30 points, and decreasing by a tenth of a point down a straight line) with the addition of another 100 ranking titles at 0.1 points each, similar to above but stopping at #400.

Then there are a lot of top 100 charts from various sites (buy.com, Powell’s, overstock.com, deepdiscount.com, Tower, half.com) and also Amazon’s hourly top 100, which is different from the Graphic Novel ‘bestseller’ search results, oddly enough, and which I check 5 times a week — roughly once a day.

The ‘number ones’ at each of these sites score 10 points, and down the list by increments of 0.1 points until we get to #100, which scores a single tenth of a point. (Tower and half.com are proving to be of marginal utility; I may have to discount them further, i.e. #1 = just 5 points, or in the case of half.com just drop the site entirely, but as of June ’09 they are still components in the rankings)
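Both second-tier schemes (Books-a-Million and the top 100s) are the same straight line with a different starting point, so one small function covers them. A minimal sketch, assuming the description above is complete; `straight_line_score` and its parameters are illustrative names, not anything from the spreadsheet.

```python
def straight_line_score(rank: int, top_points: int, tail: int = 0) -> float:
    """Score that starts at `top_points` for #1 and drops 0.1 per place.

    The line reaches 0.1 at rank top_points * 10; `tail` extra titles after
    that also score 0.1 each, and anything deeper scores nothing.
    Works in integer tenths of a point to keep the arithmetic exact.
    """
    if rank < 1:
        raise ValueError("rank starts at 1")
    ramp_length = top_points * 10
    if rank > ramp_length + tail:
        return 0.0
    tenths = max(ramp_length - (rank - 1), 1)
    return tenths / 10


def bam_score(rank: int) -> float:
    # Books-a-Million: top 300 from 30 points, plus 100 more titles at 0.1
    return straight_line_score(rank, top_points=30, tail=100)


def top100_score(rank: int) -> float:
    # Top-100 charts: #1 = 10 points, #100 = 0.1
    return straight_line_score(rank, top_points=10)


print(bam_score(1), bam_score(300), bam_score(400), bam_score(401))
print(top100_score(1), top100_score(2), top100_score(100), top100_score(101))
```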

That’s where all the numbers come from: After looking at ten different sales sites and doing all the data entry and scoring 15 different source charts (with its hourly bestsellers, Amazon gets checked a total of six times) of varied lengths and value — and then doing it again, as each set of posted rankings pulls from two weeks of data, we have just started.

Now that we have data we can run them through the spreadsheet. The trick, of course, is teaching a spreadsheet how to grok book titles, and how to discern what books are part of which series. [It’s a matter of careful formatting more than anything else — the spreadsheet knows how to put things in order, and how to add, and it can also compare two line entries to see if part or all of the line is the same. Using these simple tools, it’s possible to compile a chart of rankings — if you’ve set the sheet up correctly.]
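The add-it-all-up step can be sketched like so. This assumes the hard part is already done, i.e. a human has typed every title in one consistent format so that identical strings mean identical books. The function name, the two simplified scoring helpers and the sample lists are all placeholders for illustration, not real chart data.

```python
from collections import defaultdict


def core_tenths(rank):
    # Simplified core scoring in tenths of a point (valid for the top 234 only)
    return 1000 - 3 * (rank - 1)


def top100_tenths(rank):
    # Top-100 scoring in tenths of a point
    return 100 - (rank - 1)


def composite_ranking(charts):
    """Sum each title's points across all source charts, highest total first.

    `charts` is a list of (ordered_titles, scoring_function) pairs, where the
    scoring function maps a 1-based rank to integer tenths of a point.
    """
    totals = defaultdict(int)
    for titles, score_fn in charts:
        for rank, title in enumerate(titles, start=1):
            totals[title] += score_fn(rank)
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(title, tenths / 10) for title, tenths in ordered]


# Placeholder data: three tiny 'charts' standing in for fifteen long ones
charts = [
    (["Watchmen", "Naruto 1", "Bleach 1"], core_tenths),
    (["Naruto 1", "Watchmen"], core_tenths),
    (["Watchmen", "Bleach 1"], top100_tenths),
]
ranking = composite_ranking(charts)
for position, (title, points) in enumerate(ranking, start=1):
    print(position, title, points)
```

Note that the sketch matches titles by exact string, which is precisely why the formatting discipline described above matters.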

##

Let’s assume that the same titles are all ranking in the same order on every sales site. Watchmen is number one everywhere, for example. (It sounds ridiculous, I know, but let’s go with this model.) If that were the case, and using the scoring method above, we’d get a top 900 titles that would score like this:

I’d posted a similar chart earlier in the teaser (actually the same chart, scaled differently) and in the comments JRBrown said, “To me this graph looks pretty similar to those charts of Amazon’s overall book sales that were so shocking 5 years ago, only a lot more compressed (with the top 100-150 books accounting for maybe half the sales?).”

Yes, exactly.

What I didn’t tell you is the chart above doesn’t represent actual sales, it’s only a model. This is what my approximation of online sales looks like in my monstrous, steam-powered difference engine computational works. And since I use weighted scores in comparing titles, this is why the Manga and Comics 500 are often referred to as online sales estimates. No one is giving me actual sales numbers, and these are the lengths (and depths, and breadths) to which I’ll resort to figure this out.

If we take the model and plug in the real data (the actual rankings found on online sites) along with a little ice, some lime juice, bar mix, triple sec and tequila and hit frappé, the graph looks a little more like this:

For all titles found and scored (2,725 over the two week period, 4 May to 17 May, charted above) at the ten sites currently tracked, with #1 (Watchmen) scoring close to the theoretical maximum number of points.
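As a back-of-envelope check on that ‘theoretical maximum’: assume a title sat at #1 on every source chart in both weeks, and assume the 15 weekly charts break down as described earlier (three core charts, Books-a-Million, six other top-100 sites and five checks of Amazon’s hourly list). The breakdown and the variable names are this sketch’s reading of the post, not figures taken from the spreadsheet.

```python
# (charts per week, points for #1) for each tier, as read from the post
TIERS = {
    "core (Amazon, B&N, Borders)": (3, 100),
    "Books-a-Million": (1, 30),
    "other top-100 sites": (6, 10),
    "Amazon hourly top 100": (5, 10),
}
WEEKS_PER_RANKING = 2  # each posted ranking pulls from two weeks of data

charts_per_week = sum(count for count, _ in TIERS.values())
weekly_max = sum(count * points for count, points in TIERS.values())
theoretical_max = weekly_max * WEEKS_PER_RANKING

print(charts_per_week, weekly_max, theoretical_max)
```

Under those assumptions the ceiling is 440 points a week, or 880 for a two-week ranking.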

This is where the scores lead us. And it all starts with #1 @ Amazon = 100 points.

##

Using the sales estimates (and occasionally, a smidge of extra math) I can then sort, bend, fold, spindle and mutilate the ‘main’ chart into a number of secondary charts:

  • The Top 50 Series chart uses the same scores assigned to books for the Comics 500, but with a sprinkling of extra math: A weighted score is determined using the points from the top two ranked volumes of a given series as a base, and only adding one tenth of the scores for all other books in the series. [read more]
  • The Publisher’s Scorecard is the most straightforward of the lot (provided I’ve entered the publishing info for the titles) — just look at the Top 500 and count: so many for DC, Marvel, Viz, so many for Tokyopop, Dark Horse, IDW, etc. Actually, I get the spreadsheet to count them for me, but that’s the gist of it.
  • New releases and preorders are almost as easy: once the publishing data for the books has been updated, a simple sort by date pulls up the requisite info for the post.
  • The “Midlist 500” is a re-ranking of manga volumes after excluding all non-manga, and also the books from the top 5 manga Series: At the time of this posting (13 Jun 09) the top 5 series are Naruto, Fruits Basket, Vampire Knight, Bleach, & Death Note; all together this represents some 150 books of which at least 100 are clogging up my manga chart. After excluding these volumes I then re-run and re-number the Midlist chart with the books that are left.
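The ‘sprinkling of extra math’ behind the series chart can be sketched as below: full points for a series’ two best-scoring volumes, one tenth of the points for everything else. The function name and the sample scores are invented for illustration.

```python
def series_score(volume_scores):
    """Weighted series score from the composite scores of its volumes.

    The two highest-scoring volumes count in full; every other volume
    contributes one tenth of its score.
    """
    ranked = sorted(volume_scores, reverse=True)
    return sum(ranked[:2]) + sum(ranked[2:]) / 10


# Hypothetical series with four charting volumes
example = series_score([60.0, 80.0, 30.0, 50.0])
print(example)
```

Here the top two volumes give 140 points and the remaining two add 8 more, so a long series gets some credit for its backlist without swamping shorter ones.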

Actually, The Midlist 500 is the reason I set up the spreadsheet and do the rest of the math.
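The Midlist re-ranking itself is just a filter and a renumber. A minimal sketch, assuming each entry in the main chart already carries its series name and a manga/non-manga flag; the `Entry` type, the function name and the two made-up midlist titles are placeholders.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    title: str
    series: str
    is_manga: bool


def midlist(ranked_entries, top_series, limit=500):
    """Drop non-manga and top-series volumes, then renumber what is left.

    `ranked_entries` must already be in composite-score order.
    """
    kept = [
        entry for entry in ranked_entries
        if entry.is_manga and entry.series not in top_series
    ]
    return [(rank, entry.title) for rank, entry in enumerate(kept[:limit], start=1)]


TOP_SERIES = {"Naruto", "Fruits Basket", "Vampire Knight", "Bleach", "Death Note"}

# Placeholder main chart, already sorted by composite score
main_chart = [
    Entry("Watchmen", "Watchmen", False),
    Entry("Naruto 1", "Naruto", True),
    Entry("Some Midlist Manga 3", "Some Midlist Manga", True),
    Entry("Vampire Knight 1", "Vampire Knight", True),
    Entry("Another Midlist Manga 1", "Another Midlist Manga", True),
]
result = midlist(main_chart, TOP_SERIES)
print(result)
```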

##

See also: The Old Faq (last updated 7 Mar 09)

Archived Lists:

Reconstructed 2007 Manga Chart
2008 Winter — unavailable; the charts were on hiatus 6 weeks through Jan/Feb while the spreadsheet was retooled.
2008 Spring [manga only]
2008 Summer [manga only]
2008 Autumn/Annual Summary [manga only]
2009 Winter [manga only] (coming July)
2009 Spring [finally, we’re posting all GNs] (coming July/August)

##

boilerplate anti-©:

Graphic Novel estimated online sales rankings compiled by Matt Blind for the benefit of the Comics Fan, Creator, and Publishing Communities and posted in the rankings category at RocketBomber.com. Derived from publicly available information; if you feel your intellectual property has been infringed upon then I’d advise you to chill, consult your lawyers again, maybe grow a thicker skin, and then also recognise that you’re getting a free, weekly link directly to your lovely offerings [right at the top of each post, in case you missed it] on a blog that specifically caters to fans of the medium. Maybe you should be sending me money, or free manga, as opposed to getting your boxers/panties in a bunch over imaginary copyrights.

All data as posted released back into the public domain (be free, little numbers, go frolic and prosper) with merely a humble request that you link rather than steal, and that any derivative works include an attribution and also remain free to all.

##

If you have questions, corrections, or concerns that should be addressed in the body of this post, please send an email to matt [at] rocketbomber [dot] com. Questions, corrections or concerns placed in the comments below will be addressed in a more casual manner after I’ve downed a few beers and am feeling saucy.



Comment

  1. before you call me on it:
    yes, the first image was compiled with a little photoshopery, but does in fact represent the actual Amazon ‘Graphic Novel’ ranking of each of those non-graphic-novels at time of posting.

    Comment by Matt Blind — 13 June 2009, 16:09 #

  2. I’m sure you’ve heard this before, but have you tried automating the data collection phase at all?

    Comment by bbot — 13 June 2009, 18:40 #

  3. @bbot:

    I’m a math geek, not a programmer. Also,

    How do you teach a program that Naruto 1, Naruto vol. 1, and Naruto: the Tests of the Ninja are all the same book? Or that the paperback and library edition are essentially the same book, even with different ISBNs? What about Naruto en Espanol, or the Naruto Chapterbooks (which also have a volume one) or the sites that list something as ‘Naruto’ only without a volume number?

    And with 10 different sites, there are going to be 12 different ways to list something.

    I can look at the cover thumbnail and read the volume number; I can glance at a listing and know exactly what it is, even with typos, even if the listing is partially incorrect — doubt a web spider can do that.

    I’m not ‘indexing’ a page for later searches, and ‘keywords’ and the like are of little utility. I need an ordered list of books as they appear on a site, and so far I need a human to actually look at it so I know the list is correct. In the amount of time it would take me to interpret the results, to review and fix ‘automated’ errors, I could probably do the data entry myself.

    (There exists the possibility of ‘automating’ the process by paying someone in India to do it, but that’s a tad expensive — the services I’ve seen online charge the same per hour, or more, as I get paid at work. Talk about diminishing returns)

    Like I said, I’m not a programmer. There may be some magic out there that would enable a computer to acquire the data for me. This is one of those cases, though, where the cheapest and fastest way is to get a human to do it — Or I can do it myself, since there isn’t a human willing to do it for me.

    Comment by Matt Blind — 14 June 2009, 00:27 #

  4. Not extensive automation, but some fairly trivial workflow stuff, like a wget job that grabs the first 1000 pages and concatenates the html files, so instead of 95 tabs, you have 1. You’d have to strip the <html> tags, then add them at the start and the end of the concatenated file, or not, depending on your browser’s ability to handle hideously malformed html.

    Actually, two seconds.

    Comment by bbot — 15 June 2009, 00:21 #

  5. First thing, I wrote a little bash script to generate the links list, since it would, of course, be 101 lines long, and I didn’t feel like copy and pasting each one from firefox, or whatever.

    First, let’s look at the base url.

    http://browse.barnesandnoble.com/browse/nav.asp?No=10&N=0+989443&Ne=989443&visgrp=fiction&act=BC_ANC

    Clicking on “next” results in:

    http://browse.barnesandnoble.com/browse/nav.asp?No=20&N=0+989443&Ne=989443&visgrp=fiction&act=BC_ANC

    The string of interest is “No=20”, which increments by ten every page. So the resulting script is:

    for (( i=30; i <= 1000; i=i+10 ))
    do echo "http://browse.barnesandnoble.com/browse/nav.asp?No=$i&N=0+989443&Ne=989443&visgrp=fiction&act=BC_ANC"
    done

    This script creates a variable (i), sets it to 30 (i=30), tells it to run the loop as long as i is less than or equal to 1000 (i <= 1000), then increments i by 10 each loop. (i=i+10). It then inserts the current value of i in the base url. It starts at 30 because I already had the first three pages of sales ranks.

    Running this script and piping the output to “links.txt” results in a 101 line long text file, with a url on each line. We tell wget to use this file, to wait a random time between 0 and 2 seconds between hitting each link, and to use the user-agent “Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618)”, who happened to be the most recent visitor to bbot.org when I plundered the log for user agents. We also tell it to log to wget.log, and to be verbose, because --verbose is awesome.

    wget --random-wait --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618)" --input-file=links.txt --output-file=wget.log --verbose

    Wget then hogs the terminal, but remains silent, because we told it to output to wget.log. We can watch along in another terminal by tail -f’ing wget.log, and occasionally ls’ing. It goes pretty fast, taking 20.52 seconds to download 7.4 mebibytes worth of html. After cat’ing everything you end up with a mighty browser-crushing 7.4 mebibyte html that makes Opera colossally unhappy, but which Firefox can handle fine-ish, and takes forever to upload, so we bzip it to a svelte 220 kibibytes.

    http://bbot.org/sales.html.bz2

    Standard large html file disclaimers apply. It’s big big big, media rich, and outrageously malformed.

    Comment by bbot — 15 June 2009, 01:40 #

  6. Thanks, bbot, I might give that a try. If I can decipher just what it is you’ve recommended ;)

    Comment by Matt Blind — 15 June 2009, 06:07 #

  7. There’s not a whole lot to decipher. It’s the first thousand graphic novels in one HTML file, instead of 101.

    Comment by bbot — 15 June 2009, 17:26 #

  8. @bbot:

    Thanks for the work, and the link, but an automated way to load web pages (while handy) kind of misses the point:

    It’s not the 3 minutes I spend clicking links at B&N, or the 20 minutes spent navigating Amazon, or the 90 secs. one needs to load Borders offerings — It’s the 12 hours after that, typing the titles found into a spreadsheet. Automation only goes so far.

    This isn’t so much a matter of html as translation — B&Nspeak or Amazonspeak into a single universal graphic novel tracking spreadsheet.

    [note my earlier comment about the difficulty of abstracting actual facts — that’s the problem at hand, as opposed to merely manipulating urls and loading html]

    The other reason I break down the front door and do a brute force load-in of a website, is the capricious nature of sites: where does a fancy script get me if one (or all) of my sources change the way they serve up pages? I have a half-hour to physically click links-on-pages each week; that’s the small ante. The big stake is the rest of the process, the days of data entry that follow just looking up web pages.

    Comment by Matt Blind — 1 July 2009, 15:00 #



