from the relax,-be-happy dept
One of the most important moments in the rise of a radical idea is when the fightback begins, because it signals an acceptance by the establishment that the challenger is a real threat. That moment has certainly arrived for open access, most obviously through moves like the Research Works Act, which would have cut off open access to research funded by the US government. That attack soon stalled, but the sniping at open access and its underlying model of free distribution has continued.
Here, for example, is an interesting post by Kent Anderson, who is CEO/Publisher of the Journal of Bone & Joint Surgery, with the title "Not Free, Not Easy, Not Trivial -- The Warehousing and Delivery of Digital Goods." The starting point is as follows:
There is a persistent conceit stemming from the IT arrogance we continue to see around us, but it's one that most IT professionals are finding real problems with -- the notion that storing and distributing digital goods is a trivial, simple matter, adds nothing to their cost, and can be effectively done by amateurs.
As a result, he thinks, there is "a consistent theme among dew-eyed idealists about publishing -- that digital goods are infinitely reproducible at no marginal cost, and therefore can be priced at the rock-bottom price of 'free'."
Well, they're certainly "infinitely reproducible", but nobody seriously claims that this can be done at zero marginal cost; the marginal cost is, however, extremely small. Indeed, in another post, Anderson himself provides a rough estimate for one part of that cost -- the online delivery of a 1 Mbyte file: $0.001. It's true that delivering millions of copies would add up to a more significant sum, but that ignores things like BitTorrent, which effectively shares the cost of distributing digital goods among the many downloaders. Using such P2P delivery systems, the cost to the publisher really is vanishingly small.
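To get a feel for the numbers, here is a back-of-envelope sketch in Python; the per-megabyte figure is Anderson's own estimate, while the download count is a purely hypothetical illustration.

```python
# Back-of-envelope delivery cost, using Anderson's estimate of roughly
# $0.001 to deliver a 1 MB file. The download count is a hypothetical
# figure chosen only to illustrate the scale.
COST_PER_MB = 0.001          # dollars per megabyte delivered (Anderson's estimate)
ARTICLE_SIZE_MB = 1          # an article-sized file, as in his example
downloads = 1_000_000        # hypothetical: one million downloads

total_cost = COST_PER_MB * ARTICLE_SIZE_MB * downloads
print(f"Delivering {downloads:,} copies costs roughly ${total_cost:,.2f}")
# -> Delivering 1,000,000 copies costs roughly $1,000.00
# With BitTorrent or another P2P system, downloaders contribute most of
# that bandwidth themselves, so the publisher's share shrinks towards zero.
```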
But Anderson thinks there are other issues:
Even beyond just their power requirements, digital goods have particular traits that make them difficult to store effectively, challenging to distribute well, and much more effective when handled by paid professionals.
Why might that be?
First, digital goods are not intangible. They occupy physical space, be that on a hard drive, on flash memory, or during transmission. A full Kindle weighs an attogram more when fully loaded with digital goods, and there are hundreds of thousands of Kindles in the field.
According to the source referenced, "the difference between an empty e-reader and a full one is just one attogram" -- a million-trillionth of a gram. Even with "hundreds of thousands of Kindles in the field," that extra fraction of a gram spread around the world is hardly going to be a major problem.
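For anyone who wants to check, the arithmetic is trivial; in this sketch the Kindle count is an assumption standing in for the quote's "hundreds of thousands".

```python
# Sanity-checking the extra "weight" of the data on the world's Kindles.
# An attogram is 1e-18 grams; the Kindle count is an assumption standing
# in for the quote's "hundreds of thousands".
ATTOGRAM_IN_GRAMS = 1e-18
kindles = 500_000            # assumed: "hundreds of thousands" of devices

total_grams = ATTOGRAM_IN_GRAMS * kindles
print(f"Total extra mass: {total_grams:.1e} grams")   # -> 5.0e-13 grams
# That is about half a picogram, spread across the entire planet.
```

But leaving aside the issue of weight, it's certainly true that this data takes up space on storage media: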
The proliferation of digital goods -- photos, music, Web pages, blog posts, social media shares, tweets, ratings, movies and videos, and so much more -- puts incredible and growing pressure on metadata management techniques and layers. This means building more and larger warehouses, which adds to both ongoing costs for current users and migration costs as older warehouses are outstripped by new demands. Megabytes become gigabytes become terabytes become zettabytes and beyond. Where will they all fit?
One answer is "in your pocket": according to Amazon, a 1 terabyte portable hard disc currently costs around $100. Yes, a zettabyte might be a little more pricey, but judging by this recent large-scale, real-life project, we're still in the sub-petabyte era, so storing all this data isn't really going to require a warehouse -- a few rack systems should suffice.
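A quick sketch shows the scale of the money involved; the $100-per-terabyte figure is the Amazon price mentioned above, while the archive size is an assumption consistent with the sub-petabyte projects referred to.

```python
# Storage cost at consumer prices: roughly $100 per 1 TB portable drive,
# as noted above. The archive size is an assumption reflecting the
# "sub-petabyte era" mentioned in the text.
COST_PER_TB = 100            # dollars, consumer portable drive
archive_tb = 500             # hypothetical: a 0.5 petabyte archive

cost = archive_tb * COST_PER_TB
print(f"{archive_tb} TB at consumer prices: ${cost:,}")   # -> $50,000
# Even tripled for redundancy and enterprise-grade hardware, that is a
# few racks, not a warehouse.
```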
But quite apart from where you are going to put it all, there is another question: where is all that important metadata going to come from? As Anderson rightly says:
Creating, updating, and tracking the metadata is a chore for owners of digital goods. Poor metadata -- like a photo name off your digital camera of DX0023 -- can make the photo hard to find or use. Better metadata -- usually applied by humans, like "Rose in bloom, August 2006" for that elusive photo -- makes more sense.
That's mostly true, most of the time. But in another paragraph, quoting from a description of the Library of Congress's effort to archive all Twitter messages since 2006, Anderson also shows us why metadata is not always an issue:
Each tweet is a JSON file, containing an immense amount of metadata in addition to the contents of the tweet itself: date and time, number of followers, account creation date, geodata, and so on.
That is, the data comes with "an immense amount of metadata" automatically, because of the way Twitter (wisely) designed its system. And even for datasets that require metadata to be applied by hand, crowdsourcing is proving an efficient and low-cost way of providing it.
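To see what "comes with metadata automatically" means in practice, here is a simplified, hypothetical example of the kind of JSON record described in the quote; the field names are illustrative rather than Twitter's exact schema.

```python
import json

# A simplified, hypothetical tweet record: the real Twitter JSON carries
# many more fields, and the names here are illustrative, not the exact schema.
tweet = {
    "text": "Reading about the economics of digital goods",
    "created_at": "2012-04-12T09:30:00Z",
    "user": {
        "screen_name": "example_user",
        "followers_count": 1234,
        "account_created_at": "2008-06-01T00:00:00Z",
    },
    "geo": None,             # geodata, when the user chooses to share it
}

# All of this metadata is generated by the system at the moment of
# posting -- no human cataloguer required.
print(json.dumps(tweet, indent=2))
```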
Other issues raised by Anderson are that digital goods need to be backed up and kept secure, but that's hardly rocket science: open source solutions that cost nothing to acquire (though naturally something to run) have been around for years.
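As a rough illustration of how routine this is, here is a minimal off-site backup sketch built on rsync, a long-established open source tool; the source directory and remote host are placeholders, not anything taken from Anderson's post.

```python
import subprocess

# Minimal off-site backup sketch using rsync, a long-established open
# source tool. The source directory and remote host are placeholders.
SOURCE = "/var/archive/articles/"
DESTINATION = "backup@mirror.example.org:/srv/mirror/articles/"

# -a preserves ownership and timestamps, -z compresses in transit,
# --delete keeps the mirror in step with the source.
subprocess.run(["rsync", "-az", "--delete", SOURCE, DESTINATION], check=True)
```

His main concern, however, seems to be about the physical infrastructure required: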
Digital warehouses are more expensive to build. Site planning is a major undertaking. A physical warehouse is something a small business owner can buy and construct with relative ease. They aren’t expensive (a concrete pad, a sheet metal structure, some crude HVAC, and a security system is usually all it takes). A digital warehouse is expensive to construct -- servers, site planning, redundant power requirements, high-grade HVAC, earthquake-proofing, and so forth. This means that digital goods have to work off a much higher fixed warehouse cost.
It seems unlikely that building a typical physical warehouse is cheaper than installing a typical LAMP stack on rented commodity servers in a few different geographical locations (or in the cloud) to provide resiliency and backups. This exposes the central problem with Anderson's argument about the amount of data that must be handled, and the huge and expensive infrastructure supposedly needed to handle it: he seems to be lumping together very different kinds of digital data.
In the realm of digital goods, we’re reaching a point at which we’re facing trade-offs. Already, some data sets are propagating at a rate that exceeds Moore’s Law, which may still accurately predict our ability to expand capacity. And these are purposeful data sets. As data becomes an effect of just living -- traffic monitoring software, GPS outputs, tweets, reviews, star ratings, emails, blog posts, song recommendations, text messages -- we as a collective will easily outstrip Moore’s Law with our data. If there’s no place to put it, and nobody to manage it, does it exist?
Yes, genomic data is spewing out of DNA sequencers at an incredible rate; yes, the Large Hadron Collider produces almost unimaginable quantities of data. But these are exceptions: nobody is talking about letting the general public access this stuff in the same way that they can download media files, say. As I've pointed out in a previous post, we are fast approaching the point where we could store every Spotify track on a single hard disc, and the same will soon be true for every film, book -- and academic article.
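As a very rough check on that claim, the arithmetic is straightforward; the catalogue size and per-track file size below are order-of-magnitude assumptions rather than official figures.

```python
# Rough check: how much space would an entire music catalogue need?
# Both figures are order-of-magnitude assumptions, not official numbers.
tracks = 20_000_000          # assumed catalogue size
mb_per_track = 5             # assumed size of a compressed track

total_tb = tracks * mb_per_track / 1_000_000   # megabytes -> terabytes
print(f"Approximate catalogue size: {total_tb:.0f} TB")   # -> about 100 TB
# Drive capacities keep growing, so the gap between that figure and a
# single disc keeps narrowing.
```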
For academic articles in particular, despite Anderson's title, it really is the case that storing and sharing them is nearly free, pretty easy and mostly trivial -- which is why open access makes sense and is constantly gaining ground. The sooner traditional publishers stop fearing and fighting this trend, the sooner they can embrace and enjoy the possibilities this new abundance opens up for them.
Follow me @glynmoody on Twitter or identi.ca, and on Google+
Filed Under: data, kent anderson, open access, publishing
Companies: amazon