Big data publishing: too good to be true

Cross-referencing and analysing huge amounts of data and reporting on your findings is what we understand to be big data publishing. It's the new buzzword in publishing because it has the potential to open up new opportunities for making money with content. But big data analysis is extremely difficult and expensive.

Big data analysis in Europe: we saw one of the earliest examples when a dozen or so journalists from across the EU decided to expose the biggest fraudsters of the Community. They worked with data that linked organisations to bank savings accounts in the Bahamas and other tax shelters. The story that followed was supposed to nail these organisations to the wall.

The journalists were lauded for their effort. Politicians made brave statements about tackling tax fraud. But it wasn't all glorious. First of all, mistakes were made. For example, much to the amusement of some, Belgian journalists exposed their own government's development fund for third countries, which had apparently parked money in a tax shelter because doing so allowed it to keep 100% of the interest on the fund. This was declared to be "good management".

But there's more about big data that might be less praiseworthy. Just as with statistics, big data can prove pretty much any sort of idea, depending on how it's analysed. A commonly used method is statistical inference, which gives big data analysis a certain predictive capacity. Other methods may border on the black box model as used by financial analysts, where the analyst doesn't really know exactly what the model does.

Big data publishing

Big data publishing is like a labyrinth without an entry or exit. You don’t know the accuracy or relevance of the data, and the stats can prove anything.

Mistakes and the use of methods that are obscure to the common folk… Publishers who want to use big data to make their content more "payment-prone" should be aware that it can backfire in nasty ways. First there's the cost. But then there are other reasons why you should be very careful.

According to an article in the WSJ, experts who specialise in computer-driven analysis of large streams of information say too many companies throw themselves into big-data projects, only to fall into common traps and end up with nothing to show for their efforts.

While the Wall Street Journal discussed corporate use cases that have gone awry, these cases do show that big data needs a very careful approach.

How about data quality and relevance? As Tamoor Zubair at Internet Evolution states: poor or missing data can ruin the analytics. How can journalists guarantee that the data they're working with is of sufficiently high quality to draw relevant conclusions? The answer is in most cases: they cannot.

But Nassim Taleb at Wired hits the sore spot for me. He says: "… big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal). It’s a property of sampling: In real life there is no cherry-picking, but on the researcher’s computer, there is."

And: "Large deviations are likely to be bogus. We used to have protections in place for this kind of thing, but big data makes spurious claims even more tempting. And fewer and fewer papers today have results that replicate: Not only is it hard to get funding for repeat studies, but this kind of research doesn’t make anyone a hero. Despite claims to advance knowledge, you can hardly trust statistically oriented sciences or empirical studies these days."
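Taleb's point about the spurious rising to the surface is easy to try for yourself. What follows is only a rough sketch of my own, not anything taken from his article: the function names, the choice of 30 observations and the use of pure random noise are all assumptions made for illustration. It fills a table with columns of random numbers that have nothing to do with each other, then looks for the strongest correlation between any two columns.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_variables, n_observations, seed=0):
    """Return the largest |correlation| found between any two pure-noise columns.

    Hypothetical helper for illustration: every column is independent
    Gaussian noise, so any correlation that turns up is spurious by design.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    # Cherry-pick the single best-looking pair, as a careless analyst might.
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


if __name__ == "__main__":
    for n_variables in (5, 50, 200):
        best = strongest_spurious_correlation(n_variables, n_observations=30)
        print(f"{n_variables:>3} noise variables -> strongest correlation {best:.2f}")
```

The more columns you add, the more pairs there are to pick from, and the stronger the best "relationship" looks, even though every column is noise. That is the cherry-picking on the researcher's computer that Taleb describes.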

If you can't trust scientific papers based on big data, what are the odds you can trust journalists who aren't even trained in the discipline to discover really valuable information that readers would love to pay for?

Big data could potentially mean big business for publishers who have the knowledge in-house to deal with it as it should be dealt with (meaning: a platoon of maths doctors). Few publishers can boast of employing such brains, because these people can get rich much quicker elsewhere.

Making money from big data as a publisher: perhaps too good to be true.