Big Data in publishing: pitfalls

A group of 80 journalists across multiple EU countries decided to go on a Big Data analysis adventure. They wanted to expose tax fraudsters in the EU. The results were so-so. It shows that Big Data needs Big Brains.

Say you want to expose tax fraudsters without Big Data. As a journalist, you'll try to infiltrate the circles of rich people who have money stashed away in countries like the Bahamas or Liechtenstein. That takes a lot of time and energy, and the outcome is uncertain. You could end up with a story, or with nothing at all.

Say you want to position a new product on the market. You'll conduct market research, often done by students harassing people while they're shopping. Based on the answers people are willing to give on your questionnaire, you'll decide how to sell your product.

While my examples are grossly oversimplified, both are more or less what old-school data analysis dictates: you go after the facts, and based on those facts you make your decisions.

What is Big Data? It's millions of data points on businesses and consumers, created mostly through online actions and activities: purchase interests, social behaviour, education level, family status, financial rating… To collect and analyse these huge volumes of data, you either buy an expensive service or write a script. For the purposes of this article, Big Data will mean the data generated daily by people on social media, microblogs and the like: tweets, Facebook status updates, listings on second-hand sites, and so on.
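To make the "write a script" option concrete, here is a minimal sketch in Python of what such a collection script could look like. The endpoint, query parameters and response shape are all invented for illustration; any real social media API will differ and usually requires authentication.

```python
import requests

# Hypothetical endpoint, standing in for whatever social media or
# microblog API you actually have access to.
API_URL = "https://api.example.com/v1/public_posts"

def collect_posts(keyword, max_pages=10):
    """Collect public posts mentioning a keyword, page by page."""
    posts = []
    for page in range(1, max_pages + 1):
        resp = requests.get(API_URL, params={"q": keyword, "page": page})
        resp.raise_for_status()
        batch = resp.json().get("posts", [])
        if not batch:
            break  # no more results
        posts.extend(batch)
    return posts

if __name__ == "__main__":
    for post in collect_posts("tax shelter")[:5]:
        print(post)
```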

Let's go back for a moment to the journalists who researched tax fraud. After writing a bot to gather the data and a script to analyse the findings, they published a story exposing the names of people and organisations found to hold bank accounts in countries known to be tax shelters. The essence of the story was that these people were fraudsters of the worst kind.
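Stripped of the crawling machinery, that analysis boils down to a membership test. The sketch below, with invented data and jurisdictions, shows the shape of it; note that the filter only establishes where an account is held, never why.

```python
# Naive analysis: flag every name holding an account in a
# tax-shelter jurisdiction. All data here is invented.
TAX_SHELTERS = {"Bahamas", "Liechtenstein", "Cayman Islands"}

accounts = [
    {"holder": "Example Holding Ltd", "country": "Bahamas"},
    {"holder": "Some Ministry", "country": "Liechtenstein"},
    {"holder": "Jane Doe", "country": "Germany"},
]

# The flawed leap: this list says nothing about intent.
suspects = [a["holder"] for a in accounts if a["country"] in TAX_SHELTERS]
print(suspects)  # ['Example Holding Ltd', 'Some Ministry']
```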

After the dust had settled and some time had passed, critics began to question the findings. Some of the names turned out to belong to government institutions that had parked their money in tax shelters for perfectly legal reasons.

The reason this happened has nothing to do with the quality of the data and everything to do with the people using it. The storytellers jumped to conclusions that they couldn't deduce from the simple fact that organisations hold bank accounts in tax-shelter countries. The data they examined was, in essence, a list of names. The reason those names turned up on tax-shelter accounts wasn't in the data.

They simply assumed it was for illegal purposes. In the case of tax shelters, they should have analysed each name further, e.g. by digging into the history of each person, company or institution to see whether they had a track record of tax evasion, or whether they stood to benefit unlawfully from holding a bank account in such countries.
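As a sketch, that follow-up could be as simple as cross-checking each flagged name against a dataset of documented cases before drawing any conclusion. The dataset and names below are invented; in practice this step is mostly human research, not code.

```python
# Verification step: separate what the data proves from what
# still needs reporting. KNOWN_EVASION_CASES is a hypothetical
# dataset of names with a documented history of tax evasion.
KNOWN_EVASION_CASES = {"Example Holding Ltd"}

def classify(holder):
    """Classify a flagged name instead of assuming guilt."""
    if holder in KNOWN_EVASION_CASES:
        return "documented track record: worth reporting"
    return "unexplained: needs further research, not an accusation"

for name in ["Example Holding Ltd", "Some Ministry"]:
    print(name, "->", classify(name))
```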

If you analyse data, you should make sure you can draw linear conclusions from it, that is, conclusions that follow directly from the data. If you can't, you overshoot your objective and introduce errors into your analysis.

This is the danger of using Big Data as a source of knowledge. In marketing, the analysis usually doesn't need to answer the 'why' question; in journalism, that question matters far more. For pure publishing, a blogger or journalist shouldn't attempt to answer the 'why' question unless the data is known to contain the answer.

Big Data poses new challenges, because it requires analysts not only to double-check the facts, but also to stick strictly to the facts that are there in the first place. Anything else is just as valuable as gossip and hearsay.