Source: 6 myths about big data
Advances in cloud computing, data processing speeds, and the huge amount of data input from sources like IoT mean that companies are now collecting previously unseen amounts of data. Big data is now bigger than ever. But organizing, processing, and understanding the data is still a major challenge for many organizations.
Is your company still struggling to understand what big data is, and how to manage it? Here are 6 myths about big data, debunked by experts, to help you separate truth from fiction.
1. Big data means ‘a lot’ of data
Big data is a buzzword these days, but what it really means is often unclear. Some people use the term simply to mean a large amount of data, but that’s not quite correct. Big data refers to how data sets, either structured (like Excel sheets) or unstructured (like metadata from email), combine with data like social media analytics or IoT data to form a bigger story. That story shows trends in what is happening within an organization, trends that are difficult to capture with traditional analytic techniques.
Jim Adler, head of data at Toyota Research Institute, also makes a good point: Data has a mass. “It’s like water: When it’s in a glass, it’s very manageable. But when it’s in a flood, it’s overwhelming,” he said. “Data analysis systems that work on a single machine’s worth of data will be washed away when data scales grow 100 or 1000 times. So, sure, prototype in the small, but architect for the large.”
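As a rough illustration of Adler's "prototype in the small, but architect for the large" advice, here is a minimal Python sketch of a mean computed in a single pass, one record at a time, so the same code handles ten readings or a million without loading them all into memory. The `sensor_readings` generator and its values are hypothetical stand-ins for a real data feed, not anything described by the experts quoted here.

```python
from typing import Iterable, Iterator


def running_mean(values: Iterable[float]) -> float:
    """Compute a mean in one pass without holding every value in memory."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Incremental update: shift the mean toward the new value.
        mean += (value - mean) / count
    if count == 0:
        raise ValueError("no values supplied")
    return mean


def sensor_readings(n: int) -> Iterator[float]:
    """Hypothetical data source that yields readings one at a time."""
    for i in range(n):
        yield float(i % 10)


# The same function serves the small prototype and the larger run.
small = running_mean(sensor_readings(10))
large = running_mean(sensor_readings(1_000_000))
print(small, large)
```

Because the function only ever sees one value at a time, the input could just as well be a file read line by line or a network stream.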
2. The data needs to be clean
“The biggest myth is you have to have clean data to do analysis,” said Arijit Sengupta, CEO of BeyondCore. “Nobody has clean data. This whole crazy idea that I have to clean it to analyze doesn’t work. What you do is, you do a ‘good enough’ analysis. You take your data, despite all the dirtiness, and you analyze it. This shows where you have data quality problems. I can show you some patterns that are perfectly fine despite the data quality problems. Now, you can do focused data quality work to just improve the data to get a slightly better insight.”
Megan Beauchemin, director of business intelligence and analytics for InOutsource, agreed. “Oftentimes, organizations will put these efforts on the back burner, because their data is not clean. This is not necessary. Deploying an analytic application will illuminate, visually, areas of weakness in data,” she said. “Once these shortfalls have been identified, a cleanup plan can be put into place. The analytic application can then utilize a mechanism to highlight clean-up efforts and monitor progress.”
“If your data is not clean, I think that is all the more reason to jump in,” Beauchemin said. “Once you tie that data together, and you’re bringing it to life visually in an application where you’re seeing those associations and you’re seeing the data come together, you’re going to very quickly see shortfalls in your data.” Then, she said, you can see where the data issues lie, offering a benchmark as you clean the data up.
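The "good enough" approach described above might look something like the following Python sketch: analyze whatever rows are usable, and report in the same pass where the data is dirty. The records, field names, and the `profile` helper are all hypothetical, chosen only for illustration.

```python
from statistics import mean

# Hypothetical records with typical quality problems:
# missing values and malformed numbers.
records = [
    {"region": "east", "revenue": "1200"},
    {"region": "west", "revenue": ""},
    {"region": None, "revenue": "950"},
    {"region": "east", "revenue": "n/a"},
    {"region": "west", "revenue": "1100"},
]


def to_number(raw):
    """Return a float, or None when the value is missing or malformed."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def profile(rows, numeric_field, fields):
    """Analyze the usable rows and report the share of bad values per field."""
    parsed = [to_number(row.get(numeric_field)) for row in rows]
    usable = [value for value in parsed if value is not None]

    problem_rate = {}
    for field in fields:
        if field == numeric_field:
            bad = sum(1 for value in parsed if value is None)
        else:
            bad = sum(1 for row in rows if row.get(field) in (None, ""))
        problem_rate[field] = bad / len(rows)

    return {
        "mean": mean(usable) if usable else None,
        "usable_rows": len(usable),
        "problem_rate": problem_rate,
    }


report = profile(records, "revenue", ["region", "revenue"])
print(report)
```

The report gives an answer from the rows that are fine and, alongside it, a per-field problem rate that can serve as the benchmark for focused cleanup work.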
3. Wait to make your data perfect
Here’s another reason you shouldn’t wait to clean up your data: “By the time you’ve cleaned your data, it’s three months old—so you have stale data,” said Sengupta. So, the information is no longer relevant.
Sengupta spoke about a conference where Josh Bartman, from the First Interstate Bank, brought up an important point. “Josh showed how he was running an analysis, finding a problem, changing the analysis, rerunning the analysis. He said, ‘Look, my analyses are only about four to five minutes apart. So, if I can run an analysis, find the problem, fix the problem, rerun the analysis and see the report in four or five minutes later, that changes the nature of how I approach analysis.'”
Sengupta compared it to the old way of coding. “I get everything right, then I code. But now, everybody does agile coding,” he said. “You write something, you test it, you see how you can make it better, then you make it better. The world has changed and people are still acting like it’s the old way of doing things.”
4. The data lake
Data lakes, which are, loosely, storage repositories holding large amounts of raw structured and unstructured data, are frequently referred to in the context of big data.
The only problem is, despite how often they are cited, they don’t exist, Adler said. “An organization’s data isn’t dumped into a data lake. It is carefully curated in a departmental ‘data silo’ that encourages focused expertise. They also provide the accountability and transparency needed for good data governance and compliance.”
5. Analyzing data is expensive
Are you afraid to get started on the data because of the presumed expense involved in data analysis tools? There’s good news for you: With the free data tools available today, anybody can get started with analyzing big data.
Also, according to Sengupta, the low cost of today’s cloud computing means “you can actually do things that were never possible.”
6. Machine algorithms will replace human analysts
Sengupta sees an interesting dichotomy in terms of approaches to analyzing big data. “There’s a split, where on one side there are people who are saying, ‘I’m going to throw thousands of data scientists at that problem.’ Then, there are people who are saying, ‘Machine learning is going to do it all. It’s going to be completely automated,'” he said.
But Sengupta doesn’t think either of those solutions works. “There aren’t enough data scientists, and the cost is going up fast,” he said. “Also, business users have years of domain knowledge and intuition about their business. When you bring a data scientist in and say, ‘That guy’s going to do it and tell you what to do,’ that actually creates the exact wrong kind of friction which prevents adoption of those insights. Data scientists often can’t learn enough about our business to be really smart about the business immediately.”
The “perfect” data scientist, who understands exactly how a specific business works, how its data works, is a myth, said Sengupta. “That person doesn’t exist.”
In reality, Sengupta said, “most data science projects actually don’t get implemented because it’s so hard. It takes months to get done and, by the time it’s done, the question you care about is already too old.”
But there are also problems with relying too heavily on machine learning. “It’s giving me an answer but not an explanation. It’s telling me what to do, but not why I should be doing it,” he said. “People don’t like being told what to do, especially by magical machines.”
The key, he said, is not just the answers—it’s the explanations and recommendations.
On one hand, he said, data scientists will become more and more specialized on the really hard problems. “Think of the time when every department and company started a data processing department and number processing department. Fortune 500 companies had ‘Data Processing Departments’ and ‘Number Processing Departments.’ They basically became Excel, Word, and PowerPoint.”
Still, some people remain experts in data and number processing.
“If I go to Morgan Stanley, believe me, there are still people who are experts in data processing and number processing. They still exist. They have different titles and different jobs but, in really advanced cases, those people still exist. But 80-90% will have moved to Excel, Word, and PowerPoint. That’s how the world, in terms of big data, should evolve.”