Hadoop and its related projects have given the database industry a wave of innovation unlike anything we’ve seen since Relational Database Management Systems (RDBMS) emerged in the late 1970s and early 1980s. I expect RDBMS will continue to dominate the world of transactional systems, where the relational model / Third Normal Form schemas are perfectly suited for the nature of transaction processing. Analytical systems, on the other hand, can clearly benefit from Hadoop with its ability to scale endlessly, which accommodates the ever-expanding volume of data to be analyzed, and its schema-on-read paradigm, which accommodates the unstructured nature of sentiment data along with the inevitable changes to what data is collected and how it’s analyzed. So, once again, I’m a huge proponent of the NoSQL movement and believe that it will ultimately dominate the world of analytical systems.
The problem with Hadoop and other NoSQL databases, however, is that there is no SQL language to use as a common and standard way of working with them. SQL came hand-in-hand with RDBMS and because of its simple, English-like syntax, it was rather trivial to learn. Granted, there were nuances to how each vendor’s optimizer worked that could dramatically affect the time it took for a SQL statement to return results. But it wasn’t long before every vendor replaced their rule-based optimizer with a cost-based optimizer, which essentially made every SQL developer an expert. Thus, with an understanding of SQL, one had the “keys to the kingdom” and the ability to do just about anything with any one of the RDBMS: Oracle, SQL Server, DB2, MySQL, PostgreSQL, etc.
Unlike RDBMS, NoSQL databases have no standard interface (i.e. No SQL). There are multiple frameworks for accessing data in Hadoop, for example. The most common of these are MapReduce (the original, tried-and-true framework), Spark (the current industry favorite for its ability to harness a cluster’s memory the way Hadoop harnesses its disk and CPU), and Solr (for search applications). Furthermore, within each framework, one can choose any number of languages, such as Java, Python, Scala, or R. Of course, there are also SQL-on-Hadoop tools that let you use SQL to access data in Hadoop, many of which also work against other NoSQL databases. But even here, there are a plethora of options to choose from: Hive (endorsed by Hortonworks), Impala (endorsed by Cloudera), and Drill (endorsed by MapR) are popular SQL-on-Hadoop tools; but there’s also Spark SQL, Presto (endorsed by Teradata), BigSQL (endorsed by IBM), HAWQ (endorsed by Pivotal), and Phoenix.
Vendors like Cloudera and Hortonworks, who are most active and most influential in the Open Source Hadoop community, should push for standardization around the frameworks, languages and tools within the Hadoop ecosystem. If standardization cannot be driven through the Apache Software Foundation (ASF), then perhaps it can through the American National Standards Institute (ANSI) or the International Organization for Standardization (ISO), as it was when they “blessed” SQL as a standard in 1986 and 1987, respectively.