The “high-priests” of Big Data have spoken: the Hadoop Distributed File System (HDFS) is now the de facto standard platform for data storage. You may have heard this “heresy” uttered before, but for me, it wasn’t until the recent Strata conference that I began to really understand just how prevalent this opinion actually is.
We are in the midst of a drastic shift in the application development landscape. Developers entering the market today use different tools and follow different patterns.
One of the core patterns of on-line application development today is cloud-scale design. While traditional architectures relied on ever more powerful servers, today that approach simply does not scale. We have reached the point where, in many cases, no server is powerful enough, or its cost would be prohibitive. And given their unpredictable usage patterns, today’s on-line applications must also be flexible enough to absorb demand spikes while still providing efficient service during periods of low utilization.
Over the last decade, access to best-of-breed data technologies has become easier. This is due mainly to the increasing popularity of open source software (OSS). While this phenomenon holds true in other areas like operating systems, application servers, development frameworks, and even monitoring tools, it is perhaps most prevalent in the area of data.
Over eight months ago, I joined Intel to work on their next-generation data analytics platform. In large part, my decision was based on Intel’s desire to address the “voodoo magic” in Big Data: the complexity that demands deep technical skills and prevents domain experts from gaining access to large volumes of data. The idea was that by leveraging the distributed data processing capabilities of Apache Hadoop and combining them with Intel’s breadth of infrastructure experience, we could make Big Data analytics more accessible and therefore more prevalent.
Last week I had a chance to attend the 3rd AWS re:Invent conference in Vegas. I’m not a big fan of that city myself, but, as in previous years, re:Invent did not disappoint. Much coverage has already been dedicated to the newly introduced services, so I won’t bore you with that. Instead, I want to share a few higher-level thoughts I captured at the event.
After a pretty positive experience with InfluxDB, I wanted to create a super simple telemetry producer (this one in Node.js) to spotlight a few of the time series query types that InfluxDB supports. (Source code is available on GitHub.)
To get live data for this demo, I created a simple script that generates metric data for CPU utilization and free memory on your local machine at 1-second resolution.
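For reference, here is a minimal sketch of what such a producer could look like, using only Node’s built-in `http` and `os` modules. This is not the code from the GitHub repo; the database name (`telemetry`), the measurement names, and the use of InfluxDB’s HTTP line-protocol write endpoint on the default port are illustrative assumptions.

```javascript
// Minimal sketch of a telemetry producer (not the original GitHub source).
// Emits one CPU and one memory data point per second to a local InfluxDB
// instance via its HTTP line-protocol write endpoint. Database name
// ("telemetry") and measurement names are assumptions for illustration.
const http = require('http');
const os = require('os');

// Rough CPU utilization estimate: 1-minute load average over core count.
// (Note: os.loadavg() reports zeros on Windows.)
function cpuUtilization() {
  return Math.min(os.loadavg()[0] / os.cpus().length, 1);
}

function writePoints() {
  // InfluxDB line protocol: measurement,tag=value field=value
  const host = os.hostname();
  const body =
    `cpu_utilization,host=${host} value=${cpuUtilization()}\n` +
    `free_memory,host=${host} value=${os.freemem()}i`;

  const req = http.request({
    host: 'localhost',
    port: 8086, // InfluxDB's default HTTP port
    path: '/write?db=telemetry',
    method: 'POST',
  }, (res) => res.resume()); // drain the response
  req.on('error', (err) => console.error('write failed:', err.message));
  req.end(body);
}

// 1-second resolution, as described above.
setInterval(writePoints, 1000);
```

Once points are flowing, a query such as `SELECT MEAN(value) FROM cpu_utilization WHERE time > now() - 1h GROUP BY time(1m)` illustrates the kind of time-bucketed aggregation referred to above.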
It seems like every week we hear about yet another new open source machine/deep learning library or analytical framework.
Talking to people at Strata this week only confirmed for me that, in the midst of what can only be described as a virtual gluttony of open-source software, there is a massive number of organizations that find it increasingly hard to implement these technologies. Even the task of identifying the right solution can overwhelm many, and it results in a tailspin of endless use-case and feature comparisons.
As a long-term cloud storage user, I recently wanted to re-evaluate my options. New content management providers have become available, and I wanted to make sure I wasn’t missing out on the new shiny tech out there.
As I was considering the pros and cons of each option, I realized how much my personal attitude towards cloud data storage has shifted over the last few years. My concerns used to be solely about security. Now, while data security is still critical, I am much more interested in data access, ownership, integration, and control.
About a year and a half ago, I wrote about Big Data Opportunities, focusing primarily on Leveraging Unused Data, Driving New Value From Summary Data, and Harnessing Heterogeneous Data Sensors (more recently known as the Internet of Things).
Since that post, the data space has exploded with numerous solutions addressing many of these areas. While these solutions are mostly based on batch operations and limited to serial MapReduce jobs against frequently off-line, inadequately secured Hadoop clusters, they do allow access to previously inaccessible data.
In general, Platform as a Service (PaaS) is developed by developers for developers. Of course they’re going to love it.
It enables them to focus on the nuances of their applications, not on the pointless day-to-day activities that so often take their time away from solving real problems.