A collection of news and views from authors at Red Hat

Daniel Thompson


Big Data in Theory

What is it? It’s big data. Right?

I’m not sure if I like the term Big Data. I think it’s right up there with the term Cloud.

I do, however, like the framework created by Doug Laney: Volume, Velocity, and Variety. It’s the de facto description of Big Data, and it predates the Big Data phenomenon. That, and I like both alliteration and the KISS principle. Who doesn’t?

Here is my interpretation of the 3Vs, albeit a short one.

Volume – More data.
Velocity – Data (in), faster. Information (out), faster.
Variety – More data sources and / or formats.

What about the Flying V?

Thinking about the 3Vs reminded me of the Flying V.

Then it occurred to me…

The Flying V worked in The Mighty Ducks. Yes, I watched The Mighty Ducks. It did not work in D2. Yes, I watched the sequel. No, I did not watch D3. I can only hope that it did not do to The Mighty Ducks what Alien 3 did to Alien.

Update: It’s come to my attention that not everyone has seen The Mighty Ducks. The Ducks are a youth ice hockey team. I’ve been told that ice hockey is not the only hockey. Really? The Flying V is their trick play. That it worked in the first film but not in the sequel is like how the option offense works in college football (NCAA) but does not work in professional football (NFL).

The 3Vs are a valid description of Big Data in theory, but they are not a valid description of Big Data in practice. Perhaps it is because they state the obvious, hint at the problem, and do not mention the solution.

Big Data in Practice

Volume

Volume is addressed with distributed storage using a shared nothing architecture on commodity hardware.

Examples

  • Distributed File System – Red Hat Storage, Hadoop Distributed File System
  • NoSQL – MongoDB
  • In-Memory Data Grid – JBoss Data Grid
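
To make "shared nothing" a little more concrete, here is a minimal sketch in Python. It assumes the simplest possible placement rule (hash the key, take it modulo the number of nodes), and the Node and Cluster classes are hypothetical names for illustration only. The products listed above use more elaborate placement and replication schemes.

```python
import hashlib


class Node:
    """One storage node. It owns its data and shares nothing with its peers."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Hypothetical cluster that spreads keys across nodes by hash."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def owner(self, key):
        # A stable hash, so every client picks the same node for a given key.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self.owner(key).data[key] = value

    def get(self, key):
        return self.owner(key).data.get(key)


cluster = Cluster(["node-a", "node-b", "node-c"])
for i in range(1000):
    cluster.put(f"event-{i}", {"id": i})

for node in cluster.nodes:
    print(node.name, len(node.data))
```

Adding capacity means adding nodes, not buying a bigger machine. The catch with plain modulo placement is that changing the node count moves most keys, which is one reason real systems do something smarter.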

Velocity

Outgoing information is generated faster with parallel processing in the form of batch processing (e.g. map / reduce), near real-time processing (e.g. distributed tasks), and real-time processing (e.g. stream processing).

Examples

  • Map / Reduce Tasks – JBoss Data Grid, NoSQL, Hadoop MapReduce
  • Distributed Tasks – JBoss Data Grid
  • Stream Processing – Storm / S4
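
The batch case can be sketched as a word count in Python. This is a minimal sketch under a big assumption: threads on one machine stand in for the nodes of a cluster, and map_chunk, merge, and word_count are hypothetical names, not the API of any product listed above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, independently of every other chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(lines, workers=4):
    # Split the input into one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())


lines = ["volume velocity variety", "velocity variety", "variety"]
totals = word_count(lines)
print(totals)
```

The map step never looks outside its own chunk, which is what lets it run in parallel. Only the small partial counts need to be brought together at the end.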

Data Locality

Volume and velocity are often two sides of the same coin. Incoming data is stored faster using distributed storage. Outgoing information is generated faster with parallel processing, and that processing is often done in conjunction with distributed storage via data locality: the parallel processes are executed on the distributed storage nodes, so the computation moves to the data instead of the data moving across the network.

Examples

  • Apache Hadoop (HDFS + MapReduce)
  • JBoss Data Grid
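
Data locality can be sketched in a few lines of Python. This is a toy under stated assumptions: StorageNode and run_local are hypothetical names, and "sending a task to a node" is just a function call here, where a real system would ship it over the network.

```python
class StorageNode:
    """Hypothetical node that both stores records and runs tasks on them."""

    def __init__(self, name):
        self.name = name
        self.records = []  # the slice of the data set stored on this node

    def run_local(self, task):
        # The task travels to the data. Only its (small) result travels back.
        return task(self.records)


# Spread the numbers 1..100 across three nodes.
nodes = [StorageNode(f"node-{i}") for i in range(3)]
for value in range(1, 101):
    nodes[value % len(nodes)].records.append(value)


def local_sum(records):
    """Task executed on each node against its own records."""
    return sum(records)


partial_sums = [node.run_local(local_sum) for node in nodes]
total = sum(partial_sums)
print(partial_sums, total)
```

One hundred records stay where they are, and three numbers cross the wire. Scale the records up by a few orders of magnitude and the appeal is obvious.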

Variety

Variety is addressed with NoSQL for structured / semi-structured data and distributed file systems for unstructured data.

Examples

  • Key / Value Store – JBoss Data Grid
  • Document Store – MongoDB
  • Column Oriented Store – Apache HBase (Hadoop)
  • Hierarchical Store – ModeShape
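
To see how the four store types differ, here is the same made-up record shaped four ways using plain Python dictionaries. This is only a sketch of the data models. It assumes nothing about the actual APIs of the products above, and the record itself is invented for illustration.

```python
import json

# Key / value: the store sees one key and one opaque value.
key_value = {"user:42": json.dumps({"name": "Ada", "city": "London"})}

# Document: nested and self-describing, so individual fields can be queried.
document = {"_id": 42, "name": "Ada", "address": {"city": "London"}}

# Column oriented: row key -> column family -> column -> value.
column_oriented = {
    "42": {
        "profile": {"name": "Ada"},
        "address": {"city": "London"},
    }
}

# Hierarchical: nodes addressed by path, each with its own properties.
hierarchical = {
    "/users/42": {"name": "Ada"},
    "/users/42/address": {"city": "London"},
}

# Reading the city back out of each shape.
cities = [
    json.loads(key_value["user:42"])["city"],
    document["address"]["city"],
    column_oriented["42"]["address"]["city"],
    hierarchical["/users/42/address"]["city"],
]
print(cities)
```

The data is the same every time. What changes is how much of its structure the store understands, and therefore what it can index and query for you.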

Additional Thoughts

It’s true. I liked The Mighty Ducks. I was a kid. That being said, it’s not The Goonies. If The Goonies is on television, I watch it for the nth time. If The Mighty Ducks is on television, I put in Serenity (BD) and watch it for the nth time.

Alien and Aliens are two of the greatest films ever. Period.




I curate the content on this page, but the credit goes to my talented colleagues for the posts that you see here. Much of what you read on this page is the work of friends at How to JBoss, and I encourage you to drop by the site at http://www.howtojboss.com for some of the best JBoss technical and non-technical content for developers, architects and technology executives on the Web.