How big is big data?


#1

I’d like to kick the discussion off. How big is big data?

I mean GB? TB? PB?

Is it structured, unstructured or both or something else?

How do people process big data? Is HDFS a common storage pool or is it overkill in most places?


#2

I studied some theory, read some journal papers and wrote a few essays about big data at university, and I think you can describe big data in terms of the Three V’s: Volume, Velocity and Variety.

I believe the common conception is that big data is probably moving towards the petabyte scale. I like to think of big data as data on a scale that is currently impossible to host on a single computer/server.

The additional thing to consider is that big data isn’t just a matter of volume. You also need to consider the velocity at which data is being generated. You can have all the hard drive space in the world, but if you’re receiving data faster than your algorithms can analyse it, the unprocessed backlog keeps growing and you’ll ultimately run into disk space issues further down the line.
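To put rough numbers on the velocity point, here is a minimal Python sketch. It assumes constant ingest and processing rates and that data is deleted once it has been analysed; the function names and the figures in the usage note are made up for illustration.

```python
def backlog_after(hours, ingest_gb_per_hour, process_gb_per_hour, start_backlog_gb=0.0):
    """Unprocessed data (in GB) after `hours`, assuming constant rates."""
    growth = (ingest_gb_per_hour - process_gb_per_hour) * hours
    # A backlog cannot be negative: if processing is faster, it drains to zero.
    return max(0.0, start_backlog_gb + growth)


def hours_until_full(disk_gb, ingest_gb_per_hour, process_gb_per_hour):
    """Hours until the backlog fills the disk, or None if processing keeps up."""
    net_gb_per_hour = ingest_gb_per_hour - process_gb_per_hour
    if net_gb_per_hour <= 0:
        return None  # analysis keeps pace with ingest, so the disk never fills
    return disk_gb / net_gb_per_hour
```

For example, ingesting 50 GB/hour while analysing 40 GB/hour leaves a 10 GB/hour surplus, so `hours_until_full(1000, 50, 40)` gives 100 hours for a 1 TB disk. Adding disk only pushes that date back; it does not fix the rate mismatch.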

In terms of the structure of the data, it can be either or both! Variety refers to the fact that big data streams do not tend to produce data in a homogeneous format. Different data types require different approaches for analysis. Data quality is another source of variety that should be considered.
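As a rough illustration of variety, the sketch below normalises records that arrive in different formats into plain Python dicts before analysis. The `kind` labels and the `parse_record` helper are hypothetical, and a real pipeline would need far more robust handling of malformed or low-quality input.

```python
import csv
import io
import json


def parse_record(kind, payload):
    """Turn one incoming record into a dict, choosing a parser by format."""
    if kind == "json":
        return json.loads(payload)
    if kind == "csv":
        # Expects a header row followed by a single data row.
        reader = csv.DictReader(io.StringIO(payload))
        return next(reader)
    if kind == "text":
        # Free text has no fields to extract, so keep it whole.
        return {"text": payload}
    raise ValueError(f"unsupported format: {kind}")
```

Each format needs its own parsing step, and note that the CSV values come back as strings, so type handling is yet another thing that differs between sources.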


#3

Good points @ryan. I’m generally of the opinion that Big Data is a pretty rubbish name, as it’s generally about unstructured data rather than petabyte scale. If you can chuck your files into an HDFS cluster and crunch the data that way, it makes sense, but it doesn’t have to be TBs in size.

Similarly, Uber and co use MySQL at huge scale. Twitter also used to; I don’t know if they still do or not, but I’m guessing that’s more than 100 rows in a table…


#4

The term “Big Data” is often very vague, with multiple definitions in circulation. Big data can be analyzed for insights that lead to better decisions and strategic business moves. Before we get into my take on the topic, it is important to mention the difference between structured and unstructured data.

Structured data is the term used to describe data that fits neatly into relational databases. Think about a university database system holding data on its students, such as Name, Student Number, DOB, Course Code and Year of Study, stored in a traditional row and column table. Data of this type is relatively easy to enter, store, query and analyse once it has been defined in terms of its type and field name.
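The student example can be sketched with Python’s built-in `sqlite3` module. The table name, column types and sample rows here are invented for illustration and are not taken from any real system.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE students (
        student_number INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        dob            TEXT NOT NULL,   -- ISO date string, e.g. 2001-04-12
        course_code    TEXT NOT NULL,
        year_of_study  INTEGER NOT NULL
    )
    """
)

# Made-up sample rows: every record has the same typed fields.
rows = [
    (1001, "Alice Example", "2001-04-12", "CS101", 2),
    (1002, "Bob Example", "2002-09-30", "MA201", 1),
    (1003, "Carol Example", "2001-01-05", "CS101", 2),
]
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)

# Because the fields are defined up front, querying is a one-liner.
second_years = conn.execute(
    "SELECT name FROM students WHERE year_of_study = ? ORDER BY name", (2,)
).fetchall()
```

Once the fields and types are declared, entering and querying the data is straightforward, which is exactly what makes this kind of data “structured”.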

On the other hand, although unstructured data may have its own internal structure, it does not fit neatly into a spreadsheet or database the way structured data does. The fundamental challenge of unstructured data sources is that they are difficult for nontechnical business users and data analysts alike to unbox, understand, and prepare for analytic use.

SAS®, a provider of an integrated system of software solutions, defines “Big Data” as follows:

“Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.”

Other definitions talk about the three V’s of Big Data.

  1. Volume – The absolute size of the data collected by an organisation is not what matters, whether it be GBs, TBs or even PBs and beyond. A large amount of data for a small company might be a tiny amount for a huge corporation like Amazon.com. It all depends on how you define it. Nonetheless, storing these large amounts of data used to be difficult, and has only been made easier in recent years by software such as Hadoop.

  2. Velocity – Data is being produced faster than ever before and therefore must be dealt with in a timely manner.

  3. Variety – The data being collected comes in an ever-growing range of formats, both structured and unstructured.