There’s data, and then there’s big data. So, what’s the difference?
Big data defined
A clear big data definition can be difficult to pin down because big data can cover a multitude of use cases. But in general the term refers to sets of data that are so large in volume and so complex that traditional data processing software products are not capable of capturing, managing, and processing the data within a reasonable amount of time.
These big data sets can include structured, unstructured, and semistructured data, each of which can be mined for insights.
How much data actually constitutes “big” is open to debate, but it is typically measured in multiples of petabytes, and in exabytes for the largest projects.
Often, big data is characterized by the three Vs:
- an extreme volume of data
- a broad variety of types of data
- the velocity at which the data needs to be processed and analyzed
The data that constitutes big data stores can come from sources that include web sites, social media, desktop and mobile apps, scientific experiments, and—increasingly—sensors and other devices in the internet of things (IoT).
The concept of big data comes with a set of related components that enable organizations to put the data to practical use and solve a number of business problems. These include the IT infrastructure needed to support big data technologies, the analytics applied to the data, the big data platforms needed for projects, related skill sets, and the actual use cases that make sense for big data.
What is data analytics?
What really delivers value from all the big data that organizations are gathering is the analytics applied to it. Without analytics, which involves examining the data to discover patterns, correlations, insights, and trends, the data is just a bunch of ones and zeros with limited business use.
By applying analytics to big data, companies can see benefits such as increased sales, improved customer service, greater efficiency, and an overall boost in competitiveness.
Data analytics involves examining data sets to gain insights or draw conclusions about what they contain, such as trends and predictions about future activity.
By analyzing information using big data analysis tools, organizations can make better-informed business decisions such as when and where to run a marketing campaign or introduce a new product or service.
Analytics can refer to basic business intelligence applications or more advanced, predictive analytics such as those used by scientific organizations. Among the most advanced types of data analytics is data mining, where analysts evaluate large data sets to identify relationships, patterns, and trends.
Data analytics can include exploratory data analysis (to identify patterns and relationships in data) and confirmatory data analysis (applying statistical techniques to find out whether an assumption about a particular data set is true).
Another distinction is quantitative data analysis (or analysis of numerical data that has quantifiable variables that can be compared statistically) vs. qualitative data analysis (which focuses on nonnumerical data such as video, images, and text).
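To make the exploratory and confirmatory steps concrete, here is a minimal sketch in Python using only the standard library. The sales figures, the `summarize` helper, and the `welch_t` helper are all hypothetical, invented for illustration. The exploratory step summarizes two samples to look for a pattern. The confirmatory step computes Welch's t statistic to check the assumption that the two means differ.

```python
import math
import statistics

# Hypothetical data: daily unit sales before and after a marketing campaign.
before = [102, 98, 110, 95, 105, 99, 101, 97]
after = [108, 112, 104, 115, 109, 111, 107, 113]


def summarize(sample):
    """Exploratory step: describe a sample so patterns become visible."""
    return {
        "mean": statistics.mean(sample),
        "stdev": statistics.stdev(sample),
    }


def welch_t(a, b):
    """Confirmatory step: Welch's t statistic for a difference in means."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(b) - statistics.mean(a)) / standard_error


print(summarize(before))
print(summarize(after))

t_stat = welch_t(before, after)
print(f"Welch t statistic: {t_stat:.2f}")
```

A t statistic far from zero suggests the difference is unlikely to be noise. Turning it into a p-value requires a t distribution, which a statistics package would normally supply.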
IT infrastructure to support big data
For the concept of big data to work, organizations need to have the infrastructure in place to gather and house the data, provide access to it, and secure the information while it’s in storage and in transit. This requires the deployment of big data analytics tools.
At a high level, these include storage systems and servers designed for big data, data management and integration software, business intelligence and data analytics software, and big data applications.
Much of this infrastructure will likely be on-premises, as companies look to continue leveraging their datacenter investments. But increasingly organizations rely on cloud computing services to handle much of their big data requirements.
Data collection requires having sources to gather the data. Many of these—such as web applications, social media channels, mobile apps, and email archives—are already in place. But as IoT becomes entrenched, companies might need to deploy sensors on all sorts of devices, vehicles, and products to gather data, as well as new applications that generate user data. (IoT-oriented big data analytics has its own specialized techniques and tools.)
To store all the incoming data, organizations need to have adequate data storage in place. Among the storage options are traditional data warehouses, data lakes, and cloud-based storage.
Security infrastructure tools might include data encryption, user authentication and other access controls, monitoring systems, firewalls, enterprise mobility management, and other products to protect systems and data.
Big data technologies
In addition to the foregoing IT infrastructure used for data in general, there are several technologies specific to big data that your IT infrastructure should support.
Hadoop ecosystem
Hadoop is one of the technologies most closely associated with big data. The Apache Hadoop project develops open source software for scalable, distributed computing.
The Hadoop software library is a framework that enables the distributed processing of large data sets across clusters of computers using simple programming models. It’s designed to scale up from a single server to thousands, each offering local computation and storage.
The project includes several modules:
- Hadoop Common, the common utilities that support other Hadoop modules
- Hadoop Distributed File System, which provides high-throughput access to application data
- Hadoop YARN, a framework for job scheduling and cluster resource management
- Hadoop MapReduce, a YARN-based system for parallel processing of large data sets
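The MapReduce programming model can be illustrated with a word count that runs in a single Python process. This is a sketch of the idea only and does not use the Hadoop API. The function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical. In a real cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    pairs = []
    for document in documents:  # each document could be mapped on a different node
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["big data is big", "data lakes hold raw data"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the work can be split across machines without changing the result.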
Apache Spark
Part of the Hadoop ecosystem, Apache Spark is an open source cluster-computing framework that serves as an engine for processing big data within Hadoop. Spark has become one of the key big data distributed processing frameworks, and can be deployed in a variety of ways. It provides native bindings for the Java, Scala, Python (especially the Anaconda Python distro), and R programming languages (R is especially well suited for big data), and it supports SQL, streaming data, machine learning, and graph processing.
Data lakes
Data lakes are storage repositories that hold extremely large volumes of raw data in its native format until the data is needed by business users. Helping to fuel the growth of data lakes are digital transformation initiatives and the growth of the IoT. Data lakes are designed to make it easier for users to access vast amounts of data when the need arises.
NoSQL databases
Conventional SQL databases are designed for reliable transactions and ad hoc queries, but they come with restrictions such as rigid schemas that make them less suitable for some types of applications. NoSQL databases address those limitations, and store and manage data in ways that allow for high operational speed and great flexibility. Many were developed by companies that sought better ways to store content or process data for massive websites. Unlike SQL databases, many NoSQL databases can be scaled horizontally across hundreds or thousands of servers.
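The two properties described above, flexible schemas and horizontal scaling, can be sketched with a toy key-value store. The `ShardedStore` class is hypothetical and stands in for no particular product. Each "server" is just an in-process dictionary, and keys are assigned to shards by hashing.

```python
import hashlib


class ShardedStore:
    """Toy document store that spreads keys across several shards."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate server.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash keeps a given key on the same shard across runs.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, document):
        self._shard_for(key)[key] = document

    def get(self, key):
        return self._shard_for(key).get(key)


store = ShardedStore(shard_count=4)

# Documents need not share the same fields: there is no fixed schema.
store.put("user:1", {"name": "Ada", "email": "ada@example.com"})
store.put("user:2", {"name": "Lin", "devices": ["phone", "sensor-17"]})

print(store.get("user:2"))
```

Simple modulo hashing like this reshuffles most keys whenever the shard count changes, so production systems often use more elaborate schemes such as consistent hashing.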
In-memory databases
An in-memory database (IMDB) is a database management system that primarily relies on main memory, rather than disk, for data storage. In-memory databases are faster than disk-optimized databases, an important consideration for big data analytics uses and the creation of data warehouses and data marts.
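As a small illustration of keeping data in main memory, Python's built-in `sqlite3` module can open a database that lives entirely in RAM by connecting to `":memory:"`. SQLite is used here only as a convenient stand-in for a dedicated in-memory database product. The `sales` table and its rows are hypothetical.

```python
import sqlite3

# ":memory:" tells SQLite to keep the whole database in RAM, not on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.5)],
)

# Aggregate queries run against memory-resident data.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
print(totals)

# The data disappears once the connection is closed.
conn.close()
```

The trade-off is durability: anything held only in memory is lost when the process ends unless it is persisted elsewhere.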
Big data skills
Big data and big data analytics endeavors require specific skills, whether they come from inside the organization or through outside experts.
Many of these skills are related to the key big data technology components, such as Hadoop, Spark, NoSQL databases, in-memory databases, and analytics software.
Others are specific to disciplines such as data science, data mining, statistical and quantitative analysis, data visualization, general-purpose programming, and data structures and algorithms. There is also a need for people with overall management skills to see big data projects through to completion.
Given how common big data analytics projects have become and the shortage of people with these types of skills, finding experienced professionals might be one of the biggest challenges for organizations.
Big data analytics use cases
Big data and analytics can be applied to many business problems and use cases. Here are a few examples:
- Customer analytics. Companies can examine customer data to enhance customer experience, improve conversion rates, and increase retention.
- Operational analytics. Improving operational performance and making better use of corporate assets are the goals of many companies. Big data analytics tools can help businesses find ways to operate more efficiently and improve performance.
- Fraud prevention. Big data tools and analysis can help organizations identify suspicious activity and patterns that might indicate fraudulent behavior and help mitigate risks.
- Price optimization. Companies can use big data analytics to optimize the prices they charge for products and services, helping to boost revenue.
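As one illustration of the fraud prevention use case above, a very simple way to surface suspicious activity is to flag transactions that sit unusually far from the average. The sketch below uses a z-score rule. The `flag_suspicious` function, the threshold of 2.0, and the transaction amounts are all hypothetical. Real fraud systems rely on far richer features and models.

```python
import statistics


def flag_suspicious(amounts, threshold=2.0):
    """Return amounts more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    if stdev == 0:
        return []  # all amounts identical, so nothing stands out
    return [a for a in amounts if abs(a - mean) / stdev > threshold]


# Hypothetical card transactions for one customer.
transactions = [42.0, 38.5, 51.0, 45.2, 40.1, 39.9, 47.3, 980.0, 44.0, 43.6]

suspicious = flag_suspicious(transactions)
print(suspicious)
```

A rule like this is sensitive to the outliers it is trying to find, because a single large value inflates both the mean and the standard deviation. Median-based measures are a common, more robust alternative.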