What Is Big Data? What Issues Are Companies Facing, and How Can They Be Resolved Using Big Data?

Yagyandatta Murmu
11 min read · Sep 17, 2020

Some of the commonly faced issues include inadequate knowledge about the technologies involved, data privacy concerns, and inadequate analytical capabilities within organizations. Many enterprises also face a shortage of skills for dealing with Big Data technologies. In this post, I will try my best to explain Big Data in the simplest way I can.

Data Definition And Meaning!

What is Data?

Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum is a single value of a single variable.

Big Data

What Is Big Data?

Big data is a term that refers to the large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered. It has become a domain in the IT sector, consisting of the technologies that deal with the large amounts of data generated every second.

Big data is not a technology in itself but a problem that has arisen from the huge amount of data generated in today’s world.

How Big Data Works

Big data can be categorized as unstructured or structured. Structured data consists of information already managed by the organization in databases and spreadsheets; it is frequently numeric in nature. Unstructured data is information that is unorganized and does not fall into a pre-determined model or format. It includes data gathered from social media sources, which help institutions gather information on customer needs.
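
To make the distinction concrete, below is a minimal Python sketch (the field names and sample records are made up for illustration) contrasting a structured record, which fits a fixed schema, with an unstructured social media post, from which useful signals must first be extracted:

    import csv
    import io
    import re

    # Structured data: rows conforming to a known schema, ready for
    # direct querying and aggregation.
    table = io.StringIO("customer_id,purchase_amount\n1001,59.90\n1002,120.00\n")
    total = sum(float(row["purchase_amount"]) for row in csv.DictReader(table))
    print(f"Total purchases: {total}")  # Total purchases: 179.9

    # Unstructured data: free-form text with no predetermined model.
    # Useful signals (here, a product mention) must be mined out.
    post = "Loving my new QST 92 skis, best purchase this winter!"
    print(re.findall(r"QST \d+", post))  # ['QST 92']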

Three Vs traditionally characterize big data: the volume (amount) of data, the velocity (speed) at which it is collected, and the variety of the info.

Big data can be collected from publicly shared comments on social networks and websites, voluntarily gathered from personal electronics and apps, through questionnaires, product purchases, and electronic check-ins. The presence of sensors and other inputs in smart devices allows for data to be gathered across a broad spectrum of situations and circumstances.

Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is generated mainly through photo and video uploads, message exchanges, comments, and so on.

Advantages and Disadvantages of Big Data

In general, having more data on one’s customers should allow companies to better tailor their products and marketing efforts in order to create the highest level of satisfaction and repeat business. Companies that are able to collect a large amount of data are provided with the opportunity to conduct deeper and richer analysis.

While better analysis is a positive, big data can also create overload and noise. Companies have to be able to handle larger volumes of data while determining which data represents signal and which represents noise. Determining what makes the data relevant becomes a key factor.

  • Big data is a great quantity of diverse information that arrives in increasing volumes and with ever-higher velocity.
  • Big data can be structured (often numeric, easily formatted, and stored) or unstructured (more free-form, less quantifiable).
  • Nearly every department in a company can utilize findings from big data analysis, but handling its clutter and noise can pose problems.

Characteristics Of Big Data

Big data can contain different kinds of information, such as text, video, financial data, and logs, as well as secure or insecure information. These different sources of information should therefore not all be treated the same way, and any method of classifying them should take the following factors into consideration.

Volume

The most obvious characteristic is where we’ll start. Big data is about volume: volumes of data that can reach unprecedented heights. It’s estimated that 2.5 quintillion bytes of data are created each day, and that, as a result, 40 zettabytes of data will have been created by 2020, an increase of roughly 300 times over 2005. It is now not uncommon for large companies to have terabytes, and even petabytes, of data in storage devices and on servers. This data helps to shape the future of a company and its actions, all while tracking progress.
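
To put those units in perspective, here is a quick back-of-the-envelope check in Python (the figures are the estimates quoted above, not measurements):

    # 2.5 quintillion bytes per day, expressed in exabytes (10**18 bytes).
    daily_bytes = 2.5e18
    print(f"Daily volume: {daily_bytes / 1e18:.1f} exabytes")  # 2.5 exabytes

    # 40 zettabytes (10**21 bytes) accumulated by 2020.
    total_2020 = 40e21
    print(f"Total by 2020: {total_2020 / 1e21:.0f} zettabytes")  # 40 zettabytes

    # A 300x increase since 2005 implies roughly 0.13 zettabytes back then.
    print(f"Implied 2005 total: {total_2020 / 300 / 1e21:.2f} zettabytes")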

Velocity

The growth of data, and its resulting importance, has changed the way we see data. There was once a time when we didn’t see the importance of data in the corporate world, but with the change in how we gather it, we’ve come to rely on it day to day. Velocity essentially measures how fast the data is coming in. Some data will arrive in real time, whereas other data will come in fits and starts, sent to us in batches. And as not all platforms will experience the incoming data at the same pace, it’s important not to generalize, discount, or jump to conclusions without having all the facts and figures.
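
The difference between real-time and batch arrival can be sketched in a few lines of Python (the event source here is simulated, and the batch size of 4 is an arbitrary assumption):

    from itertools import islice

    def event_stream():
        """Simulated source emitting events one at a time."""
        for i in range(10):
            yield {"event_id": i, "value": i * 1.5}

    # Real-time (streaming) style: handle each event as it arrives.
    total = 0.0
    for event in event_stream():
        total += event["value"]  # update a running aggregate immediately

    # Batch style: accumulate events and process them in groups.
    stream = event_stream()
    while batch := list(islice(stream, 4)):
        print(f"Processing batch of {len(batch)} events")  # 4, 4, then 2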

Variety

Data was once collected from one place and delivered in one format. Where it once took the shape of database files such as Excel, CSV, and Access files, it is now presented in non-traditional forms, like video, text, PDF, and graphics on social media, as well as via technology such as wearable devices. Although this data is extremely useful to us, it creates more work and requires more analytical skill to decipher the incoming data, make it manageable, and get it to work.

Big Data is much more than simply ‘lots of data’. It is a way of providing opportunities to utilize new and existing data, and of discovering fresh ways of capturing future data, to really make a difference to business operations and make them more agile.

Big data challenges and ways to solve them

1. Insufficient understanding and acceptance of big data:

Oftentimes, companies fail to know even the basics: what big data actually is, what its benefits are, what infrastructure is needed, etc. Without a clear understanding, a big data adoption project risks being doomed to failure. Companies may waste a lot of time and resources on things they don’t even know how to use.

And if employees don’t understand big data’s value and don’t want to change the existing processes for the sake of its adoption, they can resist it and impede the company’s progress.

Solution:

Big data, being a huge change for a company, should be accepted by top management first, and then down the ladder. To ensure big data understanding and acceptance at all levels, IT departments need to organize numerous trainings and workshops.

To further encourage acceptance, the implementation and use of the new big data solution need to be monitored and controlled. However, top management should not overdo the control, as it may have an adverse effect.

2. Confusing variety of big data technologies:

It can be easy to get lost in the variety of big data technologies now available on the market. Do you need Spark, or would the speeds of Hadoop MapReduce be enough? Is it better to store data in Cassandra or HBase? Finding the answers can be tricky. And it’s even easier to choose poorly if you are exploring the ocean of technological opportunities without a clear view of what you need.

Solution:

If you are new to the world of big data, seeking professional help is the right way to go. You could hire an expert or turn to a vendor for big data consulting. In both cases, with joint efforts, you’ll be able to work out a strategy and, based on that, choose the needed technology stack.

What is Hadoop?

Apache Hadoop is an open source software framework used to develop data processing applications which are executed in a distributed computing environment.

Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and are mainly useful for achieving greater computational power at low cost.

Similar to data residing in the local file system of a personal computer, in Hadoop, data resides in a distributed file system, called the Hadoop Distributed File System (HDFS). The processing model is based on the ‘Data Locality’ concept, wherein computational logic is sent to the cluster nodes (servers) containing the data. This computational logic is nothing but a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in HDFS.

  1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications which are run on Hadoop. These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes.
  2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
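
To make the MapReduce model concrete, here is a minimal word-count sketch in Python, written in the map/shuffle/reduce style that frameworks like Hadoop distribute across a cluster (the sample input is made up, and the shuffle is simulated locally):

    from collections import defaultdict

    def mapper(lines):
        """Map step: emit a (word, 1) pair for every word in the input."""
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        """Reduce step: sum the counts for each word."""
        counts = defaultdict(int)
        for word, count in pairs:  # on a cluster, pairs arrive grouped by key
            counts[word] += count
        return dict(counts)

    # Local simulation of the map -> shuffle -> reduce pipeline. On a real
    # cluster, Hadoop runs these steps on the nodes that hold the data
    # (the 'Data Locality' concept described above).
    sample = ["big data is big", "data moves fast"]
    print(reducer(mapper(sample)))
    # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}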

Hadoop Architecture

Master-Slave Model

Hadoop has a Master-Slave Architecture for data storage and distributed data processing using MapReduce and HDFS methods.

NameNode:

The NameNode keeps track of every file and directory used in the HDFS namespace.

DataNode:

The DataNode helps you manage the state of an HDFS node and allows you to interact with its data blocks.

MasterNode:

The master node allows you to conduct parallel processing of data using Hadoop MapReduce.

Slave node:

The slave nodes are the additional machines in the Hadoop cluster that allow you to store data and conduct complex calculations. Each slave node comes with a TaskTracker and a DataNode, which allow it to synchronize its processes with the JobTracker and the NameNode, respectively.

3. Paying loads of money

Big data adoption projects entail lots of expenses. If you opt for an on-premises solution, you’ll have to mind the costs of new hardware, new hires (administrators and developers), electricity and so on. Plus: although the needed frameworks are open-source, you’ll still need to pay for the development, setup, configuration and maintenance of new software.

If you decide on a cloud-based big data solution, you’ll still need to hire staff (as above) and pay for cloud services, big data solution development as well as setup and maintenance of needed frameworks.

Moreover, in both cases, you’ll need to allow for future expansions to avoid big data growth getting out of hand and costing you a fortune.

Solution:

The particular salvation of your company’s wallet will depend on your company’s specific technological needs and business goals. For instance, companies that want flexibility benefit from cloud solutions, while companies with extremely harsh security requirements go on-premises.

There are also hybrid solutions, in which parts of the data are stored and processed in the cloud and parts on-premises, which can also be cost-effective. Resorting to data lakes or algorithm optimizations can also save money:

  1. Data lakes can provide cheap storage opportunities for the data you don’t need to analyze at the moment.
  2. Optimized algorithms, in their turn, can reduce computing power consumption by a factor of 5 to 100, or even more, as the sketch after this list illustrates.
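
As a toy illustration of how much an algorithmic change can matter, compare a quadratic-time duplicate check with a linear-time one in Python (the data is synthetic, and the exact speedup grows with input size):

    def has_duplicates_quadratic(items):
        """O(n^2): compares every pair of items."""
        return any(items[i] == items[j]
                   for i in range(len(items))
                   for j in range(i + 1, len(items)))

    def has_duplicates_linear(items):
        """O(n): a single pass using a set of already-seen values."""
        seen = set()
        for item in items:
            if item in seen:
                return True
            seen.add(item)
        return False

    records = list(range(100_000))  # synthetic data with no duplicates
    # Both functions give the same answer, but the linear version does
    # ~100,000 membership checks instead of ~5 billion pairwise comparisons.
    print(has_duplicates_linear(records))  # False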

4. Complexity of managing data quality

Data from diverse sources

Sooner or later, you’ll run into the problem of data integration, since the data you need to analyze comes from diverse sources in a variety of different formats. For instance, e-commerce companies need to analyze data from website logs, call-centers, competitors’ website ‘scans’ and social media. Data formats will obviously differ, and matching them can be problematic. For example, your solution has to know that skis named SALOMON QST 92 17/18, Salomon QST 92 2017–18 and Salomon QST 92 Skis 2018 are the same thing, while companies ScienceSoft and Sciencesoft are not.

Unreliable data

Nobody is hiding the fact that big data isn’t 100% accurate, and, all in all, that’s not too critical. But it doesn’t mean that you shouldn’t control how reliable your data is at all. Not only can it contain wrong information, it can also contain duplicates and contradictions. And it’s unlikely that data of extremely inferior quality can bring any useful insights or shiny opportunities to your precision-demanding business tasks.

Solution:

There is a whole bunch of techniques dedicated to cleansing data. But first things first: your big data needs to have a proper model. Only after creating one can you go ahead and do other things, like:

  • Compare data to the single point of truth (for instance, compare variants of addresses to their spellings in the postal system database).
  • Match records and merge them if they relate to the same entity (as sketched after this list).
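
A minimal record-matching sketch in Python, using the standard library’s difflib (the 0.75 similarity threshold is an arbitrary assumption; real entity-resolution pipelines use far more sophisticated rules, precisely so that near-identical company names like those above are not merged by mistake):

    from difflib import SequenceMatcher

    def normalize(name):
        """Crude normalization: lowercase and collapse whitespace."""
        return " ".join(name.lower().split())

    def same_entity(a, b, threshold=0.75):
        """Treat two names as the same entity if they are similar enough."""
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

    variants = ["SALOMON QST 92 17/18",
                "Salomon QST 92 2017-18",
                "Salomon QST 92 Skis 2018"]
    for name in variants[1:]:
        print(name, "->", same_entity(variants[0], name))  # both True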

5. Dangerous big data security holes

Quite often, big data adoption projects put security off until later stages, and, frankly speaking, this is not a smart move. Big data technologies do evolve, but their security features are still neglected, since it’s hoped that security will be granted at the application level. And what do we get? In both cases (technology advancement and project implementation), big data security just gets cast aside.

Solution:

The precaution against possible big data security challenges is putting security first. It is particularly important at the stage of designing your solution’s architecture, because if you don’t attend to big data security from the very start, it will bite you when you least expect it.

6. Tricky process of converting big data into valuable insights:

Here’s an example: your super-cool big data analytics looks at what item pairs people buy (say, a needle and thread) solely based on your historical data about customer behavior. Meanwhile, on Instagram, a certain soccer player posts his new look, and the two characteristic things he’s wearing are white Nike sneakers and a beige cap. He looks good in them, and people who see that want to look this way too. Thus, they rush to buy a similar pair of sneakers and a similar cap. But in your store, you have only the sneakers. As a result, you lose revenue and maybe some loyal customers.
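
For context, the kind of pair analysis described above can be sketched in a few lines of Python (the transaction data is made up for illustration):

    from collections import Counter
    from itertools import combinations

    # Hypothetical purchase histories, one list of items per customer order.
    transactions = [
        ["needle", "thread"],
        ["needle", "thread", "scissors"],
        ["sneakers", "socks"],
    ]

    # Count how often each unordered pair of items is bought together.
    pair_counts = Counter()
    for basket in transactions:
        pair_counts.update(combinations(sorted(set(basket)), 2))

    print(pair_counts.most_common(1))  # [(('needle', 'thread'), 2)]

This is exactly the kind of analysis that sees only historical purchases; the social media signal in the story above never enters it.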

Solution:

The reason you failed to have the needed items in stock is that your big data tool doesn’t analyze data from social networks or competitors’ web stores, while your rival’s big data solution, among other things, does note trends in social media in near real time. Their shop has both items and even offers a 15% discount if you buy both.

The idea here is that you need to create a proper system of factors and data sources, whose analysis will bring the needed insights, and ensure that nothing falls out of scope. Such a system should often include external sources, even if it may be difficult to obtain and analyze external data.

7. Troubles of up-scaling

Your solution’s design may be well thought through and adjusted for up-scaling with no extra effort. But the real problem isn’t the actual process of introducing new processing and storage capacities; it lies in the complexity of scaling up in such a way that your system’s performance doesn’t decline and you stay within budget.

Solution:

The first and foremost precaution for challenges like this is a decent architecture for your big data solution. As long as your big data solution can boast such a thing, fewer problems are likely to occur later. Another highly important thing is to design your big data algorithms with future up-scaling in mind.

Besides that, you also need to plan for your system’s maintenance and support, so that any changes related to data growth are properly attended to. On top of that, holding systematic performance audits can help identify weak spots and address them in a timely manner.
