Big data is defined as a large amount of data that is processed through new
technologies and architectures so that useful information can be extracted
from it by capture and analysis. Because of its properties, such as volume,
velocity, variety, variability, value, complexity and performance, big data
poses many challenges. Testing this huge volume of data is a big challenge.
With the emergence of social media, cloud computing and smartphones,
industries have to deal with voluminous data. Big data provides solutions to
complex business problems: analysis of huge data sets serves as a basis for
faster and better decision making, and new products and services are being
developed for customers. Many organizations, however, face challenges in
defining test strategies for structured and unstructured data validation,
setting up an optimal test environment, working with non-relational databases
and performing non-functional testing. These challenges cause poor data
quality in production, delays in implementation and increased cost. This
report highlights the quality attributes that have to be supported throughout
a system that processes, stores and visualizes the data.
Big data is an all-encompassing
term for any collection of data sets so large and complex that it becomes
difficult to process them using traditional data processing applications. Big
data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable
elapsed time. Big data “size” is a constantly moving target, as of
2012 ranging from a few dozen terabytes to many petabytes of data. Big data is
a set of techniques and technologies that require new forms of integration to
uncover large hidden values from large datasets that are diverse, complex, and
of a massive scale. Big data uses inductive statistics and concepts from
nonlinear system identification to infer laws from large sets of data with low
information density to reveal relationships and dependencies and to predict
outcomes and behaviors [1].
Fig : 1
Testing big data is one of the biggest challenges faced by organizations
because of a lack of knowledge about what to test and how to test it. The
biggest challenges lie in defining test strategies for structured and
unstructured data validation, setting up an optimal test environment, working
with non-relational databases and performing non-functional testing. These
challenges cause poor data quality in production, delayed implementation and
increased cost [2].
Given its current popularity, the
definition of big data is rather diverse, and reaching a consensus is
difficult. Fundamentally, big data means not only a large volume of data but
also other features that differentiate it from the concepts of “massive data”
and “very large data”. Several definitions of big data have been proposed; one of them reads:
“Big data technologies describe a
new generation of technologies and architectures, designed to economically
extract value from very large volumes of a wide variety of data, by enabling
high-velocity capture, discovery, and/or analysis.” This definition delineates
the four salient features of big data, i.e., volume, variety, velocity and
value. As a result, the “4Vs” definition has been used widely to characterize
big data.
Even on the Web, where
computer-to-computer communication ought to be straightforward, the reality is
that data is messy. Different browsers send differently formatted data, users
withhold information, and they may be using different software versions or
vendors to communicate with you. Traditional relational database management
systems are well-suited for storing transactional data but do not perform well
with mixed data types. In response to the “three Vs” challenge, Hadoop, an open
source software framework, has been developed by a number of contributors.
Hadoop is designed to capture raw data using a cluster of relatively
inexpensive, general-purpose servers. Hadoop achieves reliability that equals
or betters specialized storage equipment by managing redundant copies of each
file, intentionally and carefully distributed across the cluster [3].
This report draws on secondary data collected from previous research papers
that discuss testing strategies for big data. Different testing types, such as
functional and non-functional testing, are required along with strong test
data and test environment management to ensure that data from varied sources
is processed without errors and is of good enough quality for analysis. The
flow of data in a big data application is described below.
Fig : 2
Data can come into big data systems from different sources such as sensors,
IoT devices, scanners, CSV files, census information, logs, social media and
traditional databases.
The application has to deal with huge data sets, and it has to clean and
validate the data before forwarding it.
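The cleaning and validation step described above could look roughly like the following minimal Python sketch. The field names (id, source, value) and the validation rules are assumptions made purely for illustration; they are not taken from any particular system.

```python
from typing import Optional

# Hypothetical schema: every record must carry these fields.
REQUIRED_FIELDS = ("id", "source", "value")


def clean_record(raw: dict) -> Optional[dict]:
    """Return a normalized copy of the record, or None if it is invalid."""
    record = {}
    for key, val in raw.items():
        # Trim surrounding whitespace from string values.
        if isinstance(val, str):
            val = val.strip()
        # Drop fields that are empty after trimming.
        if val not in ("", None):
            record[key] = val
    # Reject records that miss a required field.
    if any(field not in record for field in REQUIRED_FIELDS):
        return None
    # Assumed rule: "value" must be convertible to a number.
    try:
        record["value"] = float(record["value"])
    except (TypeError, ValueError):
        return None
    return record


def clean_batch(raw_records):
    """Split a batch into cleaned records and rejected raw records."""
    valid, rejected = [], []
    for raw in raw_records:
        cleaned = clean_record(raw)
        if cleaned is None:
            rejected.append(raw)
        else:
            valid.append(cleaned)
    return valid, rejected
```

Keeping the rejected records, rather than silently dropping them, lets the tester inspect why data from a given source failed validation.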
The Hadoop framework deals with huge data sets of petabytes and more.
We have to verify that the data has been properly imported into Hadoop.
This validation involves testing the correctness and completeness of the data.
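A completeness and correctness check of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: plain in-memory lists stand in for the source data and for the records read back after the import, no Hadoop API is used, and the key field name "id" is hypothetical.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Stable SHA-256 digest of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify_import(source_records, imported_records, key="id"):
    """Compare source and imported records by key and by content digest."""
    src = {r[key]: record_digest(r) for r in source_records}
    dst = {r[key]: record_digest(r) for r in imported_records}
    # Completeness: every source record must be present after the import.
    missing = sorted(k for k in src if k not in dst)
    unexpected = sorted(k for k in dst if k not in src)
    # Correctness: records present on both sides must have equal content.
    corrupted = sorted(k for k in src if k in dst and src[k] != dst[k])
    return {
        "source_count": len(src),
        "imported_count": len(dst),
        "missing": missing,
        "unexpected": unexpected,
        "corrupted": corrupted,
        "ok": not (missing or unexpected or corrupted),
    }
```

Comparing digests instead of whole records keeps the comparison cheap when records are wide, at the cost of not showing which field differs.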
We can validate the source data with a knowledge of SQL because the source data
The application will work on the data in Hadoop and process it as per the
To test the application we use test data. Because the data in Hadoop is huge,
we cannot use all of it for testing. We select a subset of the data, which we
call test data.
We will also run on the test data the same procedure that the user requires.
After that, we compare our results with the results of the processing from the
big data application to confirm that the application processed the data correctly.
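The subset, reprocess and compare approach described above can be sketched in Python. The word count used as the "procedure" is an assumption chosen only to keep the example small, and the function names are hypothetical; the output of the real application would be passed to compare_results as `actual`.

```python
import random
from collections import Counter


def draw_test_subset(records, size, seed=42):
    """Pick a reproducible random subset of the records as test data."""
    rng = random.Random(seed)
    return rng.sample(records, min(size, len(records)))


def reference_word_count(records):
    """Independent re-implementation of the expected processing."""
    counts = Counter()
    for rec in records:
        counts.update(rec["text"].lower().split())
    return dict(counts)


def compare_results(expected: dict, actual: dict):
    """Return (key, expected, actual) for every key whose values differ."""
    keys = set(expected) | set(actual)
    return sorted(
        (k, expected.get(k), actual.get(k))
        for k in keys
        if expected.get(k) != actual.get(k)
    )
```

A fixed seed makes the subset repeatable, so a failing comparison can be rerun on exactly the same test data.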
To process the test data we require some knowledge of Hive, Pig scripting, Python
Big data applications are developed to process large data sets. For
example, suppose you work for Facebook and the developers have built an
application in which any comment containing the phrase “nice profile picture”
is marked as spam. This is a simple example; real applications are usually
more complex and involve classifying patterns in the data and predicting what
is about to happen, using algorithms to discriminate spam comments from real
ones.
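The simple spam rule from the example, together with a check against labelled test comments, might be sketched like this in Python. The normalization (lower-casing and collapsing whitespace) is an assumption; the example itself only specifies the phrase.

```python
SPAM_PHRASE = "nice profile picture"


def is_spam(comment: str) -> bool:
    """Flag a comment that contains the spam phrase, ignoring case and spacing."""
    normalized = " ".join(comment.lower().split())
    return SPAM_PHRASE in normalized


def check_classifier(labelled_comments):
    """Return the (text, expected_label) pairs the rule gets wrong."""
    return [
        (text, expected)
        for text, expected in labelled_comments
        if is_spam(text) != expected
    ]
```

An empty list from check_classifier means the rule agrees with every label in the test data; any entry it returns is a case to investigate.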
After we process the data, we store it in a data warehouse.
Once the data is stored in the data warehouse, it can be validated once more
to guarantee that it matches the data that the big data application generated
after processing.
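One way to carry out this second validation is to reconcile aggregates between the processed output and the warehouse table. The Python sketch below uses an in-memory SQLite database as a stand-in for the data warehouse; the table name `results` and its columns are hypothetical.

```python
import sqlite3


def load_into_warehouse(conn, rows):
    """Store (category, amount) rows in the stand-in warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS results (category TEXT, amount REAL)")
    conn.executemany("INSERT INTO results (category, amount) VALUES (?, ?)", rows)
    conn.commit()


def reconcile(conn, processed_rows, tolerance=1e-9):
    """Return the categories whose row count or total differs from the warehouse."""
    # Aggregate the processed output per category: (row count, total amount).
    expected = {}
    for category, amount in processed_rows:
        count, total = expected.get(category, (0, 0.0))
        expected[category] = (count + 1, total + amount)
    # Aggregate what is actually stored in the warehouse.
    stored = {
        category: (count, total)
        for category, count, total in conn.execute(
            "SELECT category, COUNT(*), SUM(amount) FROM results GROUP BY category"
        )
    }
    mismatches = []
    for category in sorted(set(expected) | set(stored)):
        exp, got = expected.get(category), stored.get(category)
        if (exp is None or got is None or exp[0] != got[0]
                or abs(exp[1] - got[1]) > tolerance):
            mismatches.append(category)
    return mismatches
```

Aggregate reconciliation is cheaper than a row-by-row comparison, but it can miss errors that cancel out within a category.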
The data from the data warehouse is generally analyzed and shown in a visual
format so that Business Intelligence (BI) can be derived from it [5].
After the data is represented visually, it has to be validated.
Web services may be used to transfer the data from the data warehouse
to the BI system. In such cases, the web services will also be tested [4].
If the architecture of the big data application is poor, performance
bottlenecks in the process lead to unavailability of data. This can in turn
affect the success of the project. Performance testing of the
system is required to avoid these issues. Here we measure metrics such as
throughput, memory utilization, CPU utilization and the time taken to complete
a task. It is also difficult to automate the whole
process because this architecture deals with different types of data, so
unit testing is done in this paradigm.
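The metrics named above can be collected for a single task with the Python standard library, as in the following rough sketch. It measures one local callable only: CPU time is reported rather than a utilization percentage, peak memory covers only Python-level allocations traced by tracemalloc, and a real cluster would need its own monitoring tools.

```python
import time
import tracemalloc


def measure(task, records):
    """Run task(records) and report elapsed time, CPU time, memory and throughput."""
    tracemalloc.start()
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    result = task(records)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    # get_traced_memory() returns (current, peak) in bytes.
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "result": result,
        "elapsed_s": wall,
        "cpu_s": cpu,
        "peak_memory_bytes": peak,
        # Records processed per second of wall-clock time.
        "throughput_rps": len(records) / wall if wall > 0 else float("inf"),
    }
```

Running the same measurement on growing input sizes gives a first indication of where a bottleneck starts to appear.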
Big Data testing challenges
Compared to traditional testing, the architecture of the test environment for
big data is different, and testing a big data process differs somewhat from
testing any other process. Unstructured data requires a rethinking of
validation strategies, so routine work is transformed into R&D. Sampling data
for testing is no longer simply an option but a challenge: the sample has to
be representative of the entire batch.
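Stratified sampling is one possible way to keep a test sample representative of the batch; it is offered here as an illustration, not as the method used in the cited work. In this Python sketch the grouping field and the sampling fraction are assumptions chosen by the tester.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every group, with at least one record per group."""
    rng = random.Random(seed)
    # Group the records by the value of the chosen field.
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)
    sample = []
    # Iterate groups in a fixed order so the result is reproducible.
    for value in sorted(strata, key=str):
        group = strata[value]
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))
    return sample
```

Taking at least one record from every group ensures that rare data types still appear in the test data, which a plain random sample of the same size might miss.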
Quality factors for big data application validation
Fig : 3
System Data Security – This
parameter is used to evaluate the security of a big data based system from
different perspectives and at different levels.
System Robustness – This
parameter evaluates the ability of a system to resist change without adapting
its initial stable configuration [6].
System Consistency – This
parameter indicates the performance of the system, such as availability, response
time, throughput, scalability, etc.
System Correctness – Correctness is
related to the prediction pattern or model. For instance, some models are
better suited to predicting inflection points, while others do well in
predicting continuity. Thus, in order to verify the correctness
of the system effectively, engineers need to evaluate its prediction
capability under the specified conditions and environments.
System Completeness – This parameter is
used to validate whether the data is complete or not. Some big data
applications are developed to find previously unknown answers, so only
approximate solutions might be available [6].
As described by the 4Vs, the lack of knowledge
and the technical training of the testers as well as from the costs associated
The rising problem of cyberterrorism makes
security testing an integral component of any testing suite. Any
vulnerability in the infrastructure that gathers the Big Data, such as Wi-Fi or
sensors, could be exploited to gain access to the data lake and compromise the
organization’s records. The Hadoop framework is mostly considered quite insecure.
Therefore, a data penetration test is required.
Big Data testing should address each of the
problems raised by the 3Vs, to create the fourth – value. There are significant
differences between standard software and data testing related to
infrastructure, tools, processes and existing know-how. To cope with the
challenges posed by Big Data, testers need to use parallel computing, automate
testing and keep in mind issues related to data privacy and security. A clear
testing strategy for database testing and infrastructure followed by
performance and functional testing types also helps.
1. Kaur, P.D., A. Kaur, and S. Kaur, Performance Analysis in Bigdata.
International Journal of Information Technology and Computer Science (IJITCS),
2015. 7(11): p. 55-61.
2. J., C.d.l. Riva, and J. Tuya. Testing data transformations in MapReduce
programs. in Proceedings of the 6th International Workshop on Automating Test
Case Design, Selection and Evaluation. 2015. ACM.
3. J., et al., Structure and patterns of cross-national Big Data research
collaborations. Journal of Documentation, 2017. 73(6): p. 1119-1136.
4. V.D., Software Engineering for Big Data Systems, 2017, University of
Waterloo.
5. M., et al., Big data: Testing approach to overcome quality challenges. Big
Data: Challenges and Opportunities, 2013:
6. C., et al., A Practical Study on Quality Evaluation for Age Recognition
Systems.