Generate Value out of Data
Handling Data – The storage and processing of data can become much simpler if we understand the true nature of the data. This is the essence of data modelling. Knowingly or not, we deal with data on a daily basis. We generate data and we consume data, and the amount of data generated is increasing manifold every day. Data storage, processing, and management have therefore become jobs of prime importance and value. The key is to understand the true nature of the data in the initial phases and then apply big data solutions and analytics to obtain the right measures and quality from it.
Why Big Data Modelling?
Understanding the semantics of data is one of the biggest challenges. The size of the data is only one of the factors that determine how well it can be modelled. Before analysing data, we need a paradigm to meaningfully determine its characteristics. Sources of data vary massively, so there cannot be one fixed set of manual instructions for modelling data the right way. There are, however, approaches and methodologies that, if followed efficiently, will improve the quality of the results.
There can be primarily three sources of data, to begin with.
1. Sources of Data
Human-generated data refers to the vast amount of social media data: status updates, tweets, photos, and videos. Information produced by digital or mechanical devices falls under machine-generated data. Organization-generated data generally refers to a more traditional type of data, including transaction information in databases and structured data often stored in data warehouses. The real value of any big data application can therefore be realized by integrating different types of data sources and analyzing them at scale.
2. Characteristics of Big Data
Volume, variety, and velocity are the main dimensions that characterize a big data set and describe its challenges: huge amounts of data are supplied in different formats and of varying quality, and need to be processed quickly. Veracity refers to any noise, abnormality, or unmeasurable uncertainty present in the data. Valence refers to the connectedness of big data. Taken together, these characteristics help derive the real value from data.
3. The From and To of Big Data Modelling
3.1. Acquire Data
This is the preliminary step, where we must identify the initial data sets. Retrieving and querying the raw input data also fall under ‘Acquire’.
3.2. Prepare Data
This step involves understanding the nature of the data and carrying out preliminary analysis. At this step, we can pre-process data to suit our enterprise’s EDH layers.
3.3. Analyze Data
An analytical technique should now be chosen to work on the processed data set. We can then build models around the data that will derive quality from it.
- Select Analytics Technique
- Build Models
3.4. Communicate Results
The results obtained must now be communicated to the end party for review. There can be repeated cycles to improve the analytics that were applied.
Applying the results to solve the problem statement is how we achieve the purpose of data modelling.
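The steps above can be read as a small pipeline. The sketch below is only illustrative: the function names, the sample records, and the use of a simple average as the analytics technique are all assumptions made for the example, not part of any particular toolkit.

```python
from statistics import mean

def acquire():
    # Hypothetical raw input; in practice this would be a query or a file read.
    return [
        {"user": "a", "amount": "10.5"},
        {"user": "b", "amount": None},
        {"user": "c", "amount": "4.5"},
    ]

def prepare(raw):
    # Drop records with missing values and convert amounts to numbers.
    return [
        {"user": r["user"], "amount": float(r["amount"])}
        for r in raw
        if r["amount"] is not None
    ]

def analyze(records):
    # Stand-in analytics technique: a count and an average.
    return {
        "count": len(records),
        "mean_amount": mean(r["amount"] for r in records),
    }

def communicate(result):
    # Turn the result into something the end party can read.
    return f"{result['count']} valid records, mean amount {result['mean_amount']:.2f}"

report = communicate(analyze(prepare(acquire())))
print(report)
```

Keeping each step as its own function makes it easy to repeat a cycle, for instance by swapping in a different `analyze` without touching acquisition or preparation.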
4. Big Data Technology Stack
Even looking at a small number of Hadoop stack components, we can already see that most of them are dedicated to data modelling and efficient processing of the data. Data modelling therefore sits at the heart of achieving big data solutions.
5. Big Data Infrastructure
5.1. Ingestion: Ingestion is the process of getting data into the data system that we are building or using. Automated data ingestion should be an integral part of a big data system, especially when it involves storing fast data.
a. Ingestion Infrastructure:
i. Questions To Ask:
ii. 2 Extreme Use Cases
So, the answers to the ingestion infrastructure questions may vary widely depending on the input source. Our data modelling approach should be able to absorb this range of variance among data sets smoothly.
5.2. Storage: The first issue is capacity. How much storage should we allocate? That is, what should the size of the memory be, how large and how many disk units should we have, and so forth? There is also the issue of scalability. Should the storage devices be attached directly to the computers, making direct IO fast but less scalable? Or should the storage be attached to the network that connects the computers in the cluster? This makes disk access a bit slower but allows one to add more storage to the system easily.
a. Storage Infrastructure: Using SSDs speeds up all lookup operations on data by at least a factor of ten over hard drives. Of course, the flip side of this is cost. The components become increasingly expensive as we go from the lower layers of the pyramid to the upper layers. So ultimately, it becomes a cost-benefit tradeoff.
5.3. Quality: We may, in essence, store the data efficiently. But is it any good? Are there ways of knowing whether the data is error-free and useful for the intended purpose? This is the issue of data quality. There are many reasons why any data application, especially a larger one, needs to be mindful of data quality. Data quality helps us achieve and retain:
5.4. Operations: Operations on the data set define the tasks that need to be applied to the data in order to generate the high-order outcome.
a. Operations Like:
b. Efficiency of Data Operations: Every operator must be efficient. That means every operator must perform its task as fast as possible while taking up as little memory, or disk, as possible.
6. Data Models, Operations & Constraints
Structured and Unstructured
Data Models define the characteristics of data. Let’s consider an example of a
lname : string,
Incoming data can also be unstructured, and data of unknown structure occurs often as well. Examples of unstructured data include image files and MP3 files.
The basic operations that can be performed on data include subsetting, union, projection, and join. All these operations help us break down or combine units or collections of data to create meaningful deductions. Example of a ‘Union’ operation: performing a union on two data collections eliminates the duplicate elements of the two inputs and creates a new dataset.
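As a minimal sketch of these four operations, the example below works on plain lists of records. The helper names and the sample data (including the `id`, `fname`, and `city` fields) are assumptions chosen for illustration; a real system would normally rely on its query engine for these operators.

```python
people_a = [
    {"id": 1, "fname": "Ann", "lname": "Lee"},
    {"id": 2, "fname": "Bob", "lname": "Ray"},
]
people_b = [
    {"id": 2, "fname": "Bob", "lname": "Ray"},
    {"id": 3, "fname": "Cy", "lname": "Orr"},
]
cities = [{"id": 1, "city": "Pune"}, {"id": 3, "city": "Oslo"}]

def subset(rows, predicate):
    # Subsetting: keep only the rows that satisfy a condition.
    return [r for r in rows if predicate(r)]

def project(rows, fields):
    # Projection: keep only the named fields of every row.
    return [{f: r[f] for f in fields} for r in rows]

def union(left, right):
    # Union: combine both inputs and drop duplicate rows, keeping first-seen order.
    seen, out = set(), []
    for r in left + right:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def join(left, right, key):
    # Join: pair rows from both inputs that share the same value for `key`.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

combined = union(people_a, people_b)
with_city = join(combined, cities, "id")
surnames = project(combined, ["lname"])
later_ids = subset(combined, lambda r: r["id"] > 1)
```

Here `combined` holds three rows, because the record for Bob appears in both inputs and is kept only once.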
Constraints are logical statements that must hold true for the data. Traditional constraints include, but are not limited to, value constraints, uniqueness constraints, cardinality constraints, and type constraints.
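The sketch below shows one way such constraints might be checked on a collection of records. The field names (`id`, `age`, `phones`) and the specific limits are hypothetical, picked only to give each constraint type a concrete instance.

```python
def check_constraints(rows):
    """Return a list of (row_index, message) pairs for every violated constraint."""
    violations = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Type constraint: lname must be a string.
        if not isinstance(row.get("lname"), str):
            violations.append((i, "type: lname must be a string"))
        # Value constraint: age must lie in an assumed plausible range.
        age = row.get("age")
        if not isinstance(age, int) or not 0 <= age <= 150:
            violations.append((i, "value: age must be an integer in 0..150"))
        # Uniqueness constraint: no two rows may share an id.
        if row.get("id") in seen_ids:
            violations.append((i, "uniqueness: duplicate id"))
        seen_ids.add(row.get("id"))
        # Cardinality constraint: at most two phone numbers per row (assumed limit).
        if len(row.get("phones", [])) > 2:
            violations.append((i, "cardinality: more than two phones"))
    return violations

rows = [
    {"id": 1, "lname": "Lee", "age": 30, "phones": ["111"]},
    {"id": 1, "lname": 42, "age": 200, "phones": ["111", "222", "333"]},
]
violations = check_constraints(rows)
```

The first row passes every check, while the second row violates all four constraints.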
Vector and Graph Data Model:
Vector data model: Let’s consider input data in the form of text. Text is a classic example of unstructured data in its true sense. How would we model such data, with its irregular line breaks, punctuation marks, and strings? The vector data model was introduced to create a structure from such input. We split the text into a number of documents and analyse them on some predefined parameters so that a term can be searched for across the entire input. This representation is called a document vector.
Term frequency is the number of occurrences of a term in the respective document.
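A minimal sketch of building term-frequency document vectors is shown below. The two sample documents and the simple lowercase tokenizer are assumptions for illustration; production systems usually add steps such as stop-word removal and weighting.

```python
import re
from collections import Counter

# Hypothetical input with irregular punctuation, spacing, and line breaks.
documents = [
    "Big data needs modelling. Modelling gives data structure!",
    "Streaming data\narrives   fast.",
]

def tokenize(text):
    # Lowercase the text and keep runs of letters and digits as terms.
    return re.findall(r"[a-z0-9]+", text.lower())

# Term frequency per document.
counts = [Counter(tokenize(d)) for d in documents]

# Shared vocabulary: every term seen in any document, in sorted order.
vocabulary = sorted(set().union(*counts))

# One vector per document, one position per vocabulary term.
vectors = [[c[term] for term in vocabulary] for c in counts]

for doc_vector in vectors:
    print(dict(zip(vocabulary, doc_vector)))
```

Every document ends up with a vector of the same length, so documents can be compared or searched position by position regardless of how messy the original text was.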
A Graph data model can help establish relationships between the different entities and properties within the dataset. We can then extend this data model by adding attributes of our own that can enrich the data operations to be performed.
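As a rough sketch of this idea, the class below stores entities as nodes with attributes and relationships as labelled edges. The class name, method names, and sample entities are all hypothetical; dedicated graph databases provide far richer models.

```python
from collections import defaultdict

class PropertyGraph:
    """A tiny directed graph whose nodes and edges carry attributes."""

    def __init__(self):
        self.nodes = {}                 # node id -> attribute dict
        self.edges = defaultdict(list)  # source id -> [(target, label, attrs)]

    def add_node(self, node_id, **attrs):
        # Create the node if needed, then add or overwrite its attributes.
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_edge(self, src, dst, label, **attrs):
        # Make sure both endpoints exist before linking them.
        self.add_node(src)
        self.add_node(dst)
        self.edges[src].append((dst, label, attrs))

    def neighbours(self, node_id, label=None):
        # Targets reachable from node_id, optionally filtered by edge label.
        return [
            dst
            for dst, lbl, _ in self.edges.get(node_id, [])
            if label is None or lbl == label
        ]

graph = PropertyGraph()
graph.add_node("alice", kind="person")
graph.add_edge("alice", "bob", "follows", since=2020)
graph.add_edge("alice", "post1", "wrote")

# Extend the model later with an attribute of our own.
graph.add_node("alice", city="Pune")
```

Because attributes live in ordinary dictionaries, new properties can be attached to existing nodes without changing the rest of the structure.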
Other forms of data models: An array can help us model data too.
7. Data Streams
Live data is known as streaming data. Creating solutions for big data in action is a challenging task, but one that holds utmost value if done properly. Streaming data is near-real-time data and may require independent computations to be performed.
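One way to picture computation on a stream is to update a result as each value arrives, without storing the whole stream. The sketch below keeps a running mean in constant memory; the `sensor_readings` generator and its values are hypothetical stand-ins for a live source.

```python
def running_mean(stream):
    # Update the mean incrementally so no past values need to be kept.
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count
        yield mean

def sensor_readings():
    # Hypothetical live source; a real one would block until new data arrives.
    yield from [10.0, 20.0, 30.0, 40.0]

means = list(running_mean(sensor_readings()))
print(means)
```

Each yielded value is the mean of everything seen so far, so a consumer can react to the stream while it is still flowing.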
Big data is complex and introduces a new challenge every day. The only way to develop better big data solutions is to acknowledge the variety of big data problems and address them. A better big data solution starts with a better understanding of big data modelling. We must give weight to understanding the nature of the data, and we must apply varied test cases to validate it against varying conditions. Once we can model data efficiently, we can generate better value out of it.
Datametica Solutions Private Limited