Cite this as
Ahmad M, Siddiqui Z (2016) Leveraging Data Analytics by Transforming Relational Database Schema in to Big Data. Trends Comput Sci Inf Technol 1(1): 012-017. DOI: 10.17352/tcsit.000002

The growth of data and its efficient handling has become an increasingly prominent trend in recent years, bringing new challenges and opening new avenues to explore. Data analytics can be performed more efficiently with the distributed architecture of "Not Only SQL" (NoSQL) databases. The technology around us is changing rapidly, and a major shift is under way from the relational to the non-relational world; more precisely, from traditional relational database models to non-relational database models. When moving from relational to non-relational models, database administrators face common issues because NoSQL databases are schema-less. Logically mapping a schema from a relational to a non-relational model is complex, and there is no standard process for it. The purpose of this research is to propose a mechanism by which the schema of a relational database management system and its data can be transformed into Big Data by following a set of standardized rules. This model can be very useful for relational database administrators, enabling them to focus on logical modeling instead of writing procedures for every SQL-to-NoSQL transition. In this paper, we study both models and focus our research on presenting a set of rules and a framework that can be used to apply the transformation in a seamlessly manageable way.
HDFS: Hadoop Distributed File System; SQL: Structured Query Language; NoSQL: Not Only Structured Query Language; RDBMS: Relational Database Management Systems; ACID: Atomicity, Consistency, Isolation, Durability; BASE: Basically Available, Soft State, Eventually Consistent; CAP: Consistency, Availability, Partition Tolerance; HUSH: HBase URL Shortener
In contrast with traditional relational databases, NoSQL databases offer four key advantages: easy extensibility, high performance, a flexible data model, and high availability [1].
Previous research on this topic is highly generalized and does not target a specific Relational Database Management System (RDBMS) product or practical data collected from a real-time system [2-10].
Studies have shown that this area needs a standard set of rules to help database administrators carry out the transformation task seamlessly [11].
This research is therefore unique in that it not only proposes a set of rules and a model but also targets the transformation between specific software products, from Microsoft SQL Server to Hadoop HBase, based on a real-world case study.
The term and idea of NoSQL was first used in 1998 by Carlo Strozzi to refer to an open-source database that did not depend on a SQL interface [12].
Structured Query Language (SQL) is a programming language used for storing and managing data in an RDBMS. When it comes to massive information, the Big Data domain relies primarily on the NoSQL programming model [1]. NoSQL has been observed as an alternative to the customary RDBMS models. While SQL is utilized by industry-leading database vendors such as Oracle, DB2, and MS-SQL [13], there are various NoSQL models and products, such as Hadoop, HBase, MongoDB, Cassandra, Tarantool, and Apache Spark [14].
To validate our framework, we implemented it on a real-world application that uses a relational database for online processing and produces a large amount of RDBMS data, which is then transformed into Big Data to take advantage of it, specifically when the data needs to be analyzed for business decisions.
NoSQL empowers application management and minimizes the need for application modifications or database schema changes. Additionally, as data expands, NoSQL databases offer better and simpler horizontal scalability. These databases can take advantage of new clusters and nodes transparently, without requiring involvement from database administrators or manual distribution of data across different nodes [12].
RDBMSs have been used for decades and are still a viable solution for many use cases. For modern applications, however, flexibility in both scaling models and data models is a mandatory requirement. Due to continuous data growth, complex server resources must be added regularly to serve more users.
In this research, the term "Hadoop" refers to the Hadoop Distributed File System (HDFS) as well as to other open-source software products and technologies developed by the Apache Software Foundation and various other open-source software vendors [15].
SQL databases, referred to as RDBMSs (Relational Database Management Systems), are the most widely used and proven approach to database solutions. In an RDBMS, data is stored in a structured pattern using tables and the relations between them. Although SQL is capable of managing huge amounts of data, it does not technically provide an optimal solution to current Big Data requirements, which demand fast response, vast scalability, and high availability [16].
With the improvement of distributed computing and cloud frameworks, more applications are migrating to cloud environments in order to utilize their computing power and leverage the benefits of scalability. In the initial era of distributed computing and non-relational database models, Google and Amazon were the first to propose new alternatives for data management. The lack of commercially available alternatives at that time led to the popularity of their frameworks. The non-relational database technologies they proposed now fulfill the majority of the needs of modern software systems.
Apache HBase is one such system: a distributed open-source database modeled after Google Bigtable [17] that provides random access to large amounts of data. Apache HBase is becoming an increasingly popular choice for database applications these days [18].
When comparing relational and non-relational databases, two sets of properties are mainly discussed (Figure 1):
1. ACID Properties (Relational Databases)
2. BASE Properties (Non-Relational Databases)
ACID stands for Atomicity, Consistency, Isolation, and Durability. This is the core principle behind RDBMS databases and is mainly used for transactions. A transaction can be defined as a logical operation on the data. The ACID properties are key to ensuring the integrity of the data [19].
BASE is the acronym for Basically Available, Soft State, Eventually Consistent. It emphasizes availability and scalability through partition tolerance. It is simple and fast, and a good choice where data availability and speed are the primary concerns; however, it provides only weak consistency.
Few studies have been conducted on database schema conversion between relational and non-relational databases, and there are even fewer examples of an automated conversion model to get started with such scenarios [20]. This is a common scenario faced by every database administrator or developer during the transformation from a relational to a non-relational model [21]. This led us to investigate a more systematic approach to migration. Migration is challenging and non-trivial in two respects: schema translation and data translation.
A simple export and import between identical data stores does not solve the problem, as one needs to know exactly what is being imported and why.
This paper is organized as follows: Section III comprises the literature review, in which related studies are discussed. Section IV highlights the proposed approach. Section V contains the experimental details and the evaluation of the study, followed by the conclusion and future work in Section VI.
For RDBMSs, the most commonly followed architecture is a client-server architecture in which servers are equipped to handle database application tasks. This architecture has proven effective and is highly optimized. However, as data grows, an RDBMS cannot provide the best results, specifically for read operations. Log-based, fast-growing databases require a different type of database that is capable of supporting fast read operations. NoSQL is the answer for such situations: NoSQL databases are proven to be best suited to these scenarios and provide ample control over scalability and high availability, together with a distributed architecture that spreads the data over various nodes [22].
We studied various NoSQL databases and, based on this study, found HBase to be best suited for read-based operations [12].
Considering HBase table schema design, there are basically two approaches. The first is the "Tall Narrow" design, which is feasible for data sets where a table has a large number of rows but a small number of columns. The second is the "Flat Wide" design, which can be adopted where a table has a large number of columns but a small number of rows [11]. The sketch below contrasts the two designs.
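To make the contrast concrete, the following minimal sketch writes one sample cell in each style using the HBase Java client API. The table names, the column family "d", and the key formats are illustrative assumptions, not part of the original study.

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowDesigns {
        static void writeSamples(Connection conn) throws Exception {
            // Tall Narrow: one row per reading; rows grow, columns stay few.
            try (Table tall = conn.getTable(TableName.valueOf("readings_tall"))) {
                Put p = new Put(Bytes.toBytes("machine01#20160101T0800")); // machine + time in the key
                p.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("42"));
                tall.put(p);
            }
            // Flat Wide: one row per machine; every new reading adds a column.
            try (Table wide = conn.getTable(TableName.valueOf("readings_wide"))) {
                Put p = new Put(Bytes.toBytes("machine01")); // one fixed row per machine
                p.addColumn(Bytes.toBytes("d"), Bytes.toBytes("20160101T0800"), Bytes.toBytes("42"));
                wide.put(p);
            }
        }
    }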
Another approach to designing the schema architecture for a relational database is based on a log file that stores operations, configurations, modifications, and query processes. Log files are usually used to monitor the database and track database-wide operations. Additionally, the logged operations can be linked to the queries accessing the database. Therefore, by analyzing the log file, we can identify the tables that are accessed most frequently during query processing [23].
We studied a database called HUSH (HBase URL Shortener), which provides schemas for both SQL and NoSQL versions [24]. HUSH sets up a very specific set of tables containing an equally specific set of data. This makes it easy to understand what is given for transformation and how each operation is performed. Evaluating the HUSH database therefore helped our study by clarifying the schema models of both SQL and NoSQL databases.
RAD (Rapid Application Development): a market and data requirement for applications targeting the Big Data model, so that the transformation can be performed in a fast and efficient manner [11].
Scalability: the user demand to keep up with the constantly growing throughput of the data and access to it [11].
Consistent performance: low response time is the key to success when handling large amounts of data and is vital for its growth [11].
Operational reliability: Built-in High Availability [11].
Very few examples of database schema conversion between relational and non-relational databases currently exist in the literature, and there are even fewer examples of an automated conversion tool, or even anything to get started with [20]. The topic of Big Data is very vast and much research has already been done, but there are still avenues that need to be explored in conjunction with ongoing research.
This research is unique in two different ways:
It targets the transformation of a specific RDBMS platform, namely Microsoft SQL Server.
It targets real-time data from a specific business domain that operates automated machines. These machines generate a very large amount of data during the production life cycle.
Our literature survey found no study or experiment conducted on a specific industry, such as the textile industry. In our case, the data is collected and later analyzed for business decisions through data analytics. The size of the data can be gauged from the actual machine units running on a 24-hour basis (unless a stoppage occurs for a reason such as a machine fault or an electricity outage).
Table 1 explains the daily transactional growth of the database tables, listing only the largest tables. When this growth is accumulated over time, it produces data volumes in the terabyte range. This data needs to be analyzed through read operations in order to gain in-depth knowledge and perform data analysis. Our practical use case shows that the data requires only read operations once a transaction has been processed by the RDBMS. Therefore, based on this study, we chose HBase as our non-relational model and mapped the SQL schema of the large tables to a NoSQL schema using our proposed approach.
Migration can be made straightforward provided that practitioners follow established principles. The aim of this study is to simplify and standardize the transformation process. In the following sections, we demonstrate the use of the rules and perform experiments that can be evaluated by the NoSQL community for validation.
The case study and experiments were performed on a data set generated by 4 out of 400 automated machines, which produce approximately 0.1 million units per day. Each unit can be further categorized into an average of 100 sub-units, which sums to almost 0.1 million × 100 = 10 million records generated each day from one production environment. Such a huge volume of data must be scaled by utilizing Big Data's parallelism and scalability, and generating decision-based alerts on it would be a major breakthrough.
There are two primary objectives of standardized schema transformation:
• Aim to minimize the work of database administrators, and
• Set transformation standards in order to achieve a seamless transition from SQL Server tables to HBase tables.
SQL Server Table Schema and Metadata
We start by standardizing the SQL Server table metadata in order to drive the transformation process. The table metadata consists of several fields, namely column name, data type, default value, computed value, and column description.
• Our model is based on the metadata information stored in the 'Description' field.
• In any SQL Server table, we set the 'Description' field to hold the name that needs to be mapped to the HBase column.
• Our schema mapping program looks for the fields with this metadata information and picks only those fields for the mapping, as sketched in the query below.
• Figure 2 shows how the transformation takes place once the metadata is set.
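As an illustration, the following minimal sketch reads such a mapping via JDBC. It assumes that the 'Description' text is stored in SQL Server's standard MS_Description extended property, and the source table pp.production and the connection string are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MappingReader {
        public static void main(String[] args) throws Exception {
            // Placeholder connection string; adjust server, database, and credentials.
            String url = "jdbc:sqlserver://localhost;databaseName=ProductionDB;integratedSecurity=true";
            // SQL Server stores the table designer's 'Description' text in the
            // MS_Description extended property; pp.production is a placeholder table.
            String sql =
                "SELECT c.name AS sql_column, CAST(ep.value AS NVARCHAR(256)) AS hbase_column " +
                "FROM sys.columns c " +
                "JOIN sys.extended_properties ep " +
                "  ON ep.major_id = c.object_id AND ep.minor_id = c.column_id " +
                "WHERE ep.name = 'MS_Description' " +
                "  AND c.object_id = OBJECT_ID('pp.production')";
            try (Connection con = DriverManager.getConnection(url);
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    // Only columns with a mapping set are returned, mirroring the rule above.
                    System.out.println(rs.getString("sql_column") + " -> " + rs.getString("hbase_column"));
                }
            }
        }
    }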
From a logical perspective, the HBase model comprises four major components [25].
• A Table is a collection of column families
• A Column Family is a logical and physical grouping of columns and is referred to as the basic storage unit.
• Columns in HBase are very different from columns in the traditional RDBMS model. A column exists only when data is inserted into it and can have multiple versions, each identified by a timestamp generated automatically at insertion time. Therefore, for a single row key there can be multiple versions, and only the latest version is returned unless another version is queried explicitly. Every row in a table can have a different set of columns, identified by its row key.
• A Row Key is an implicit primary key that uniquely identifies a record. Keys are stored in sorted order and therefore allow efficient query processing (Figure 3). The sketch below illustrates row keys, column families, and versioning.
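The following minimal sketch demonstrates this versioning behavior with the HBase Java client. The table name, the family "pd", and the row key are illustrative, and it assumes the column family was created with VERSIONS >= 2 (recent HBase releases retain only one version by default).

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VersionDemo {
        static void demo(Connection conn) throws Exception {
            try (Table t = conn.getTable(TableName.valueOf("production"))) {
                byte[] cf = Bytes.toBytes("pd");             // column family
                byte[] row = Bytes.toBytes("machine01#u1");  // row key
                // Two inserts into the same cell create two timestamped versions.
                t.put(new Put(row).addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("queued")));
                t.put(new Put(row).addColumn(cf, Bytes.toBytes("status"), Bytes.toBytes("done")));
                // A plain Get returns only the latest version ("done") ...
                Result latest = t.get(new Get(row));
                System.out.println(Bytes.toString(latest.getValue(cf, Bytes.toBytes("status"))));
                // ... unless older versions are requested explicitly.
                Result history = t.get(new Get(row).readVersions(2));
                System.out.println(history.getColumnCells(cf, Bytes.toBytes("status")).size());
            }
        }
    }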
For the conversion of a one-to-one relationship, there are two conceivable approaches to translate it into HBase tables:
• Single HTable
The first is to combine the two tables into one HTable. The resulting HTable can have one of two configurations: 1) a single column family, or 2) two (or more) column families.
• Two or Multiple HTables
The second conceivable configuration is to create two HTables, where each HTable contains one column family and every column family contains all the SQL columns. Finally, the row key of each HTable is embedded into both HTables. A sketch of the single-HTable variant with two column families is given below.
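As an illustration of the first approach, this minimal sketch creates one HTable with two column families, one per original SQL table. The names order_with_invoice, order, and invoice are hypothetical.

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

    public class OneToOneSchema {
        static void create(Admin admin) throws Exception {
            // Single-HTable variant: each original SQL table becomes a column family.
            TableDescriptor td = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("order_with_invoice"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("order"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("invoice"))
                .build();
            admin.createTable(td);
        }
    }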
In the transformation of a one-to-many relationship, there are likewise two conceivable approaches to transform it into HBase:
• Single HTable
The first is to combine the two tables into one HTable. The resulting HTable can have one of two configurations: 1) a single column family, or 2) two (or more) column families.
• Two or Multiple HTables
Each SQL table is mapped to a separate HTable with a single column family. The primary HTable holds the actual relationship: an additional column family is included that holds the row keys referencing the second HTable. A sketch of this reference scheme follows.
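One hedged way to realize the reference column family is sketched below: each referenced row key from the "many"-side HTable becomes a column qualifier in a "ref" family on the "one" side. All table, family, and key names here are assumptions made for illustration.

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class OneToManyLink {
        static void link(Connection conn) throws Exception {
            // The "one" side carries an extra family ("ref") whose column qualifiers
            // are the row keys of the related rows in the "many"-side HTable.
            try (Table machines = conn.getTable(TableName.valueOf("machine"))) {
                Put p = new Put(Bytes.toBytes("machine01"));
                p.addColumn(Bytes.toBytes("ref"), Bytes.toBytes("production#row1"), new byte[0]);
                p.addColumn(Bytes.toBytes("ref"), Bytes.toBytes("production#row5"), new byte[0]);
                machines.put(p);
            }
        }
    }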
An experiment was carried out on the data set listed in Table 2; the process is as follows:
• Step I: SQL Server table columns were marked by adding the metadata information to 'Description'. Within a table, only those columns that need to be imported and mapped to the HBase table have their 'Description' metadata defined.
• Step II: Our script checks for the metadata information through a SQL query run against the SQL Server database's metadata, picks only those columns, and generates a list.
• Step III: As explained in Table 4, based on the metadata information, we generate the HBase CREATE TABLE scripts for those SQL Server tables. This gives developers and administrators pre-generated scripts that are ready to run in the Hadoop HBase environment. A sketch of this generation step follows.
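A minimal sketch of Step III is shown below, emitting an HBase shell create statement from the column-family list collected in Step II. The table and family names used in main() are placeholders.

    import java.util.Arrays;
    import java.util.Collection;

    public class ScriptGenerator {
        // Builds an HBase shell 'create' statement from the column families
        // derived in Step II; the real names come from the 'Description' metadata.
        static String createStatement(String hbaseTable, Collection<String> columnFamilies) {
            StringBuilder sb = new StringBuilder("create '" + hbaseTable + "'");
            for (String cf : columnFamilies) {
                sb.append(", '").append(cf).append("'");
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Prints: create 'production', 'pd', 'ref'
            System.out.println(createStatement("production", Arrays.asList("pd", "ref")));
        }
    }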
Table 3 shows the workloads on which we propose to perform performance testing; as a future study, the results can be compared with RDBMS query times.
Figure 4 depicts the virtual machine setup of the Hadoop environment used to perform the tests.
Given the RDBMS sample table [pp.production] from Table 2, Table 3 shows sample data generated by four machines. This RDBMS table model was transformed to the HBase model using our proposed approach.
We take Row #1 and Row #5, which belong to the same machine but record different outputs during production scheduling based on the shifts.
Table 4 presents the mapping of this table structure from SQL Server to HBase tables. It explains how single and multiple column families can be used and how the timestamp column works (Figure 5).
In this research, we studied the problem of transforming SQL to NoSQL and provided a general-purpose solution that can be utilized in practical scenarios. Our initial results show that this approach has great potential for transforming large volumes of data from SQL Server to Big Data, with a schema transformation that is made easier by our approach.
We plan to extend this study by analyzing the performance improvement gained by transforming conventional data into a Big Data structure and by producing a highly scalable analytical solution, particularly targeting the very high-volume production data of the process manufacturing sector.
As future work, this study can be extended to produce a commercial solution that attaches to a relational database engine and transforms its data into Big Data in a very efficient manner. Industries all over the world that manufacture products on automated machines can benefit from this solution.