Data Lakes Architecture


Data comes in many formats and structures, and administrators (DBAs) are there to manage it. As business rules and needs change and new technologies emerge, new technical terms keep appearing. One that is much talked about now is the data lake, a consequence of the exponential growth of data into terabytes and petabytes.

As the name suggests, a data lake is a body of data, with a lake as the analogy. A data lake stores everything: all forms of data, in whatever format they arise. The data can be structured, semi-structured, or unstructured, and the format can be character or binary. In short, a data lake is a repository that holds stored, replicated, or aggregated data. Data lakes are considered one of the most cost-effective ways to store an organization's data in raw form. Unlike data warehouses, they generally follow a flat structure. Each data element in a data lake is generally tagged with a set of metadata and a unique identifier.
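The idea of a flat structure with tagged elements can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the names LakeObject and FlatLake are hypothetical, and the "lake" is an in-memory dictionary rather than real storage.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake: the payload plus descriptive metadata."""
    payload: bytes                      # raw data, stored as-is
    metadata: dict                      # free-form tags, e.g. source and format
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatLake:
    """Hypothetical in-memory lake: one flat namespace keyed by object ID."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes, **metadata) -> str:
        obj = LakeObject(payload=payload, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id: str) -> LakeObject:
        return self._objects[object_id]

    def find(self, **tags) -> list:
        """Return objects whose metadata matches every given tag."""
        return [o for o in self._objects.values()
                if all(o.metadata.get(k) == v for k, v in tags.items())]


lake = FlatLake()
csv_id = lake.put(b"id,amount\n1,9.99\n", source="erp", format="csv")
img_id = lake.put(b"\x89PNG...", source="scanner", format="png")
```

There are no folders or tables here: character and binary data sit side by side, and the metadata is the only way to tell them apart.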

Data lakes serve various purposes, most commonly analysis and analytics, and they are used by researchers too. The data is processed as and when the need arises. Data lakes are mostly deployed in the cloud and are generally distributed so that they are globally accessible.

Architecture of Data Lakes

A data lake pivots on its architecture. With a versatile architecture, organizations can have seamless, high-performance analytics and governance even when data arrives from multiple locations. The architecture can vary depending on whether the lake is deployed on-site or in the cloud; here we discuss only a conventional data lake architecture, which revolves around the following three components: sources, the data processing layer, and targets.

A. SOURCES: Sources are what flood data into the lake. They carry the business data flowing into the data lake, and every data lake has varied data sources. Data migrators use ETL (extract, transform, load) and ELT (extract, load, transform) methodologies to process data from these sources. For ETL processing, sources are further categorized as homogeneous or heterogeneous.

  • Homogeneous Source: More straightforward and less complex. The data is structured and of similar types, and can be related with joins. These sources originate from databases such as Oracle Database, MS SQL Server, MySQL, etc.
  • Heterogeneous Source: More complex. The data can be structured, semi-structured, or even unstructured, often a hybrid of various structures and formats. Examples are flat files, EDI data, and HL7 health care data. Working with heterogeneous sources can be quite a brain-teaser for ETL and ELT professionals.
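The ETL and ELT approaches mentioned above differ mainly in where the transformation happens. The sketch below shows both side by side, assuming an in-memory SQLite database as a stand-in for the target store; the table names and sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
raw_rows = [("1", " 19.90 "), ("2", "5.10"), ("3", " 7.00")]

conn = sqlite3.connect(":memory:")

# ETL: transform in the pipeline, then load only the cleaned result.
conn.execute("CREATE TABLE sales_etl (id INTEGER, amount REAL)")
cleaned = [(int(i), float(a.strip())) for i, a in raw_rows]
conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)

# ELT: load the raw text untouched, then transform inside the target with SQL.
conn.execute("CREATE TABLE sales_raw (id TEXT, amount TEXT)")
conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
conn.execute(
    "CREATE TABLE sales_elt AS "
    "SELECT CAST(id AS INTEGER) AS id, CAST(TRIM(amount) AS REAL) AS amount "
    "FROM sales_raw"
)

etl_total = conn.execute("SELECT SUM(amount) FROM sales_etl").fetchone()[0]
elt_total = conn.execute("SELECT SUM(amount) FROM sales_elt").fetchone()[0]
```

Both routes end with the same cleaned numbers. The difference is that the ELT route also keeps sales_raw, the untouched raw copy, which is the pattern a data lake favors.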

Some data lake sources, explained in brief:

a) Business Applications: Business applications generate enormous amounts of data from the transactions occurring in the business. Such data comes from application packages like ERP, CRM, and SCM, or from custom legacy systems used to capture business transactions. These sources are mainly databases or file-based data.

b) EDW: An enterprise data warehouse (EDW) can be a source for a data lake, used together with other sources to create a consolidated data reference. It can be a standard RDBMS-based EDW or a cloud-based data warehouse. Connections to the underlying databases are established through connectors such as ODBC and JDBC, either by ETL tools such as SSIS or by ETL professionals.

c) Multiple Documents: Many business organizations store business data in structured flat files (Excel, TXT, CSV, etc.). These files hold information about business transactions, and data lakes are fed with many such flat files or documents. Semi-structured files like XML, JSON, and Avro are also used as data lake sources.

d) SaaS Applications: Many businesses now use SaaS-based applications, which are deployed in the cloud and managed by the provider. Examples are Salesforce CRM, Microsoft Dynamics CRM, SAP Business ByDesign, SAP Cloud for Customer, and Oracle CRM On Demand. Data generated by these applications is also a source for data lakes deployed in the cloud.

e) Log Files: Mainly device log files, which are generated automatically. Stored in a data lake, this information is useful for performance-related analysis. For example, server logs can be useful for load-balancing analysis.

f) IoT Transducers: The IoT market is skyrocketing, and so is the data generated by IoT transducers (devices). This data is generally processed in real time within the data lake setup.
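Several of the sources above deliver the same kind of information in different shapes. As a rough sketch of how a lake might take in CSV, JSON, and XML payloads and tag each record with its origin, consider the following; the sample payloads, source names, and the ingest function are assumptions made up for illustration, and only the Python standard library is used.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample payloads; real sources would be files or API responses.
csv_text = "order_id,amount\n1001,25.50\n"
json_text = '{"order_id": "1002", "amount": "12.00"}'
xml_text = "<order><order_id>1003</order_id><amount>8.75</amount></order>"


def parse_csv(text):
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def parse_json(text):
    return [json.loads(text)]


def parse_xml(text):
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in root}]


PARSERS = {"csv": parse_csv, "json": parse_json, "xml": parse_xml}


def ingest(text, fmt, source):
    """Parse one payload and tag each record with where it came from."""
    return [{"source": source, "format": fmt, "data": rec}
            for rec in PARSERS[fmt](text)]


records = (ingest(csv_text, "csv", "erp_export")
           + ingest(json_text, "json", "crm_api")
           + ingest(xml_text, "xml", "edi_feed"))
order_ids = [r["data"]["order_id"] for r in records]
```

The values are kept as raw strings on purpose: the lake stores what arrived, and any type conversion is left to later processing.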

B. DATA PROCESSING LAYER: This layer of the data lake architecture comprises various components. Its purpose is to provide high availability of data, even on demand. In a reliable data lake, the components that make up the data processing layer are datastores and a metadata store; data replicators are also important components that provide high data availability. The data processing layer should fully comply with data governance and data auditing requirements, with effective data security and scalability. Various third-party tools are used as the data processing layer in data lakes, including tools from AWS, Qubole, Infor, Azure, etc.
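To make the roles of the datastore, metadata store, and replicator more concrete, here is a toy sketch. It assumes plain in-memory dictionaries in place of real storage, and the ProcessingLayer class is hypothetical; it is not how any of the tools named above work internally.

```python
class ProcessingLayer:
    """Hypothetical processing layer: replicated datastores plus a metadata store."""

    def __init__(self, replica_count=2):
        # Each replica is a separate datastore holding a full copy of the data.
        self.replicas = [{} for _ in range(replica_count)]
        # The metadata store describes each object without holding its payload.
        self.metadata_store = {}

    def write(self, key, payload, **metadata):
        # The replicator step: copy every write to all replicas.
        for store in self.replicas:
            store[key] = payload
        self.metadata_store[key] = {"size": len(payload), **metadata}

    def read(self, key):
        # Serve from the first replica that still has the object.
        for store in self.replicas:
            if key in store:
                return store[key]
        raise KeyError(key)


layer = ProcessingLayer(replica_count=3)
payload = b"id,amount\n1,9.99\n"
layer.write("sales/2024.csv", payload, owner="finance")

# Simulate losing one datastore; the object is still readable.
layer.replicas[0].clear()
recovered = layer.read("sales/2024.csv")
```

Because every write goes to all replicas, losing a single datastore does not make the data unavailable, which is the point of the replicator.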


C. TARGETS: Targets are the outputs of the data lake: data that is processed and delivered to a target system or application. Various systems and applications draw data from data lakes, using APIs or data connectors.
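A data connector can be as simple as a function that selects what the target needs and delivers it in the format the target expects. The sketch below assumes a made-up list of processed records and a hypothetical export_csv function that hands one region's orders to a target as CSV text.

```python
import csv
import io

# Hypothetical processed records held in the lake.
lake_records = [
    {"region": "EU", "order_id": 1, "amount": 20.0},
    {"region": "US", "order_id": 2, "amount": 35.5},
    {"region": "EU", "order_id": 3, "amount": 12.5},
]


def export_csv(records, region):
    """Connector sketch: select one region and deliver it as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["order_id", "amount"],
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for rec in records:
        if rec["region"] == region:
            writer.writerow(rec)
    return out.getvalue()


eu_csv = export_csv(lake_records, "EU")
```

The lake keeps every field, while the target receives only the columns it asked for.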

Some Key Features of Data Lakes

Interface: A data lake should come with a flexible and diverse interface for managing data access. It should support heterogeneous environments through various APIs and facilities for uploading and moving data. This can be key in any data lake, since a data lake places no limit on the possible uses of its data.

Access Control: A well-defined access control list (ACL) should be implemented for granting access to data. It can be flat or hierarchical. A well-defined ACL ensures that all data use and access is legitimate, which is in fact the basis of data security. Owners of data should be able to set permissions on the data they own. Access control, encryption, and network security features are critical for data governance.
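A hierarchical ACL can be sketched as permissions attached to path prefixes, where a grant on a folder also covers everything beneath it. The users, paths, and function names below are hypothetical, and treating "write access" as ownership is a simplifying assumption for this example.

```python
# Hypothetical ACL: maps a path prefix to the permissions each user holds there.
ACL = {
    "/finance": {"alice": {"read", "write"}},
    "/finance/reports": {"bob": {"read"}},
}


def is_allowed(user, path, permission):
    """Hierarchical check: a grant on a folder applies to everything under it."""
    parts = path.strip("/").split("/")
    # Walk from the top-level folder down to the full path, checking each prefix.
    for depth in range(1, len(parts) + 1):
        prefix = "/" + "/".join(parts[:depth])
        if permission in ACL.get(prefix, {}).get(user, set()):
            return True
    return False


def grant(owner, user, path, permission):
    """Let a data owner (here: anyone with write access) grant a permission."""
    if not is_allowed(owner, path, "write"):
        raise PermissionError(f"{owner} does not own {path}")
    ACL.setdefault(path, {}).setdefault(user, set()).add(permission)
```

With this setup, alice can read and write anything under /finance, bob can only read under /finance/reports, and alice can call grant to share a single file with another user. A flat ACL would be the same table without the prefix walk: only exact path matches would count.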

Search & Catalogue: A data lake should not be treated as a dumping yard for data. Data is stored in a data lake for reuse, and the higher the share that is actually reused, even if not 100%, the more successful the data lake. Efficient search or data cataloging should be implemented for retrieving or mining data from the lake. The search features should include optimized key-value storage, metadata, tagging, or tools for collecting and classifying subsets of all objects, making data retrieval more efficient.
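One simple way to picture tag-based search is an inverted index that maps each tag to the objects carrying it. The sketch below is a minimal illustration with a hypothetical Catalog class and made-up object keys; a real catalog would persist the index and handle far more metadata.

```python
from collections import defaultdict


class Catalog:
    """Hypothetical catalog: an inverted index from tag to object keys."""

    def __init__(self):
        self._by_tag = defaultdict(set)   # tag -> keys carrying that tag
        self._metadata = {}               # key -> descriptive metadata

    def register(self, key, tags, **metadata):
        self._metadata[key] = metadata
        for tag in tags:
            self._by_tag[tag].add(key)

    def search(self, *tags):
        """Return the keys that carry every requested tag, sorted for stable output."""
        if not tags:
            return sorted(self._metadata)
        matches = set.intersection(*(self._by_tag.get(t, set()) for t in tags))
        return sorted(matches)


catalog = Catalog()
catalog.register("logs/web-01.log", ["logs", "web"], owner="ops")
catalog.register("logs/db-01.log", ["logs", "database"], owner="ops")
catalog.register("sales/2024.csv", ["sales", "finance"], owner="finance")
```

Searching by tag touches only the index, never the stored objects, so lookups stay cheap even when the lake itself is very large.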

Processing & Analytics: Data lakes are used by analysts, data scientists, machine learning engineers, and decision-makers, all of whom derive the greatest benefit from centralized, fully available data. The lake must therefore support the construction of, or connection to, processing and analytics layers that meet their various processing, transformation, aggregation, and analytical needs.