Changes

no edit summary
Line 137: Line 137:  
   <h4>Challenges</h4>
 
 
   <p>Although Data Lake technology has many benefits for organizations dealing with big data, it has its own challenges. For example:</p>
   <p><b>Data Governance and Semantic Issues</b></p>
+
   <p><b><u>Data Governance and Semantic Issues</u></b></p>
 
   <p class="expand inline mw-collapsible-content">The biggest challenge for Data Lakes is to resolve assorted data governance requirements in a single centralized data platform. Data Lakes fail mostly when they lack governance, self-disciplined users, and a rational data flow.</p><p class="inline">Often, Data Lake implementations are focused on storing data instead of managing the data. Data Lakes are not optimized for semantic enforcement or consistency. They are made for semantic flexibility, to allow anyone to provide context to data if they have the skills to do so. </p>
 
 
   <p>Putting data in the same place does not remove its ambiguity or settle its meaning. Data Lakes provide an unconstrained, “no compromises” storage environment without the data governance assurances common to data warehouses or data marts. Proper metadata is essential for a Data Lake; without appropriate metadata, the Data Lake will not work as intended. It is helpful to think of metadata as the fish finder in the Data Lake.</p>
   <p><b class="expand mw-collapsible-content">Lack of Quality and Trust in Data</b></p>
+
   <p><b class="expand mw-collapsible-content"><u>Lack of Quality and Trust in Data</u></b></p>
 
   <p class="expand mw-collapsible-content">Data quality and trust in the data are perennial issues for many organizations. Although data discovery tools can apply Machine Learning across related datasets from multiple data sources to identify anomalies (incorrect values, missing values, duplicates and outdated data), the quality and trustworthiness of data continue to be an issue for Data Lakes, which can easily become data dumping grounds. Some data is more accurate than other data. This presents a real problem for anyone who combines multiple datasets and makes decisions based on analysis of data of varying quality.</p>
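   <p>The four anomaly types named above can be illustrated with a minimal rule-based sketch. This is not the Machine Learning approach that data discovery tools use; it is a plain Python check, and the function name <code>profile_records</code>, the field names, the valid range, and the age threshold are all assumptions made for illustration.</p>

```python
from datetime import date

def profile_records(records, required, valid_ranges, max_age_days, today):
    """Count four kinds of anomalies in a list of dict records (illustrative only)."""
    report = {"missing": 0, "incorrect": 0, "duplicates": 0, "outdated": 0}
    seen = set()
    for rec in records:
        # Missing values: a required field is absent or None.
        if any(rec.get(field) is None for field in required):
            report["missing"] += 1
        # Incorrect values: a numeric field falls outside its allowed range.
        for field, (low, high) in valid_ranges.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                report["incorrect"] += 1
                break
        # Duplicates: a record with identical content was already seen.
        key = tuple(sorted(rec.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        # Outdated data: the record was last updated too long ago.
        updated = rec.get("updated")
        if updated is not None and (today - updated).days > max_age_days:
            report["outdated"] += 1
    return report

# Hypothetical sample data: one clean record and one of each anomaly type.
records = [
    {"id": 1, "age": 34, "updated": date(2024, 1, 10)},
    {"id": 2, "age": None, "updated": date(2024, 1, 12)},  # missing value
    {"id": 3, "age": 212, "updated": date(2024, 1, 15)},   # incorrect value
    {"id": 1, "age": 34, "updated": date(2024, 1, 10)},    # duplicate
    {"id": 4, "age": 51, "updated": date(2019, 6, 1)},     # outdated
]
report = profile_records(
    records,
    required=["id", "age", "updated"],
    valid_ranges={"age": (0, 120)},  # assumed plausible range
    max_age_days=365,                # assumed freshness threshold
    today=date(2024, 2, 1),
)
print(report)
```

   <p>A report like this gives analysts a rough signal of how far a dataset can be trusted before it is combined with others.</p>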
   <p><b>Data Swamps, Performance, and Flexibility Challenges</b></p>
+
   <p><b><u>Data Swamps, Performance, and Flexibility Challenges</u></b></p>
 
   <p class="expand inline mw-collapsible-content">Data stored in Data Lakes can become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. </p><p class="inline">A Data Lake is not optimized for a high number of users or for diverse, simultaneous workloads, because its query tasks are resource-intensive. Performance degradation and failures are therefore common when extraction, transformation, and loading tasks all run at the same time. On-premises Data Lakes face a further performance challenge in that their configuration is static. </p>
   <p><b class="expand mw-collapsible-content">Data Hoarding and Storage Capacity</b></p>
+
   <p><b class="expand mw-collapsible-content"><u>Data Hoarding and Storage Capacity</u></b></p>
 
   <p class="expand mw-collapsible-content">Data stored in a Data Lake may never be used in production and can sit unused indefinitely. By storing massive amounts of historical data, the infinite Data Lake may skew analysis with data that is no longer relevant to the priorities of the business. Keeping historical data also means that the metadata describing it must be kept and understood. This reduces the performance of the Data Lake, because employees must spend more effort cleaning out datasets that are no longer used for analysis.</p>
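   <p>One way to keep unused data visible is to review access metadata periodically. The sketch below assumes a hypothetical catalog in which each dataset records its name, its size, and the date it was last read; the function name <code>find_stale_datasets</code> and the two-year idle threshold are illustrative assumptions, not part of any particular Data Lake product.</p>

```python
from datetime import date

def find_stale_datasets(catalog, today, max_idle_days):
    """Return names of datasets not read within max_idle_days (illustrative only)."""
    stale = []
    for entry in catalog:
        last_read = entry["last_read"]
        # A dataset that has never been read is treated as stale.
        idle_days = None if last_read is None else (today - last_read).days
        if idle_days is None or idle_days > max_idle_days:
            stale.append((idle_days, entry["name"]))
    # Never-read datasets come first, then the longest-idle ones.
    stale.sort(key=lambda item: (item[0] is not None, -(item[0] or 0)))
    return [name for _, name in stale]

# Hypothetical catalog entries.
catalog = [
    {"name": "sales_2015", "last_read": date(2016, 3, 1), "size_gb": 800},
    {"name": "web_logs_raw", "last_read": None, "size_gb": 4200},
    {"name": "hr_current", "last_read": date(2024, 1, 20), "size_gb": 15},
]
stale = find_stale_datasets(catalog, today=date(2024, 2, 1), max_idle_days=730)
# Storage that a review of the stale datasets could potentially reclaim.
stale_gb = sum(e["size_gb"] for e in catalog if e["name"] in stale)
print(stale, stale_gb)
```

   <p>Flagging a dataset this way only prompts a review. Whether historical data is archived or deleted remains a business and governance decision.</p>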
 
   <p class="expand mw-collapsible-content">Storing increasingly massive amounts of data for an unlimited time will also lead to scalability and cost challenges. Scalability challenges are less of a risk in public cloud environments, but cost remains a factor. On-premises Data Lakes are more susceptible to cost challenges because their cluster nodes require all three dimensions of computing (storage, memory and processing). Organizations of all kinds generate massive amounts of data (including metadata), and the volume is increasing exponentially.</p>
 
   <p>Providing storage capacity for all this data (and future data) will be an ongoing challenge, and one that will require constant management. While Data Lakes can and will be stored on the cloud, SSC, as cloud broker for the GC, will need to provide the appropriate infrastructure and scalability to clients.</p>
   <p><b class="expand mw-collapsible-content">Advanced Users Required</b></p>
+
   <p><b class="expand mw-collapsible-content"><u>Advanced Users Required</u></b></p>
 
   <p class="expand mw-collapsible-content">Data Lakes are not a platform to be explored by everyone. Data Lakes present an unrefined view of data that usually only the most highly skilled analysts can explore and refine independently of a formal system of record such as a data warehouse. </p>
 
   <p class="expand mw-collapsible-content">Not everyone in an organization is data-literate enough to derive value from large amounts of raw or uncurated data. In reality, only a handful of staff are skilled enough to navigate a Data Lake. Since Data Lakes store raw data, their business value is entirely determined by the skills of Data Lake users. These skills are often lacking in an organization.</p>
   <p><b>Data Security</b></p>
+
   <p><b><u>Data Security</u></b></p>
 
   <p class="inline">Data in a Data Lake lacks the standard security protections of a relational database management system or an enterprise database. In practice, this often means that the data is unencrypted and lacks access control. </p><p class="expand inline mw-collapsible-content">Security is not binary. There are varying degrees of security (unclassified, secret, top secret, etc.), each of which requires a different approach. This will inevitably present challenges for the successful use of data from Data Lakes. To combat this, organizations will have to embrace a new security framework that is compatible with Data Lakes and Data Scientists.</p>
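   <p>To illustrate why security is not binary, the sketch below models the classification levels as an ordered scale and lets a user read only datasets at or below their clearance. The level names come from the text above; the <code>Level</code> enum, the <code>can_read</code> rule, and the dataset names are simplifying assumptions, and a real security framework would involve much more (need-to-know, encryption, auditing).</p>

```python
from enum import IntEnum

class Level(IntEnum):
    """Ordered classification levels (assumed ordering, for illustration)."""
    UNCLASSIFIED = 0
    SECRET = 1
    TOP_SECRET = 2

def can_read(user_clearance, dataset_level):
    # A user may read data classified at or below their own clearance.
    return user_clearance >= dataset_level

def readable_datasets(user_clearance, datasets):
    # Keep only the datasets this clearance level is allowed to read.
    return [name for name, level in datasets.items()
            if can_read(user_clearance, level)]

# Hypothetical datasets tagged with a classification level.
datasets = {
    "open_stats": Level.UNCLASSIFIED,
    "case_files": Level.SECRET,
    "intel_reports": Level.TOP_SECRET,
}
visible = readable_datasets(Level.SECRET, datasets)
print(visible)
```

   <p>Tagging every dataset with a level when it enters the Data Lake is what makes a check like this possible; untagged raw data cannot be filtered this way.</p>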
 
   <h4>Considerations</h4>
 
