Changes

no edit summary
Line 23: Line 23:  
         </th>
 
         </th>
 
       </tr>
 
       </tr>
       <tr><td colspan="2" class="logo">[[KubernetesIMG.png|200px]]</td></tr>
+
       <tr><td colspan="2" class="logo">[[File:Technology_Trends_-_Datalakes_logo.png|200px]]</td></tr>
 
       <tr>
 
       <tr>
 
         <th>Status</th>
 
         <th>Status</th>
Line 30: Line 30:  
       <tr>
 
       <tr>
 
         <th>Initial release</th>
 
         <th>Initial release</th>
         <td>May 5, 2019</td>
+
         <td>July 12, 2019</td>
 
       </tr>
 
       </tr>
 
       <tr>
 
       <tr>
 
         <th>Latest version</th>
 
         <th>Latest version</th>
         <td>July 18, 2019</td>
+
         <td>February 17, 2020</td>
 
       </tr>
 
       </tr>
 
       <tr>
 
       <tr>
 
         <th>Official publication</th>
 
         <th>Official publication</th>
         <td>[[Media:EN_-_Kubernetes_v0.1_EN.pdf|Kubernetes.pdf]]</td>
+
         <td>[[Media:EN_-_Technology_Trends_-_Datalakes.pdf|Datalakes.pdf]]</td>
 
       </tr>
 
       </tr>
 
       <tr><td colspan="2" class="disclaimer"><table><tr>
 
       <tr><td colspan="2" class="disclaimer"><table><tr>
Line 55: Line 55:  
   <h2>Business Brief</h2>
 
   <h2>Business Brief</h2>
 
   <p>In an ever-increasing hyperconnected world, corporations and businesses are struggling to deal with the responsibilities of storage, management and quick availability of raw data. To break these data challenges down further:</p>
 
   <p>In an ever-increasing hyperconnected world, corporations and businesses are struggling to deal with the responsibilities of storage, management and quick availability of raw data. To break these data challenges down further:</p>
   <ul class="expand mw-collapsible-content">
+
   <ul>
 
     <li>Data comes in many different structures.
 
     <li>Data comes in many different structures.
 
       <ul>
 
       <ul>
Line 98: Line 98:  
<p>To summarize, usecases for Data Lakes are still being discovered. Cloud providers are making it easier to procure Data Lakes and today Data Lakes are primarily used by Research Institutions, Financial Services, Telecom, Media, Retail, Manufacturing, Healthcare, Pharma, Oi l& Gas and Governments.
 
<p>To summarize, usecases for Data Lakes are still being discovered. Cloud providers are making it easier to procure Data Lakes and today Data Lakes are primarily used by Research Institutions, Financial Services, Telecom, Media, Retail, Manufacturing, Healthcare, Pharma, Oi l& Gas and Governments.
 
</p>
 
</p>
  −
      
   <h2>Technology Brief</h2>
 
   <h2>Technology Brief</h2>
Line 116: Line 114:  
     <li><p><b>Distributed processing </b>capabilities associated with a logical data warehouse.</p></li>
 
     <li><p><b>Distributed processing </b>capabilities associated with a logical data warehouse.</p></li>
 
   </ul>
 
   </ul>
   <b class="expand mw-collapsible-content">How TD Bank Made Its Data Lake More Usa</b>
+
   <p class="expand mw-collapsible-content"><b>[https://www.datanami.com/2017/10/03/td-bank-made-data-lake-usable How TD Bank Made Its Data Lake More Usa]</b></p>
  <p class="expand mw-collapsible-content">[[https://www.datanami.com/2017/10/03/td-bank-made-data-lake-usable/]]<br>Toronto-Dominion Bank (TD Bank) is one of the largest banks in North America, with 85,000 employees, more than 2,400 locations between Canada and the United States, and assets nearing $1 trillion. In 2014, the company decided to standardize how it warehouses data for various business intelligence and regulatory reporting functions. The company purchased a Hadoop distribution and set off to build a large cluster that could function as a centralized lake to store data originating from a variety of departments.</p>
+
  <p class="expand mw-collapsible-content">Toronto-Dominion Bank (TD Bank) is one of the largest banks in North America, with 85,000 employees, more than 2,400 locations between Canada and the United States, and assets nearing $1 trillion. In 2014, the company decided to standardize how it warehouses data for various business intelligence and regulatory reporting functions. The company purchased a Hadoop distribution and set off to build a large cluster that could function as a centralized lake to store data originating from a variety of departments.</p>
    
   <h2>Canadian Government Use</h2>
 
   <h2>Canadian Government Use</h2>
<p>In 2019, the Treasury Board of Canada Secretariat (TBS), partnered with Shared Services Canada and other departments, to identify a business lead to develop a Data Lake (a repository of raw data) service strategy so that the GC can take advantage of big data and market innovation to foster better analytics and promote horizontal data-sharing. </p>
+
<p>In 2019, the Treasury Board of Canada Secretariat (TBS), partnered with Shared Services Canada and other departments, to identify a business lead to develop a Data Lake (a repository of raw data) service strategy so that the GC can take advantage of big data and market innovation to foster better analytics and promote horizontal data-sharing.<ref>Treasury Board of Canada Secretariat. (March 29th, 2019). Digital Operations Strategic Plan: 2018-2022. Government of Canada. Treasury Board of Canada Secretariat. Retrieved 26-May-2019 from: <i>[https://www.canada.ca/en/government/system/digital-government/digital-operations-strategic-plan-2018-2022.html] </i></ref> </p>
<p class="expand mw-collapsible-content">Big data is the technology that stores and processes data and information in datasets that are so large or complex that traditional data processing applications can’t analyze them. Big data can make available almost limitless amounts of information, improving data-driven decision-making and expanding open data initiatives. Business intelligence involves creating, aggregating, analyzing and visualizing data to inform and facilitate business management and strategy. TBS, working with departments, will lead the development of requirements for an enterprise analytics platform.</p>
+
<p class="expand mw-collapsible-content">Big data is the technology that stores and processes data and information in datasets that are so large or complex that traditional data processing applications can’t analyze them. Big data can make available almost limitless amounts of information, improving data-driven decision-making and expanding open data initiatives. Business intelligence involves creating, aggregating, analyzing and visualizing data to inform and facilitate business management and strategy. TBS, working with departments, will lead the development of requirements for an enterprise analytics platform.<ref>Ibid.<i></i></ref></p>
 
<p>Data Lake development in the GC is a more recent initiative. This is mainly due to the GC focussing resources on the implementation of cloud initiatives. However, there are some GC departments engaged in developing Data Lake environments in tandem to cloud initiatives.</p>
 
<p>Data Lake development in the GC is a more recent initiative. This is mainly due to the GC focussing resources on the implementation of cloud initiatives. However, there are some GC departments engaged in developing Data Lake environments in tandem to cloud initiatives.</p>
<p class="expand mw-collapsible-content">Notably, the Employment and Social Development Canada (ESDC) is preparing the installment of multiple Data Lakes in order to enable a Data Lake Ecosystem and Data Analytics and Machine Learning toolset. This will enable ESDC to share information horizontally both effectively and safely, while enabling a wide variety of data analytics capabilities. ESDC aims to maintain current data and analytics capabilities up-to-date while exploring new ones to mitigate gaps and continuously evolve our services to meet client’s needs. </p>
+
<p class="expand mw-collapsible-content">Notably, the Employment and Social Development Canada (ESDC) is preparing the installment of multiple Data Lakes in order to enable a Data Lake Ecosystem and Data Analytics and Machine Learning toolset. This will enable ESDC to share information horizontally both effectively and safely, while enabling a wide variety of data analytics capabilities. ESDC aims to maintain current data and analytics capabilities up-to-date while exploring new ones to mitigate gaps and continuously evolve our services to meet client’s needs.<ref>Brisson, Yannick, and Craig, Sheila. (November, 2018). ESDC Data Lake – Implementation Strategy and Roadmap Update. Government of Canada. Employment and Social Development Canada – Data and Analytics Services. Presentation. Last Modified on 2019-04-26 15:45. Retrieved 07-May-2019 from GCDocs<i>[https://gcdocs.gc.ca/ssc-spc/llisapi.dll?func=ll&objaction=overview&objid=36624914 ]</i></ref> </p>
 
   <h2>Implications for Government Agencies</h2>
 
   <h2>Implications for Government Agencies</h2>
 
   <h3>Shared Services Canada (SSC)</h3>
 
   <h3>Shared Services Canada (SSC)</h3>
 
   <h4>Value Proposition</h4>
 
   <h4>Value Proposition</h4>
 
   <p class="expand mw-collapsible-content">There are three common value propositions for pursuing Data Lakes. 1) It can provide an easy and accessible way to obtain data faster; 2) It can create a singular inflow point of data to help connect and merge information silos in an organization; and 3) It can provide an experimental environment for experienced data scientists to enable new analytical insights.</p>
 
   <p class="expand mw-collapsible-content">There are three common value propositions for pursuing Data Lakes. 1) It can provide an easy and accessible way to obtain data faster; 2) It can create a singular inflow point of data to help connect and merge information silos in an organization; and 3) It can provide an experimental environment for experienced data scientists to enable new analytical insights.</p>
 +
  <p class="inline-spacer"> </p>
 
   <p class="inline">Data Lakes can provide data to consumers more quickly by offering data in a more raw and easily accessible form. Data is stored in its native form with little to no processing, it is optimized to store vast amounts of data in their native formats. By allowing the data to remain in its native format, a much timelier stream of data is available for unlimited queries and analysis. A Data Lake can help data consumers bypass strict data retrieval and data structured applications such as a data warehouse and/or data mart. This has the effect of improving a business’ data flexibility.</p><p class="expand inline mw-collapsible-content">Some companies have in fact used Data Lakes to replace existing warehousing environments where implementing a new data warehouse is more cost prohibitive. A Data Lake can contain unrefined data, this is helpful when either a business data structure is unknown, or when a data consumer requires access to the data quickly. </p>
 
   <p class="inline">Data Lakes can provide data to consumers more quickly by offering data in a more raw and easily accessible form. Data is stored in its native form with little to no processing, it is optimized to store vast amounts of data in their native formats. By allowing the data to remain in its native format, a much timelier stream of data is available for unlimited queries and analysis. A Data Lake can help data consumers bypass strict data retrieval and data structured applications such as a data warehouse and/or data mart. This has the effect of improving a business’ data flexibility.</p><p class="expand inline mw-collapsible-content">Some companies have in fact used Data Lakes to replace existing warehousing environments where implementing a new data warehouse is more cost prohibitive. A Data Lake can contain unrefined data, this is helpful when either a business data structure is unknown, or when a data consumer requires access to the data quickly. </p>
  <p></p>
+
<p class="inline-spacer"> </p>
  <p class="inline">A Data Lake is not a single source of truth. A Data Lake is a central location in which data converges from all data sources and is stored, regardless of the data formatting. </p><p class="expand inline mw-collapsible-content">As a singular point for the inflow of data, sections of a business can pool their information together in the Data Lake and increase the sharing of information with other parts of the organization. In this way everyone in the organization has access to the data. A Data Lake can increase the horizontal data sharing within an organization by creating this singular data inflow point. Using a variety of storage and processing tools analysts can extract data value quickly in order to inform key business decisions.</p>
+
<p class="inline">A Data Lake is not a single source of truth. A Data Lake is a central location in which data converges from all data sources and is stored, regardless of the data formatting. </p><p class="expand inline mw-collapsible-content">As a singular point for the inflow of data, sections of a business can pool their information together in the Data Lake and increase the sharing of information with other parts of the organization. In this way everyone in the organization has access to the data. A Data Lake can increase the horizontal data sharing within an organization by creating this singular data inflow point. Using a variety of storage and processing tools analysts can extract data value quickly in order to inform key business decisions.</p>
   <p></p>
+
   <p class="inline-spacer">   </p>
 
   <p class="expand inline mw-collapsible-content">A Data Lake is optimized for exploration and provides an experimental environment for experienced data scientists to uncover new insights from data. Analysts can overlay context on the data to extract value. All organizations want to increase analytics and operational agility.</p><p class="inline">The Data Lake architectural approach can store large volumes of data, this can be a way in which cross-cutting teams can pool their data in a central location and by complementing their systems of record with systems of insight. </p><p class="expand inline mw-collapsible-content">Data Lakes present the most potential benefits for experienced and competant data scientists. </p><p class="inline">Having structured, unstructured and semistructured data, usually in the same data set, can contain business, predictive, and prescriptive insights previously not possible from a structured platform as observed in data warehouses and data marts.</p>
 
   <p class="expand inline mw-collapsible-content">A Data Lake is optimized for exploration and provides an experimental environment for experienced data scientists to uncover new insights from data. Analysts can overlay context on the data to extract value. All organizations want to increase analytics and operational agility.</p><p class="inline">The Data Lake architectural approach can store large volumes of data, this can be a way in which cross-cutting teams can pool their data in a central location and by complementing their systems of record with systems of insight. </p><p class="expand inline mw-collapsible-content">Data Lakes present the most potential benefits for experienced and competant data scientists. </p><p class="inline">Having structured, unstructured and semistructured data, usually in the same data set, can contain business, predictive, and prescriptive insights previously not possible from a structured platform as observed in data warehouses and data marts.</p>
    
   <h4>Challenges</h4>
 
   <h4>Challenges</h4>
   <p>The greatest challenge in regards to Kubernetes is its complexity. However, security, storage and networking, maturity, and competing enterprise transformation priorities are also challenges facing the Kubernetes technology.</p><br><b>Kubernetes Complexity and Analyst Experience</b>
+
   <p>Although Data Lake technology has many benefits for organizations dealing with big data it has its own challenges. For example:</p>
   <p>There is the challenge of a lack of organizational and analyst experience with container management and in using Kubernetes. Managing, updating, and changing a Kubernetes cluster can be operationally complex, more so if the analysts have a lack of experience. The system itself does provide a solid base of infrastructure for a Platform as a Service (PaaS) framework, which can reduce the complexity for developers. However, testing within a Kubernetes environment is still a complex task. Although its use cases in testing are well noted, testing several moving parts of an infrastructure to determine proper application functionality is still a more difficult endeavour <ref>Clayton, T. and Watson, R. (2018). Using Kubernetes to Orchestrate Container-Based Cloud and Microservices Applications. [online] Gartner.com. Available at: <i>[https://www.gartner.com/doc/3873073/using-kubernetes-orchestrate-containerbased-cloud]</i></ref>. This means a lot of new learning will be needed for operations teams developing and managing Kubernetes infrastructure. The larger the company, the more likely the Kubernetes user is to face container challenges<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. </p><br><b>Security</b>
+
  <p><b><u>Data Governance and Semantic Issues</u></b></p>
   <p>In a distributed, highly scalable environment, traditional and typical security patterns will not cover all threats. Security will have to be aligned for containers and in the context of Kubernetes. It is critical for operations teams to understand Kubernetes security in terms of containers, deployment, and network security. Security perimeters are porous, containers must be secured at the node level, but also through the image and registry. Security practices in the context of various deployment models will be a persistent challenge<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. </p><br><b>Storage & Networking</b>
+
   <p class="expand inline mw-collapsible-content">The biggest challenge for Data Lakes is to resolve assorted data governance requirements in a single centralized data platform. Data Lakes fail mostly when they lack governance, self-disciplined users, and a rational data flow.</p><p class="inline">Often, Data Lake implementations are focused on storing data instead of managing the data. Data Lakes are not optimized for semantic enforcement or consistency. They are made for semantic flexibility, to allow anyone to provide context to data if they have the skills to do so. </p>
   <p>Storage and networking technologies are pillars of data center infrastructure, but were designed originally for client/server and virtualized environments. Container technologies are leading companies to rethink how storage and networking technologies function and operate<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. Architectures are becoming more application-oriented and storage does not necessarily live on the same machine as the application or its services. Larger companies tend to run more containers, and to do so in scaled-out production environments requires new approaches to infrastructure<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. </p>
+
  <p>Putting data in the same place does not remove it’s ambiguity or meaning. Data Lakes provide unconstrained, “no compromises” storage model environment without the data governance assurances common to data warehouses or data marts. Proper meta data is essential for a Data Lake, without appropriate meta data the Data Lake will not work as intended. It is beneficial to think of meta data as the fish finder in the Data Lake.</p>
   <p>Some legacy systems can run containers and only sometimes can VMs can be replaced by containers. There may be significant engineering consequences to existing legacy systems if containerization and Kubernetes is implemented in a legacy system not designed to handle that change. Some Legacy systems may require refactoring and making it more suitable for containerization. Some pieces of a system may be able to be broken off and containerized. In general, anything facing the internet should be run in containers.</p><br><b>Maturity</b>
+
  <p><b class="expand mw-collapsible-content"><u>Lack of Quality and Trust in Data</u></b></p>
   <p>Kubernetes maturity as a technology is still being tested by organizations. For now, Kubernetes is the market leader and the standardized means of orchestrating containers and deploying distributed applications. Google is the primary commercial organization behind Kubernetes; however they do not support Kubernetes as a software product. It offers a commercial managed Kubernetes service known as GKE but not as a software. This can be viewed as both a strength and a weakness. Without commercialization, the user is granted more flexibility with how Kubernetes can be implemented in their infrastructure; However, without a concrete set of standards of the services that Kubernetes can offer, there is a risk that Google’s continuous  support cannot be guaranteed. Its donation of Kubernetes code and intellectual property to the Cloud Native Computing Foundation does minimize this risk since there is still an organization enforcing the proper standards and verifying  services Kubernetes can offer moving forward <ref>Clayton, T. and Watson, R. (2018). Using Kubernetes to Orchestrate Container-Based Cloud and Microservices Applications. [online] Gartner.com. Available at: <i>[https://www.gartner.com/doc/3873073/using-kubernetes-orchestrate-containerbased-cloud]</i></ref>. It is also important to note that the organizational challenges that Kubernetes users face have been more dependent on the size of the organization using it.</p>
+
   <p class="expand mw-collapsible-content">Data quality and trust in the data is a perennial issue for many organizations. Although data discovery tools can apply Machine Learning across related datasets from multiple data sources to identify anomalies (incorrect values, missing values, duplicates and outdated data), quality and trustworthiness of data continue to be an issue for Data Lakes who can easily become data dumping grounds. Some data is more accurate than others. This can present a real problem for anyone using multiple data sets and making decisions based upon analysis conducted with data of varying degrees of quality.</p>
   <p>Kubernetes faces competition from other scheduler and orchestrator technologies, such as Docker Swarm and Mesosphere DC/OS. While Kubernetes is sometimes used to manage Docker containers, it also competes with the native clustering capabilities of Docker Swarm<ref>Rouse, Margaret, et al. (August 2017). Kubernetes. TechTarget Inc. 2019. Retrieved 16-May-2019 from: <i>[https://searchitoperations.techtarget.com/definition/Google-Kubernetes]</i></ref>.  However, Kubernetes can be run on a public cloud service or on-premises, is highly modular, open source, and has a vibrant community. Companies of all sizes are investing into it, and many cloud providers offer Kubernetes as a service<ref>Tsang, Daisy. (February 12th, 2018). Kubernetes vs. Docker: What Does It Really Mean? Sumo Logic. 2019. Retrieved 16-May-2019 from: <i>[https://www.sumologic.com/blog/kubernetes-vs-docker/ ]</i></ref>. </p><br><b class="expand mw-collapsible-content">Competing Enterprise Transformation Priorities</b>
+
  <p><b><u>Data Swamps, Performance, and Flexibility Challenges</u></b></p>
  <p class="expand mw-collapsible-content">The last challenge facing Kubernetes initiative development and implementation is its place in an organization’s IT transformation priority list. Often there are many higher priority initiatives that can take president over Kubernetes projects.</p>
+
   <p class="expand inline mw-collapsible-content">Data stored in Data Lakes can sometimes become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. </p><p class="inline">A Data Lake is not optimized for a high number of users or diverse and simultaneous workloads due to intensive query tasks. This can result in performance degradation and failures are common when running extractions, transformations, and loading tasks all at the same time. On-premises Data Lakes face other performance challenges in that they have a static configuration. </p>
 
+
  <p><b class="expand mw-collapsible-content"><u>Data Hoarding and Storage Capacity</u></b></p>
 
+
   <p class="expand mw-collapsible-content">Data stored in Data Lakes may actually never be used in production and stay unused indefinitely in the Data Lake. By storing massive amounts of historical data, the infinite Data Lake may skew analysis with data that is no longer relevant to the priorities of the business. In keeping the historical data the metadata describing it must be understood as well. This decreases the performance of the Data Lake by increasing the overall workload of employees to clean the datasets no longer in use for analysis.</p>
 +
   <p class="expand mw-collapsible-content">Storing increasingly massive amounts of data for an unlimited time will also lead to scalability and cost challenges. Scalability challenges are less of a risk in public cloud environments, but cost remains a factor. On-premises Data Lakes are more susceptible to cost challenges. This is because their cluster nodes require all three dimensions of computing (storage, memory and processing). Organizations of all kinds generate massive amounts of data (including meta data) and it is increasing exponentially.</p>
 +
  <p>The storage capacity of all this data (and future data) will be an ongoing challenge and one that will require constant management. While Data Lakes can and will be stored on the cloud, SSC as cloud broker for the GC will need to provide the appropriate infrastructure and scalability to clients.</p>
 +
  <p><b class="expand mw-collapsible-content"><u>Advanced Users Required</u></b></p>
 +
  <p class="expand mw-collapsible-content">Data Lakes are not a platform to be explored by everyone. Data Lakes present an unrefined view of data that usually only the most highly skilled analysts are able to explore and engage in data refinement independent of any other formal system-of-record such as a data warehouse. </p>
 +
   <p class="expand mw-collapsible-content">Not just anyone in an organization is data-literate enough to derive value from large amounts of raw or uncurated data. The reality is only a handful of staff are skilled enough to navigate a Data Lake. Since Data Lakes store raw data their business value is entirely determined by the skills of Data Lake users. These skills are often lacking in an organization.</p>
 +
  <p><b><u>Data Security</u></b></p>
 +
  <p class="inline"></p>Data in a Data Lake lacks standard security protection with a relational database management system or an enterprise database. In practice, this means that the data is unencrypted and lacks access control.<p class="expand inline mw-collapsible-content">. Security is not just a binary solution. We have varying degrees of security (unclassified, secret, top secret, etc.) and all of which require different approaches. This will inevitably present challenges with the successful use of data from Data Lakes.To combat this, organizations will have to embrace a new security framework to be compatable with Data Lakes and Data Scientists.</p>
 
   <h4>Considerations</h4>
 
   <h4>Considerations</h4>
<b>Strategic Resourcing and Network Planning</b>
+
  <p class="expand mw-collapsible-content">Shared Services Canada (SSC) has an excellent opportunity to capitalize on its mandate of providing data storage service to GC’s other departments. SSC, as the GC’s Service Provider, could potentially a centralized GC Data Lake and allow GC Data Scientists access to this central data using a single unified Data Lake interface. However, this is a project which should be implemented after cloud has been adopted and enterprise data centers have been migrated to in order to provide adequate infrastructure and scaling.</p>
<p>A strategic approach to Kubernetes investments will need to be developed to ensure opportunities are properly leveraged. The GC invests a significant portion of its annual budget on IT and supporting infrastructure. Without strategic Kubernetes direction the fragmented approaches to IT investments, coupled with rapid developing technology and disjointed business practices, can undermine effective and efficient delivery of GC programs and services<ref>Treasury Board of Canada Secretariat. December 3, 2018. Directive on Management of Information Technology. Treasury Board of Canada Secretariat. Government of Canada. Retrieved 27-Dec-2018 from: <i>[https://www.tbs-sct.gc.ca/pol/doc-eng.aspx?id=15249 ]</i></ref>. A clear vision and mandate for how Kubernetes will transform services, and what the end-state Kubernetes initiative is supposed to look like, is a prominent consideration. </p>
+
  <p class="inline">Data Lakes should not be confused for conventional databases although they both store information. A Data Lake will always underperform when tasked with the jobs of a conventional database. </p><p class="expand inline mw-collapsible-content">To combat this, SSC must create data architectures that define the proper application of Data Lakes. Too often, Data Lakes suffer from lack of foresight on what they're supposed to achieve. </p><p class="inline">Creating a Data Lake becomes the goal rather than achieving a strategic objective. </p><p class="expand inline mw-collapsible-content"></p>
<p>SSC should consider defining a network strategy for Kubernetes adoption. Multiple factors should be taken into account, including the amount of resources, funding, and expertise that will be required for the development and experimentation with Kubernetes technologies. Calculation of resource requirements including CPU, memory, storage, etc. at the start of Kubernetes projects is imperative. Considerations include whether or not an in-house Kubernetes solution is required or if a solution can be procured. Other strategy considerations include analyzing different orchestration approaches for different application use cases.</p>
+
  <p class="expand mw-collapsible-content">Shared Services Canada (SSC) should consider designing Data Lake infrastructure around Service-Level Agreements (SLA) to keep Data Lake efforts on track. This includes ensuring that SSC has established clear goals for Data Lakes prior to deployment. </p>
<b>Complexity and Skills Gap</b>
+
  <p class="expand mw-collapsible-content">SSC should also consider building an expert special group focussed on advanced analytics and experimental data trend discovery in Data Lakes. While the fundamental assumption behind the Data Lake concept is that everyone accessing a Data Lake is moderately to highly skilled at data manipulation and analysis, the reality is most are not. SSC should consider significant investment in training employees necessary skills, such as Data Science, Artificial Intelligence, Machine Learning, or Data Engineering.</p>
<p>Kubernetes is a good technology and the de facto standard for orchestrating containers, and containers are the future of modern software delivery. But it is notoriously complex to manage for enterprise workloads, where Service Level Agreements (SLAs) are critical. The operational pain of managing production-grade Kubernetes is further complicated by the industry-wide talent scarcity and skills gap. Most organizations today struggle to hire Kubernetes experts, and even these “experts” lack advanced Kubernetes experience to ensure smooth operations at scale. SSC will need to be cautious in implementing Kubernetes and having the right staff experienced and comfortable in its use.</p>
+
  <p>SSC should be cognisant that there are significant overinflated expectations revolving around Data Lakes. Inflated expectations lead to vague and ambiguous use cases and increased chances of catastrophic failures. As a Service Provider, SSC must be strict in establishing clear goals for Data Lake provision efforts before deployment. SSC, should be wary of attempts to replace strategy development with infrastructure. A Data Lake can be a technology component that supports a data and analytics strategy, but it cannot replace that strategy.</p>
<b>Customization and Integration Still Required</b>
+
  <p class="expand mw-collapsible-content">SSC should be concerned with the provision and running of the infrastructure, the departments themselves are responsible for the data they put in the Data Lake. However, as a Service Provider, SSC should monitor the Data Lake with regards to data governance, data lifecycle for data hygiene, and what is happening in the Data Lake overall. Depending on technology, SSC will need to be very clear on how to monitor activities in the Data Lakes it provides to the GC. </p>
<p>Kubernetes technology and ecosystem are evolving rapidly, because of its relatively new state, it is hard to find packaged solutions with complete out-of-the-box support for complex, large-scale enterprise scenarios. As a large and sophisticated enterprise organization, SSC will need to devote significant resources on customization and training. Enterprise Architecture pros will need to focus on the whole architecture of cloud-native applications as well as keep a close watch on technology evolution and industry. </p>
+
  <p class="expand mw-collapsible-content">SSC should consider a Data Lake implementation project as a way to introduce or reinvigorate a data management program by positioning data management capabilities as a prerequisite for a
<p>Implementation usually takes longer than expected, however the consensus in the New Stack’s Kubernetes User Experience Survey is that Kubernetes reduces code deployment times, and increases the frequency of those deployments<ref>Williams, Alex, et al. The State of the Kubernetes Ecosystem. The New Stack. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/state-of-kubernetes-ecosystem/ ]</i></ref>. However, in the short run, the implementation phase does consume more human resources. Additionally, implementation takes longer than expected. The consensus is that Kubernetes reduces code deployment times, and increases the frequency of those deployments. However, in the short run, the implementation phase does consume more human resources.</p>
+
successful Data Lake. Data will need to be qualified before it hits the data lake, this can and should be done in a system of record first. In this way the data can be organizedto fit into the Data Lake implementation.
<b>Pilot Small and Scale Success</b>
+
</p>
<p>SSC may wish to consider evaluating the current Service Catalogue in order to determine where Kubernetes can be leveraged first to improve efficiencies, reduce costs, and reduce administrative burdens of existing services as well as how a new Kubernetes service could be delivered on a consistent basis. Any new procurements of devices or platforms should have high market value and can be on-boarded easily onto the GC network. SSC should avoid applying in-house Kubernetes for production mission-critical apps. Failure of in-house deployments is high and thus should be avoided. SSC should pilot and establish a Kubernetes test cluster. With all new cloud-based technologies, piloting is preferred. Focus should first be on a narrow set of objectives and a single application scenario to stand up a test cluster.</p>
+
  <p class="expand mw-collapsible-content">SSC should create policies on how data is managed and cleaned in the Data Lake. Automated data governance technologies should be added to support advanced analytics. Standardizing on a specific type of governance tool is an issue which must be resolved. Additionally, planning for effective metadata management, considering metadata discovery, cataloguing and enterprise metadata management applied to Data Lake implementation is vital. Rigorous application of data discipline and data hygiene is needed. To combat this, SSC should use data management tools and create policies on how data is managed and cleaned in the Data Lake. The majority of Data Lake analysts will prefer to work with clean, enriched, and trusted data. However, data quality is relative to the task at hand. Lowquality data may be acceptable for low-impact analysis or distant forecasting, but unacceptable for tactical or high-impact analysis. SSC assessments should take this into account.</p>
<b>Implement Robust Monitoring, Logging, and Audit Practices and Tools</b>
+
  <p>Design Data Lakes with the elements necessary to deliver reliable analytical results to a variety of data consumers. The goal is to increase cross-business usage in order to deliver advanced analytical insights. Build Data Lakes for specific business units or analytics applications, rather than try to implement some vague notion of a single enterprise Data Lake. However, alternative architectures, like data hubs, are often better fits for sharing data within an organization.</p>
<p>Monitoring provides visibility and detailed metrics of Kubernetes infrastructure. This includes granular metrics on usage and performance across all cloud providers or private data centers, regions, servers, networks, storage, and individual VMs or containers. Improving data center efficiency and utilization on both on-premises and public cloud resources is the goal. Additionally, logging is a complementary function and required capability for effective monitoring is also a goal. Logging ensures that logs at every layer of the architecture are all captured for analysis, troubleshooting and diagnosis. Centralized, distributed, log management and visualization is a key capability<ref>Chemitiganti, Vamsi, and Fray, Peter. (February 20th, 2019). 7 Key Considerations for Kubernetes in Production. The  New Stack. 2019. Retrieved 16-May-2019 from: <i>[https://thenewstack.io/7-key-considerations-for-kubernetes-in-production/]</i></ref>. Lastly, routine auditing, no matter the checks and balances put in place, will cover topics that normal monitoring will not cover. Traditionally, auditing is as a manual process, but the automated tooling in the Kubernetes space is quickly improving.</p>
+
<h2>References</h2>
<b>Security</b>
+
<div style="display: none">
<p>Security is a critical part of cloud native applications and Kubernetes is no exception. Security is a constant throughout the container lifecycle and it is required throughout the design, development, DevOps, and infrastructure choices for container-based applications. A range of technology choices are available to cover various areas such as application-level security and the security of the container and infrastructure itself. Different tools that provide certification and security for what goes inside the container itself (such as image registry, image signing, packaging), Common Vulnerability Exposures/Enumeration (CVE) scans, and more<ref>Chemitiganti, Vamsi, and Fray, Peter. (February 20th, 2019). 7 Key Considerations for Kubernetes in Production. The  New Stack. 2019. Retrieved 16-May-2019 from: <i>[https://thenewstack.io/7-key-considerations-for-kubernetes-in-production/]</i></ref>.. SSC will need to ensure appropriate security measures are used with any new Kubernetes initiatives, including the contents of the containers being orchestrated.</p>
+
<ref>Dennis, A. L. (2018, October 15). Data Lakes 101: An Overview. Retrieved from <i>[https://www.dataversity.net/data-lakes-101-overview/#]</i></ref>
 
+
<ref>Marvin, R., Marvin, R., & Marvin, R. (2016, August 22). Data Lakes, Explained. Retrieved from <i>[ https://www.pcmag.com/article/347020/data-lakes-explained]</i></ref>
  <h2>References</h2>
+
<ref>The Data Lake journey. (2014, March 15). Retrieved from <i>[https://hortonworks.com/blog/enterprise-hadoop-journey-data-lake/]</i></ref>
 +
<ref>Google File System. (2019, July 14). Retrieved from <i>[https://en.wikipedia.org/wiki/Google_File_System]</i></ref>
 +
<ref>Coates, M. (2016, October 02). Data Lake Use Cases and Planning Considerations. Retrieved from <i>[https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning]</i></ref>
 +
<ref>Bhalchandra, V. (2018, July 23). Six reasons to think twice about your data lake strategy. Retrieved from <i>[https://dataconomy.com/2018/07/six-reasons-to-think-twice-about-your-data-lake-strategy/]</i></ref>
 +
<ref>Data Lake Expectations: Why Data Lakes Fail. (2018, September 20). Retrieved from <i>[https://www.arcadiadata.com/blog/the-top-six-reasons-data-lakes-have-failed-to-live-up-to-expectations/]</i></ref>
 +
<ref>Data Lake: AWS Solutions. (n.d.). Retrieved from <i>[https://aws.amazon.com/solutions/data-lake-solution/]</i></ref>
 +
</div>
    
</div>
 
</div>
Line 173: Line 185:     
   #firstHeading::after{
 
   #firstHeading::after{
   content:"Kubernetes";
+
   content:"Datalakes";
 
   }
 
   }
  
262

edits