Changes

no edit summary
Line 80: Line 80:  
   </ul>
 
   </ul>
   −
   <p></p>
+
   <p>In an effort to resolve these data challenges, data-oriented companies developed a new way of managing data and, with it, a new data storage mechanism called a Data Lake.</p>
 +
  <p class="expand mw-collapsible-content">Data Lakes are characterized as: </p>
 +
  <ul class="expand mw-collapsible-content">
 +
    <li>Collect Everything<ul>
 +
      <li>A Data Lake contains all data; raw sources over extended periods of time as well as any processed data.</li>
 +
    </ul></li>
 +
    <li>Dive in anywhere<ul>
 +
      <li>A Data Lake enables users across multiple business units to refine, explore and enrich data on their terms.</li>
 +
    </ul></li>
 +
    <li>Flexible Access<ul>
 +
      <li>A Data Lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.</li>
 +
    </ul></li>
 +
  </ul>
 +
 
 +
  <p>Data Lakes are essentially a technology platform for holding data. Their value to the business is only realized when applying data science skills to the lake. </p>
 +
 
 +
<p>To summarize, use cases for Data Lakes are still being discovered. Cloud providers are making it easier to procure Data Lakes, and today Data Lakes are primarily used by Research Institutions, Financial Services, Telecom, Media, Retail, Manufacturing, Healthcare, Pharma, Oil & Gas and Governments.
 +
</p>
          
   <h2>Technology Brief</h2>
 
<p>The Kubernetes cluster or deployment can be broken down into several components. The Kubernetes “master” is the machine in charge of managing other “node” machines. A “node” is the machine in charge of actually running tasks fed to it by the user or the “master”. The master and nodes can be either physical or virtual machines. In each Kubernetes cluster, there is one master and multiple node machines. The main goal of Kubernetes is to achieve “Desired State Management”. The “master” is fed a specific configuration through the RESTful API that it exposes to the user, and the “master” is then responsible for running this configuration across its set of “nodes”. The nodes can be thought of as hosts of containers. They communicate with the “master” through an agent on each node, the “Kubelet” process. To establish a specific configuration in Kubernetes, the “master” is fed a deployment file with the “.yaml” extension. This file contains a variety of configuration information, including “Pods” and “replicas”. A Pod in Kubernetes can be described as a logical collection of containers that are managed as a single application. Resources can be shared within a Pod; these resources include shared storage (Volumes), a unique cluster IP address, and information about how to run each container. A Pod can be thought of as the basic unit of the Kubernetes object model: it represents the deployment of a single instance of an application in Kubernetes <ref>Kubernetes.io. (2018). Kubernetes Basics - Kubernetes. [online] Available at: <i>[https://kubernetes.io/docs/tutorials/kubernetes-basics/] </i></ref>. A Pod can encapsulate one or more application containers. Two models exist for how Pods are deployed within a cluster. In the “one-container-per-Pod” model, a single Pod is associated with a single container. 
There can also be multiple containers that run within a single Pod, where these containers may need to communicate with one another as they share resources. In either model, the Pod can be thought of as a wrapper around the application containers. Kubernetes manages the Pod instances rather than managing the containers directly. The Pods are run on the node machines to perform tasks. Replicas are simply instances of the Pods. Within the “.yaml” deployment file, specifications instruct the “master” machine how many instances/replicas of each Pod to run, which is handled by a replication controller <ref>Kubernetes.io. (2018). Kubernetes Basics - Kubernetes. [online] Available at: <i>[https://kubernetes.io/docs/tutorials/kubernetes-basics/] </i></ref>. When a node dies or a running Pod experiences an unexpected termination, the replication controller will take note and take care of this by creating the appropriate number of Pods <ref>Kubernetes.io. (2018). Kubernetes Basics - Kubernetes. [online] Available at: <i>[https://kubernetes.io/docs/tutorials/kubernetes-basics/] </i></ref>.</p>
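<p>The replica handling described above can be illustrated with a small Python sketch. It is a toy model only, not the Kubernetes API: the names <code>ReplicaSpec</code>, <code>Cluster</code> and <code>reconcile</code> are hypothetical and exist purely for illustration. The idea shown is that the controller compares the desired number of replicas with what is actually running and starts or stops Pods until the two match.</p>

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ReplicaSpec:
    """Hypothetical stand-in for the replica settings in a deployment file."""
    pod_name: str
    replicas: int


@dataclass
class Cluster:
    """Toy cluster state: just the names of the Pods currently running."""
    running: list = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def start_pod(self, base_name):
        # Give every new Pod a unique, never-reused suffix.
        self.running.append(f"{base_name}-{next(self._ids)}")

    def stop_pod(self):
        self.running.pop()


def reconcile(spec, cluster):
    """Create or remove Pods until the running count matches the spec."""
    while len(cluster.running) < spec.replicas:
        cluster.start_pod(spec.pod_name)
    while len(cluster.running) > spec.replicas:
        cluster.stop_pod()
    return len(cluster.running)


spec = ReplicaSpec(pod_name="web", replicas=3)
cluster = Cluster()
reconcile(spec, cluster)          # initial rollout: three Pods are started
cluster.running.remove("web-1")   # simulate a Pod terminating unexpectedly
reconcile(spec, cluster)          # a replacement Pod is created
print(cluster.running)
```

<p>In this sketch the replacement Pod receives a fresh name rather than reusing the old one, which mirrors the idea that Pods are replaceable instances rather than long-lived machines.</p>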
+
  <p class="expand mw-collapsible-content">The most popular implementation of a Data Lake is through the open source platform called Apache Hadoop. Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. Hadoop's design was inspired by papers from researchers at Google describing the Google File System and MapReduce, which Google built to handle the indexing of websites on the Internet. </p>
 +
  <p>“A Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.”</p>
 +
  <p class="expand mw-collapsible-content">Data can flow into the Data Lake by either batch processing or real-time processing of streaming data. Additionally, data itself is no longer restrained by initial schema decisions and can be exploited more freely by the enterprise. Rising above this repository is a set of capabilities that allow IT to provide Data and Analytics as a Service (DAaaS), in a supply-demand model. IT takes the role of the data provider (supplier), while business users (data scientists, business analysts) are consumers.</p>
 +
  <p>The DAaaS model enables users to self-serve their data and analytic needs. Users browse the lake’s data catalog (a Datapedia) to find and select the available data and fill a metaphorical “shopping cart” (effectively an analytics sandbox) with data to work with. Once access is provisioned, users can use the analytics tools of their choice to develop models and gain insights. Subsequently, users can publish analytical models or push refined or transformed data back into the Data Lake to share with the larger community.</p>
 +
  <p class="expand mw-collapsible-content">Although provisioning an analytic sandbox is a primary use, the Data Lake also has other applications. For example, the Data Lake can be used to ingest raw data, curate the data, and apply Extract-Transform-Load (ETL). This data can then be loaded to an Enterprise Data Warehouse. To take advantage of the flexibility provided by the Data Lake, organizations need to customize and configure the Data Lake to their specific requirements and domains.</p>
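<p>As a rough illustration of the ingest, curate and transform steps mentioned above, the following Python sketch keeps raw records exactly as they arrive and only applies structure when preparing rows for a warehouse. The record layout, the field names and the helper functions <code>curate</code> and <code>transform</code> are assumptions made for this example, not part of any particular Data Lake product.</p>

```python
import json

# Hypothetical raw records as they might land in the lake, stored as-is.
raw_zone = [
    '{"device": "pump-1", "temp_c": "21.5", "ts": "2019-05-16T10:00:00"}',
    '{"device": "pump-2", "temp_c": null, "ts": "2019-05-16T10:00:05"}',
    'not valid json',
]


def curate(raw_lines):
    """Parse raw lines and keep only records usable downstream."""
    curated = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the line remains in the raw zone; only the curated copy skips it
        if record.get("temp_c") is None:
            continue  # no reading to load
        curated.append(record)
    return curated


def transform(records):
    """Apply a schema at read time: cast fields to the types the warehouse expects."""
    return [
        {"device": r["device"], "temp_c": float(r["temp_c"]), "ts": r["ts"]}
        for r in records
    ]


warehouse_rows = transform(curate(raw_zone))
print(warehouse_rows)
```

<p>Note that <code>raw_zone</code> is never modified: the unusable lines stay available in their original form, which is the property that distinguishes a lake from a store with a schema enforced on write.</p>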
 
   <h2>Industry Usage</h2>
 
<p class="inline">Kubernetes is an open source system and many companies have begun to adopt it into their existing architecture as well as adapt it to their specific needs. It was originally developed by Google and was made an open source project in 2014. The Cloud Native Computing Foundation is a project of the Linux Foundation providing a community for different companies who are seeking to develop Kubernetes and other container orchestration projects. Several major cloud providers and platforms including Google Cloud Compute, HP Helion Cloud, RedHat Openshift, VMware Cloud, and Windows Azure all support the use of Kubernetes<ref>CENGN. (2018). CENGN and CloudOps Collaborate to Train Industry on Docker and Kubernetes.<i>[Available at: https://www.cengn.ca/docker-kubernetes-training-jan18/ ]</i></ref>. A survey performed by iDatalabs in 2017 found that 2,867 companies were using Kubernetes. These companies are generally located in the United States and are mostly in the computer software industry. Companies on the list employ between 50 and 200 people and generate 1M-100M in revenue per year. Some of the major companies on this list include GoDaddy Inc., Pivotal Software Inc., Globant SA, and Splunk Inc</p><p class="expand inline mw-collapsible-content">. Kubernetes holds approximately 8.6% of the market share within the virtualization management software category <ref>Idatalabs.com. (2018). Kubernetes commands 8.62% market share in Virtualization Management Software<i>[https://idatalabs.com/tech/products/kubernetes] </i></ref>. </p>
+
  <p>There are a variety of ways Data Lakes are being used in the industry:</p>
 +
  <ul>
 +
    <li><p><b>Ingestion of semi-structured and unstructured data sources (aka big data)</b> such as equipment readings, telemetry data, logs, streaming data, and so forth. A Data Lake is a great solution for storing IoT (Internet of Things) type of data which has traditionally been more difficult to store, and can support near real-time analysis. Optionally, you can also add structured data (i.e., extracted from a relational data source) to a Data Lake if your objective is a single repository of all data to be available via the lake.</p></li>
 +
    <li><p><b>Experimental analysis </b>of data before its value or purpose has been fully defined. Agility is important for every business these days, so a Data Lake can play an important role in "proof of value" type of situations because of the "ELT" approach discussed above.</p></li>
 +
    <li><p><b>Advanced analytics support. </b>A Data Lake is useful for data scientists and analysts to provision and experiment with data.</p></li>
 +
    <li><p><b>Archival and historical data storage. </b>Sometimes data is used infrequently, but does need to be available for analysis. A Data Lake strategy can be very valuable to support an active archive strategy.</p></li>
 +
    <li><p><b>Distributed processing </b>capabilities associated with a logical data warehouse.</p></li>
 +
  </ul>
 +
  <b class="expand mw-collapsible-content">How TD Bank Made Its Data Lake More Usable</b>
 +
  <p class="expand mw-collapsible-content">[[https://www.datanami.com/2017/10/03/td-bank-made-data-lake-usable/]]<br>Toronto-Dominion Bank (TD Bank) is one of the largest banks in North America, with 85,000 employees, more than 2,400 locations between Canada and the United States, and assets nearing $1 trillion. In 2014, the company decided to standardize how it warehouses data for various business intelligence and regulatory reporting functions. The company purchased a Hadoop distribution and set off to build a large cluster that could function as a centralized lake to store data originating from a variety of departments.</p>
 +
 
 
   <h2>Canadian Government Use</h2>
 
<p>There is a lack of documented Government of Canada (GC) initiatives and programs promoting the current and future use of Kubernetes technology. As a GC strategic IT item, Kubernetes is absent from both the GC’s Digital Operations Strategic Plan: 2018-2022 and the GC Strategic Plan for Information Management and Information Technology 2017 to 2021. This may be due to the fact that the GC is currently grappling with the implementation of Cloud Services, and the majority of resources and efforts are occupied with implementation challenges, as well as security concerns related to the protection of the information of Canadians.</p>
+
<p>In 2019, the Treasury Board of Canada Secretariat (TBS) partnered with Shared Services Canada and other departments to identify a business lead to develop a Data Lake (a repository of raw data) service strategy so that the GC can take advantage of big data and market innovation to foster better analytics and promote horizontal data-sharing. </p>
<p class="expand mw-collapsible-content">However, the introduction of containers into the market has shown that large-scale organizations involved in cloud-native application development as well as networking can benefit greatly from the use of containers <ref>CENGN. (2018). CENGN and CloudOps Collaborate to Train Industry on Docker and Kubernetes<i>[Available at: https://www.cengn.ca/docker-kubernetes-training-jan18/]</i></ref>. Although the infrastructure applications providing cloud services can be based solely on Virtual Machines (VMs), the maintenance costs associated with running different operating systems on individual VMs outweigh the benefit <ref>Heron, P. (2018). Experimenting with containerised infrastructure for GOV.UK - Inside GOV.UK. [online] Insidegovuk.blog.gov.uk<i>[https://insidegovuk.blog.gov.uk/2017/09/15/experimenting-with-containerised-infrastructure-for-gov-uk/ ]</i></ref>. Containers and containerization are a replacement and/or complementary architecture for VMs. As the GC moves toward cloud services and development of cloud-native applications, the use of containers and orchestrating them with Kubernetes can become an integral part of the GC IT architecture. </p>
+
<p class="expand mw-collapsible-content">Big data is the technology that stores and processes data and information in datasets that are so large or complex that traditional data processing applications can’t analyze them. Big data can make available almost limitless amounts of information, improving data-driven decision-making and expanding open data initiatives. Business intelligence involves creating, aggregating, analyzing and visualizing data to inform and facilitate business management and strategy. TBS, working with departments, will lead the development of requirements for an enterprise analytics platform.</p>
 
+
<p>Data Lake development in the GC is a more recent initiative. This is mainly due to the GC focussing resources on the implementation of cloud initiatives. However, there are some GC departments engaged in developing Data Lake environments in tandem with cloud initiatives.</p>
 +
<p class="expand mw-collapsible-content">Notably, Employment and Social Development Canada (ESDC) is preparing to install multiple Data Lakes in order to enable a Data Lake Ecosystem and a Data Analytics and Machine Learning toolset. This will enable ESDC to share information horizontally both effectively and safely, while enabling a wide variety of data analytics capabilities. ESDC aims to keep current data and analytics capabilities up to date while exploring new ones to mitigate gaps and continuously evolve its services to meet clients’ needs. </p>
 
   <h2>Implications for Government Agencies</h2>
 
 
   <h3>Shared Services Canada (SSC)</h3>
 
 
   <h4>Value Proposition</h4>
 
   <p>The primary business value impact of Kubernetes is the technology’s portability and mobility, independent of the environment. Its ability to manage and orchestrate an organization’s application containers is a marked benefit. Kubernetes’ secondary business value is that it enables enterprise high velocity, meaning that every product team can safely ship updates many times a day, deploy instantly, observe results in real time, and use this feedback to roll containers forward or back with the goal of improving the customer experience as fast as possible<ref>Jayanandana, Nilesh. (May 2nd, 2018). Benefits of Kubernetes. Medium Newspaper. Retrieved 16-May-2019 from: <i>[https://medium.com/platformer-blog/benefits-of-kubernetes-e6d5de39bc48]</i></ref>. </p>
+
   <p class="expand mw-collapsible-content">There are three common value propositions for pursuing Data Lakes. 1) It can provide an easy and accessible way to obtain data faster; 2) It can create a singular inflow point of data to help connect and merge information silos in an organization; and 3) It can provide an experimental environment for experienced data scientists to enable new analytical insights.</p>
   <p>In the age of modern web services, users expect their applications to be available 24/7, and developers expect the ability to deploy new versions of those applications several times a day with minimal downtime. Containers have become one of the main ways in which to manage applications across enterprise IT infrastructure and also one of the most difficult areas to manage effectively.</p>
+
   <p class="inline">Data Lakes can provide data to consumers more quickly by offering data in a more raw and easily accessible form. Data is stored in its native form with little to no processing, and the lake is optimized to store vast amounts of data in their native formats. By allowing the data to remain in its native format, a much timelier stream of data is available for unlimited queries and analysis. A Data Lake can help data consumers bypass strict data retrieval and data structured applications such as a data warehouse and/or data mart. This has the effect of improving a business’ data flexibility.</p><p class="expand inline mw-collapsible-content">Some companies have in fact used Data Lakes to replace existing warehousing environments where implementing a new data warehouse is cost prohibitive. A Data Lake can contain unrefined data, which is helpful when either a business data structure is unknown or a data consumer requires access to the data quickly. </p>
  <p>Kubernetes, as an open source system, is a technology that can administer and manage a large number of containerized applications spread across clusters of servers while providing basic mechanisms for deployment, maintenance, and scaling of applications<ref>GitHub. (2019). Production-Grade Container Scheduling and Management. GitHub. 2019. Retrieved 16-May-2019 from: <i>[https://github.com/kubernetes/kubernetes ]</i></ref>.  An application container is a standard unit of software that packages code and all its dependencies so the application runs quickly and reliably from one computing environment to another<ref>Docker. (2019). What is a Container? A Standardized Unit of Software. Docker Inc. 2019.Retrieved 16-May-2019 from: <i>[https://www.docker.com/resources/what-container ]</i></ref>.  Kubernetes automates the distribution and scheduling of application containers across a cluster in a more efficient way<ref>Kubernetes. (2019). Using Minikube to Create a Cluster. Kubernetes. 2019. ICP license: 京ICP备17074266号-3. Retrieved 16-May-2019 from: <i>[https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/cluster-intro/ ]</i></ref>.  </p>
+
  <p></p>
  <p>Containers offer a logical packaging mechanism in which applications can be abstracted from the environment in which they actually run. This decoupling allows container-based applications to be deployed easily and consistently, regardless of whether the target environment is a private data center, the public cloud, or even a developer’s personal laptop<ref><i>[https://cloud.google.com/containers/ ]</i></ref>.  An additional benefit of containerization is reduced Operating System (OS) overhead: containers share the host OS kernel instead of each running a full OS of their own, as VMs do. </p>
+
   <p class="inline">A Data Lake is not a single source of truth. A Data Lake is a central location in which data converges from all data sources and is stored, regardless of the data formatting. </p><p class="expand inline mw-collapsible-content">As a singular point for the inflow of data, sections of a business can pool their information together in the Data Lake and increase the sharing of information with other parts of the organization. In this way everyone in the organization has access to the data. A Data Lake can increase the horizontal data sharing within an organization by creating this singular data inflow point. Using a variety of storage and processing tools analysts can extract data value quickly in order to inform key business decisions.</p>
   <p class="inline">Since Kubernetes is open source, it gives the enterprise the freedom to take advantage of on-premises, hybrid, or public cloud infrastructure, and the ability to effortlessly move workloads<ref>Kubernetes. (2019). Production-Grade Container Orchestration. Kubernetes. 2019. ICP license: 京ICP备17074266号-3. Retrieved 16-May-2019 from: <i>[https://kubernetes.io/ ]</i></ref>.  Containerized applications are more flexible and available than in past deployment models, where applications were installed directly onto specific machines as packages deeply integrated into the host. Kubernetes groups containers that make up an application into logical units for easy management and discovery. </p><p class="expand inline mw-collapsible-content">The abstractions in Kubernetes allow deployment of containerized applications to a cluster without tying them specifically to individual machines (i.e. Virtual Machines). Applications can be co-located on the same machines without impacting each other. This means that tasks from multiple users can be packed onto fewer machines. This provides greater efficiency and reduces the cost of hardware, as fewer machines are used. </p>
+
  <p></p>
  <p>Kubernetes contains tools for orchestration, secrets management, service discovery, scaling and load balancing and includes automatic bin packing to place containers with the optimal resources, and it applies configurations via configuration management features<ref>Rouse, Margaret, et al. (August 2017). Kubernetes. TechTarget Inc. 2019. Retrieved 16-May-2019 from: <i>[https://searchitoperations.techtarget.com/definition/Google-Kubernetes ]</i></ref>.  It protects container workloads by rolling out or rolling back changes and offers availability and quality checks for containers -- replacing or restarting failed containers. As requirements change, a user can move container workloads in Kubernetes from one cloud provider or hosting infrastructure to another without changing the code<ref>Rouse, Margaret, et al. (August 2017). Kubernetes. TechTarget Inc. 2019. Retrieved 16-May-2019 from: <i>[https://searchitoperations.techtarget.com/definition/Google-Kubernetes]</i></ref>.  This is a great value to developers as their work is protected and an audit trail of changes is available.</p>
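<p>The bin packing idea mentioned above can be sketched with a simple first-fit-decreasing heuristic in Python. This is not the algorithm used by the Kubernetes scheduler; it is a deliberately simplified stand-in, and the container names, CPU requests and node capacity below are invented for the example. It only shows how placing workloads by resource request lets several containers share fewer machines.</p>

```python
def first_fit_decreasing(containers, node_capacity):
    """Place containers (name -> CPU request) onto nodes of equal capacity.

    Returns a list of nodes, each a dict holding the containers placed on it
    and the capacity that is still free.
    """
    nodes = []
    # Consider the largest requests first so they are not stranded later.
    for name, cpu in sorted(containers.items(), key=lambda kv: kv[1], reverse=True):
        if cpu > node_capacity:
            raise ValueError(f"{name} requests more than one node can offer")
        for node in nodes:
            if node["free"] >= cpu:
                node["containers"].append(name)
                node["free"] -= cpu
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append({"containers": [name], "free": node_capacity - cpu})
    return nodes


# Hypothetical CPU requests, in whole cores.
requests = {"api": 2, "db": 3, "cache": 1, "worker": 2}
placement = first_fit_decreasing(requests, node_capacity=4)
for index, node in enumerate(placement):
    print(index, node["containers"], "free:", node["free"])
```

<p>With these made-up numbers, four containers fit on two four-core nodes, whereas giving each container its own machine would have required four.</p>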
+
   <p class="expand inline mw-collapsible-content">A Data Lake is optimized for exploration and provides an experimental environment for experienced data scientists to uncover new insights from data. Analysts can overlay context on the data to extract value. All organizations want to increase analytics and operational agility.</p><p class="inline">The Data Lake architectural approach can store large volumes of data, which gives cross-cutting teams a way to pool their data in a central location and complement their systems of record with systems of insight. </p><p class="expand inline mw-collapsible-content">Data Lakes present the most potential benefits for experienced and competent data scientists. </p><p class="inline">Having structured, unstructured and semi-structured data, usually in the same data set, can yield business, predictive, and prescriptive insights not previously possible from a structured platform such as a data warehouse or data mart.</p>
   <p class="expand mw-collapsible-content">The core concepts of Kubernetes that enable high velocity are immutability, declarative configuration and self-healing systems<ref>Jayanandana, Nilesh. (May 2nd, 2018). Benefits of Kubernetes. Medium Newspaper. Retrieved 16-May-2019 from: <i>[https://medium.com/platformer-blog/benefits-of-kubernetes-e6d5de39bc48]</i></ref>. </p>
+
 
  <p>Containers and Kubernetes encourage developers to build distributed systems that adhere to the principles of immutable infrastructure. In immutable infrastructure, an artifact, once created, is not changed by user modifications. To update applications in an immutable infrastructure, a new container image is built with a new tag and deployed, terminating the old container with the old image version. In this way, the enterprise always has an artifact record of what was done and can tell whether there was an error in the new image. If an error is detected, the container is rolled back to the previous image<ref>Jayanandana, Nilesh. (May 2nd, 2018). Benefits of Kubernetes. Medium Newspaper. Retrieved 16-May-2019 from: <i>[https://medium.com/platformer-blog/benefits-of-kubernetes-e6d5de39bc48]</i></ref>.  Anything that goes into a container is described in a text file. Text files can be treated like application source code, which supports immutability.</p>
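<p>The update-and-rollback pattern described above can be modelled in a few lines of Python. This is a conceptual sketch under stated assumptions, not Kubernetes code: the <code>Deployment</code> class, its methods and the image tags are hypothetical. It shows the two properties the paragraph relies on: an existing image is never altered, and every change leaves a record that can be audited or reversed.</p>

```python
class Deployment:
    """Toy model of immutable rollouts: images are never edited, only replaced."""

    def __init__(self, image):
        self.images = [image]           # every image tag ever built, in order
        self.active = 0                 # index of the image currently running
        self.log = [("deploy", image)]  # audit trail of what was done

    @property
    def current(self):
        return self.images[self.active]

    def roll_out(self, new_image):
        """Deploy a newly built image; older images stay untouched."""
        if new_image in self.images:
            raise ValueError("tags are immutable: build a new tag instead of reusing one")
        self.images.append(new_image)
        self.active = len(self.images) - 1
        self.log.append(("deploy", new_image))

    def roll_back(self):
        """Return to the previous image if the current one turns out to be faulty."""
        if self.active == 0:
            raise RuntimeError("no earlier image to roll back to")
        self.active -= 1
        self.log.append(("rollback", self.current))


app = Deployment("shop:1.0")
app.roll_out("shop:1.1")   # new tag, new image
app.roll_back()            # an error was detected in 1.1
print(app.current)
print(app.log)
```

<p>Because the faulty image is kept rather than overwritten, it remains available for later inspection, which is the "artifact record" the text refers to.</p>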
  −
  <p class="expand mw-collapsible-content">Declarative configuration enables the user to describe exactly what state the system should be in. Traditional development tools such as source control and unit tests can be used with declarative configurations in ways that are impossible with imperative configurations. Imperative systems describe how to get from point A to B, but rarely include reverse instructions to get back. Kubernetes’ declarative configuration makes rollbacks fairly easy, something that is impossible with imperative configurations<ref>Jayanandana, Nilesh. (May 2nd, 2018). Benefits of Kubernetes. Medium Newspaper. Retrieved 16-May-2019 from: <i>[https://medium.com/platformer-blog/benefits-of-kubernetes-e6d5de39bc48]</i></ref>. </p>
  −
  <p class="expand mw-collapsible-content">Lastly, Kubernetes has a means of self-healing. When Kubernetes receives a desired state configuration, it does not simply take actions to make the current state match the desired state at a single time, but it will continuously take actions to ensure it stays that way as time passes by<ref>Jayanandana, Nilesh. (May 2nd, 2018). Benefits of Kubernetes. Medium Newspaper. Retrieved 16-May-2019 from: <i>[https://medium.com/platformer-blog/benefits-of-kubernetes-e6d5de39bc48]</i></ref>. </p>
   
   <h4>Challenges</h4>
 
 
   <p>The greatest challenge with regard to Kubernetes is its complexity. However, security, storage and networking, maturity, and competing enterprise transformation priorities are also challenges facing the Kubernetes technology.</p><br><b>Kubernetes Complexity and Analyst Experience</b>
 