Line 136: |
Line 136: |
| | | |
| <h4>Challenges</h4> | | <h4>Challenges</h4> |
− | <p>The greatest challenge in regards to Kubernetes is its complexity. However, security, storage and networking, maturity, and competing enterprise transformation priorities are also challenges facing the Kubernetes technology.</p><br><b>Kubernetes Complexity and Analyst Experience</b> | + | <p>Although Data Lake technology has many benefits for organizations dealing with big data it has its own challenges. For example:</p> |
− | <p>There is the challenge of a lack of organizational and analyst experience with container management and in using Kubernetes. Managing, updating, and changing a Kubernetes cluster can be operationally complex, more so if the analysts have a lack of experience. The system itself does provide a solid base of infrastructure for a Platform as a Service (PaaS) framework, which can reduce the complexity for developers. However, testing within a Kubernetes environment is still a complex task. Although its use cases in testing are well noted, testing several moving parts of an infrastructure to determine proper application functionality is still a more difficult endeavour <ref>Clayton, T. and Watson, R. (2018). Using Kubernetes to Orchestrate Container-Based Cloud and Microservices Applications. [online] Gartner.com. Available at: <i>[https://www.gartner.com/doc/3873073/using-kubernetes-orchestrate-containerbased-cloud]</i></ref>. This means a lot of new learning will be needed for operations teams developing and managing Kubernetes infrastructure. The larger the company, the more likely the Kubernetes user is to face container challenges<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. </p><br><b>Security</b> | + | <b>Data Governance and Semantic Issues</b> |
− | <p>In a distributed, highly scalable environment, traditional and typical security patterns will not cover all threats. Security will have to be aligned for containers and in the context of Kubernetes. It is critical for operations teams to understand Kubernetes security in terms of containers, deployment, and network security. Security perimeters are porous, containers must be secured at the node level, but also through the image and registry. Security practices in the context of various deployment models will be a persistent challenge<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. </p><br><b>Storage & Networking</b> | + | <p class="expand inline mw-collapsible-content">The biggest challenge for Data Lakes is to resolve assorted data governance requirements in a single centralized data platform. Data Lakes fail mostly when they lack governance, self-disciplined users, and a rational data flow.</p><p class="inline">Often, Data Lake implementations are focused on storing data instead of managing the data. Data Lakes are not optimized for semantic enforcement or consistency. They are made for semantic flexibility, to allow anyone to provide context to data if they have the skills to do so. </p> |
− | <p>Storage and networking technologies are pillars of data center infrastructure, but were designed originally for client/server and virtualized environments. Container technologies are leading companies to rethink how storage and networking technologies function and operate<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. Architectures are becoming more application-oriented and storage does not necessarily live on the same machine as the application or its services. Larger companies tend to run more containers, and to do so in scaled-out production environments requires new approaches to infrastructure<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. </p> | + | <p>Putting data in the same place does not remove it’s ambiguity or meaning. Data Lakes provide unconstrained, “no compromises” storage model environment without the data governance assurances common to data warehouses or data marts. Proper meta data is essential for a Data Lake, without appropriate meta data the Data Lake will not work as intended. It is beneficial to think of meta data as the fish finder in the Data Lake.</p> |
− | <p>Some legacy systems can run containers and only sometimes can VMs can be replaced by containers. There may be significant engineering consequences to existing legacy systems if containerization and Kubernetes is implemented in a legacy system not designed to handle that change. Some Legacy systems may require refactoring and making it more suitable for containerization. Some pieces of a system may be able to be broken off and containerized. In general, anything facing the internet should be run in containers.</p><br><b>Maturity</b> | + | <b class="expand mw-collapsible-content">Lack of Quality and Trust in Data</b> |
− | <p>Kubernetes maturity as a technology is still being tested by organizations. For now, Kubernetes is the market leader and the standardized means of orchestrating containers and deploying distributed applications. Google is the primary commercial organization behind Kubernetes; however they do not support Kubernetes as a software product. It offers a commercial managed Kubernetes service known as GKE but not as a software. This can be viewed as both a strength and a weakness. Without commercialization, the user is granted more flexibility with how Kubernetes can be implemented in their infrastructure; However, without a concrete set of standards of the services that Kubernetes can offer, there is a risk that Google’s continuous support cannot be guaranteed. Its donation of Kubernetes code and intellectual property to the Cloud Native Computing Foundation does minimize this risk since there is still an organization enforcing the proper standards and verifying services Kubernetes can offer moving forward <ref>Clayton, T. and Watson, R. (2018). Using Kubernetes to Orchestrate Container-Based Cloud and Microservices Applications. [online] Gartner.com. Available at: <i>[https://www.gartner.com/doc/3873073/using-kubernetes-orchestrate-containerbased-cloud]</i></ref>. It is also important to note that the organizational challenges that Kubernetes users face have been more dependent on the size of the organization using it.</p> | + | <p class="expand mw-collapsible-content">Data quality and trust in the data is a perennial issue for many organizations. Although data discovery tools can apply Machine Learning across related datasets from multiple data sources to identify anomalies (incorrect values, missing values, duplicates and outdated data), quality and trustworthiness of data continue to be an issue for Data Lakes who can easily become data dumping grounds. Some data is more accurate than others. This can present a real problem for anyone using multiple data sets and making decisions based upon analysis conducted with data of varying degrees of quality.</p> |
− | <p>Kubernetes faces competition from other scheduler and orchestrator technologies, such as Docker Swarm and Mesosphere DC/OS. While Kubernetes is sometimes used to manage Docker containers, it also competes with the native clustering capabilities of Docker Swarm<ref>Rouse, Margaret, et al. (August 2017). Kubernetes. TechTarget Inc. 2019. Retrieved 16-May-2019 from: <i>[https://searchitoperations.techtarget.com/definition/Google-Kubernetes]</i></ref>. However, Kubernetes can be run on a public cloud service or on-premises, is highly modular, open source, and has a vibrant community. Companies of all sizes are investing into it, and many cloud providers offer Kubernetes as a service<ref>Tsang, Daisy. (February 12th, 2018). Kubernetes vs. Docker: What Does It Really Mean? Sumo Logic. 2019. Retrieved 16-May-2019 from: <i>[https://www.sumologic.com/blog/kubernetes-vs-docker/ ]</i></ref>. </p><br><b class="expand mw-collapsible-content">Competing Enterprise Transformation Priorities</b> | + | <b>Data Swamps, Performance, and Flexibility Challenges</b> |
− | <p class="expand mw-collapsible-content">The last challenge facing Kubernetes initiative development and implementation is its place in an organization’s IT transformation priority list. Often there are many higher priority initiatives that can take president over Kubernetes projects.</p>
| + | <p class="expand inline mw-collapsible-content">Data stored in Data Lakes can sometimes become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. </p><p class="inline">. A Data Lake is not optimized for a high number of users or diverse and simultaneous workloads due to intensive query tasks. This can result in performance degradation and failures are common when running extractions, transformations, and loading tasks all at the same time. On-premises Data Lakes face other performance challenges in that they have a static configuration. </p> |
− | | + | <b class="expand mw-collapsible-content">Data Hoarding and Storage Capacity</b> |
− | | + | <p class="expand mw-collapsible-content">Data stored in Data Lakes may actually never be used in production and stay unused indefinitely in the Data Lake. By storing massive amounts of historical data, the infinite Data Lake may skew analysis with data that is no longer relevant to the priorities of the business. In keeping the historical data the metadata describing it must be understood as well. This decreases the performance of the Data Lake by increasing the overall workload of employees to clean the datasets no longer in use for analysis.</p> |
| + | <p class="expand mw-collapsible-content">Storing increasingly massive amounts of data for an unlimited time will also lead to scalability and cost challenges. Scalability challenges are less of a risk in public cloud environments, but cost remains a factor. On-premises Data Lakes are more susceptible to cost challenges. This is because their cluster nodes require all three dimensions of computing (storage, memory and processing). Organizations of all kinds generate massive amounts of data (including meta data) and it is increasing exponentially.</p> |
| + | <p>The storage capacity of all this data (and future data) will be an ongoing challenge and one that will require constant management. While Data Lakes can and will be stored on the cloud, SSC as cloud broker for the GC will need to provide the appropriate infrastructure and scalability to clients.</p> |
| + | <b class="expand mw-collapsible-content">Advanced Users Required</b> |
| + | <p class="expand mw-collapsible-content">Data Lakes are not a platform to be explored by everyone. Data Lakes present an unrefined view of data that usually only the most highly skilled analysts are able to explore and engage in data refinement independent of any other formal system-of-record such as a data warehouse. </p> |
| + | <p class="expand mw-collapsible-content">Not just anyone in an organization is data-literate enough to derive value from large amounts of raw or uncurated data. The reality is only a handful of staff are skilled enough to navigate a Data Lake. Since Data Lakes store raw data their business value is entirely determined by the skills of Data Lake users. These skills are often lacking in an organization.</p> |
| + | <b>Data Security</b> |
| + | <p class="inline"></p>Data in a Data Lake lacks standard security protection with a relational database management system or an enterprise database. In practice, this means that the data is unencrypted and lacks access control.<p class="expand inline mw-collapsible-content">. Security is not just a binary solution. We have varying degrees of security (unclassified, secret, top secret, etc.) and all of which require different approaches. This will inevitably present challenges with the successful use of data from Data Lakes.To combat this, organizations will have to embrace a new security framework to be compatable with Data Lakes and Data Scientists.</p> |
| <h4>Considerations</h4> | | <h4>Considerations</h4> |
| <b>Strategic Resourcing and Network Planning</b> | | <b>Strategic Resourcing and Network Planning</b> |