Changes

Technology Trends/Datalakes (view source)

Revision as of 14:35, 18 July 2019

1,256 bytes removed , 14:35, 18 July 2019

no edit summary

Line 136: Line 136:

<h4>Challenges</h4>

−

The greatest challenge in regards to Kubernetes is its complexity. However, security, storage and networking, maturity, and competing enterprise transformation priorities are also challenges facing the Kubernetes technology. Kubernetes Complexity and Analyst Experience

+

Although Data Lake technology offers many benefits to organizations dealing with big data, it has its own challenges. For example:

−

There is the challenge of a lack of organizational and analyst experience with container management and in using Kubernetes. Managing, updating, and changing a Kubernetes cluster can be operationally complex, more so if the analysts have a lack of experience. The system itself does provide a solid base of infrastructure for a Platform as a Service (PaaS) framework, which can reduce the complexity for developers. However, testing within a Kubernetes environment is still a complex task. Although its use cases in testing are well noted, testing several moving parts of an infrastructure to determine proper application functionality is still a more difficult endeavour <ref>Clayton, T. and Watson, R. (2018). Using Kubernetes to Orchestrate Container-Based Cloud and Microservices Applications. [online] Gartner.com. Available at: <i>[https://www.gartner.com/doc/3873073/using-kubernetes-orchestrate-containerbased-cloud]</i></ref>. This means a lot of new learning will be needed for operations teams developing and managing Kubernetes infrastructure. The larger the company, the more likely the Kubernetes user is to face container challenges<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: [https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</ref>. Security

+

Data Governance and Semantic Issues

−

In a distributed, highly scalable environment, traditional and typical security patterns will not cover all threats. Security will have to be aligned for containers and in the context of Kubernetes. It is critical for operations teams to understand Kubernetes security in terms of containers, deployment, and network security. Security perimeters are porous, containers must be secured at the node level, but also through the image and registry. Security practices in the context of various deployment models will be a persistent challenge<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: [https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</ref>. Storage & Networking

+

The biggest challenge for Data Lakes is reconciling assorted data governance requirements in a single centralized data platform. Data Lakes fail mostly when they lack governance, self-disciplined users, and a rational data flow. Often, Data Lake implementations focus on storing data rather than managing it. Data Lakes are not optimized for semantic enforcement or consistency. They are built for semantic flexibility, allowing anyone with the necessary skills to provide context to the data.

−

Storage and networking technologies are pillars of data center infrastructure, but were designed originally for client/server and virtualized environments. Container technologies are leading companies to rethink how storage and networking technologies function and operate<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</i></ref>. Architectures are becoming more application-oriented and storage does not necessarily live on the same machine as the application or its services. Larger companies tend to run more containers, and to do so in scaled-out production environments requires new approaches to infrastructure<ref>Williams, Alex, et al. Kubernetes Deployment & Security Patterns. The New Stack. 2019. 20180622. thenewstack.io. Retrieved 15-May-2019 from: [https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/]</ref>.

+

Putting data in the same place does not remove its ambiguity or clarify its meaning. Data Lakes provide an unconstrained, “no compromises” storage environment without the data governance assurances common to data warehouses or data marts. Proper metadata is essential: without it, the Data Lake will not work as intended. It is helpful to think of metadata as the fish finder in the Data Lake.
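The "fish finder" role of metadata can be illustrated with a minimal sketch. Everything here is hypothetical: the `DatasetMetadata` and `Catalog` classes, the file paths, and the tags are invented for illustration and do not correspond to any particular Data Lake product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical catalog entry describing one object stored in the lake."""
    path: str
    owner: str
    description: str
    tags: set = field(default_factory=set)


class Catalog:
    """In-memory stand-in for a metadata catalog (illustrative only)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Index each entry by its storage path.
        self._entries[entry.path] = entry

    def find(self, tag):
        """Return the sorted paths of all datasets carrying the given tag."""
        return sorted(p for p, e in self._entries.items() if tag in e.tags)

    def undocumented(self, all_paths):
        """Return stored paths that have no catalog entry at all."""
        return sorted(set(all_paths) - set(self._entries))


catalog = Catalog()
catalog.register(DatasetMetadata(
    "raw/hr/payroll_2019.csv", "hr-team", "Monthly payroll extract", {"hr", "finance"}))
catalog.register(DatasetMetadata(
    "raw/it/tickets.json", "it-team", "Service desk tickets", {"it"}))

# Paths physically present in the lake (made-up sample data).
stored = ["raw/hr/payroll_2019.csv", "raw/it/tickets.json", "raw/misc/dump_001.bin"]

print(catalog.find("hr"))            # datasets an HR analyst can locate
print(catalog.undocumented(stored))  # data nobody can find or interpret
```

In this sketch, any path returned by `undocumented` is data that exists in storage but cannot be discovered or interpreted through the catalog.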

−

Some legacy systems can run containers and only sometimes can VMs can be replaced by containers. There may be significant engineering consequences to existing legacy systems if containerization and Kubernetes is implemented in a legacy system not designed to handle that change. Some Legacy systems may require refactoring and making it more suitable for containerization. Some pieces of a system may be able to be broken off and containerized. In general, anything facing the internet should be run in containers.</p> Maturity</b>

+

Lack of Quality and Trust in Data

−

Kubernetes maturity as a technology is still being tested by organizations. For now, Kubernetes is the market leader and the standardized means of orchestrating containers and deploying distributed applications. Google is the primary commercial organization behind Kubernetes; however they do not support Kubernetes as a software product. It offers a commercial managed Kubernetes service known as GKE but not as a software. This can be viewed as both a strength and a weakness. Without commercialization, the user is granted more flexibility with how Kubernetes can be implemented in their infrastructure; However, without a concrete set of standards of the services that Kubernetes can offer, there is a risk that Google’s continuous support cannot be guaranteed. Its donation of Kubernetes code and intellectual property to the Cloud Native Computing Foundation does minimize this risk since there is still an organization enforcing the proper standards and verifying services Kubernetes can offer moving forward <ref>Clayton, T. and Watson, R. (2018). Using Kubernetes to Orchestrate Container-Based Cloud and Microservices Applications. [online] Gartner.com. Available at: [https://www.gartner.com/doc/3873073/using-kubernetes-orchestrate-containerbased-cloud]</ref>. It is also important to note that the organizational challenges that Kubernetes users face have been more dependent on the size of the organization using it.

+

Data quality and trust in the data are perennial issues for many organizations. Although data discovery tools can apply Machine Learning across related datasets from multiple data sources to identify anomalies (incorrect values, missing values, duplicates, and outdated data), the quality and trustworthiness of data continue to be an issue for Data Lakes, which can easily become data dumping grounds. Some data is more accurate than other data. This presents a real problem for anyone who combines multiple datasets and makes decisions based on analysis of data of varying quality.
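The kinds of anomalies listed above (missing values, duplicates, and outdated data) can be made concrete with a small rule-based profile. This is only a sketch under stated assumptions: it uses fixed rules rather than Machine Learning, and the function name `profile_quality`, the field names, and the one-year freshness threshold are all invented for illustration.

```python
from datetime import date


def profile_quality(records, key, required, as_of, max_age_days):
    """Count rows with missing values, duplicate keys, and outdated timestamps."""
    seen = set()
    report = {"missing": 0, "duplicates": 0, "outdated": 0}
    for row in records:
        # A row counts as "missing" if any required field is absent or empty.
        if any(row.get(f) in (None, "") for f in required):
            report["missing"] += 1
        # A row counts as a duplicate if its key was already seen.
        if row[key] in seen:
            report["duplicates"] += 1
        seen.add(row[key])
        # A row counts as outdated if it was last updated too long ago.
        if (as_of - row["updated"]).days > max_age_days:
            report["outdated"] += 1
    return report


# Made-up sample rows from a hypothetical dataset.
rows = [
    {"id": 1, "name": "Alice", "updated": date(2019, 7, 1)},
    {"id": 2, "name": "", "updated": date(2019, 6, 15)},
    {"id": 2, "name": "Bob", "updated": date(2017, 1, 10)},
]

report = profile_quality(
    rows, key="id", required=["name"], as_of=date(2019, 7, 18), max_age_days=365)
print(report)
```

Running a profile like this per dataset gives a rough way to compare datasets of varying quality before they are combined in an analysis.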

−

Kubernetes faces competition from other scheduler and orchestrator technologies, such as Docker Swarm and Mesosphere DC/OS. While Kubernetes is sometimes used to manage Docker containers, it also competes with the native clustering capabilities of Docker Swarm<ref>Rouse, Margaret, et al. (August 2017). Kubernetes. TechTarget Inc. 2019. Retrieved 16-May-2019 from: [https://searchitoperations.techtarget.com/definition/Google-Kubernetes]</ref>. However, Kubernetes can be run on a public cloud service or on-premises, is highly modular, open source, and has a vibrant community. Companies of all sizes are investing into it, and many cloud providers offer Kubernetes as a service<ref>Tsang, Daisy. (February 12th, 2018). Kubernetes vs. Docker: What Does It Really Mean? Sumo Logic. 2019. Retrieved 16-May-2019 from: [https://www.sumologic.com/blog/kubernetes-vs-docker/ ]</ref>. </p> <b class="expand mw-collapsible-content">Competing Enterprise Transformation Priorities

+

Data Swamps, Performance, and Flexibility Challenges

−

The last challenge facing Kubernetes initiative development and implementation is its place in an organization’s IT transformation priority list. Often there are many higher priority initiatives that can take president over Kubernetes projects.

+

Data stored in Data Lakes can become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. A Data Lake is not optimized for a high number of users or for diverse, simultaneous workloads, because its query tasks are intensive. As a result, performance degradation and failures are common when extraction, transformation, and loading tasks all run at the same time. On-premises Data Lakes face additional performance challenges because they have a static configuration.

−

+

Data Hoarding and Storage Capacity

−

+

Data stored in a Data Lake may never be used in production and can sit unused indefinitely. By storing massive amounts of historical data, the infinite Data Lake may skew analysis with data that is no longer relevant to the priorities of the business. Keeping historical data also means that the metadata describing it must be kept and understood. This decreases the performance of the Data Lake by increasing the overall workload of the employees who must clean up datasets no longer used for analysis.
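One way to spot hoarded data is to flag datasets that have not been read for a long time. The sketch below assumes that a last-access date is available for every dataset, which may not be true in practice; the function `stale_datasets`, the dataset names, and the one-year idle threshold are hypothetical.

```python
from datetime import date, timedelta


def stale_datasets(last_access, today, max_idle_days):
    """Return names of datasets not read within max_idle_days, oldest first."""
    cutoff = today - timedelta(days=max_idle_days)
    # Pair each stale dataset with its last-access date so sorting is by age.
    stale = [(when, name) for name, when in last_access.items() if when < cutoff]
    return [name for when, name in sorted(stale)]


# Made-up last-access dates for three datasets.
last_access = {
    "sales_2012": date(2014, 3, 1),
    "sales_2019": date(2019, 7, 10),
    "logs_2016": date(2017, 1, 5),
}

candidates = stale_datasets(last_access, today=date(2019, 7, 18), max_idle_days=365)
print(candidates)
```

The resulting list is only a set of candidates for review, archiving, or deletion; whether old data is still relevant remains a business decision.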

+

Storing increasingly massive amounts of data for an unlimited time will also lead to scalability and cost challenges. Scalability challenges are less of a risk in public cloud environments, but cost remains a factor. On-premises Data Lakes are more susceptible to cost challenges because their cluster nodes require all three dimensions of computing (storage, memory, and processing). Organizations of all kinds generate massive amounts of data (including metadata), and that volume is increasing exponentially.

+

Storage capacity for all this data (and future data) will be an ongoing challenge that requires constant management. While Data Lakes can and will be stored in the cloud, SSC, as cloud broker for the GC, will need to provide the appropriate infrastructure and scalability to clients.

+

Advanced Users Required

+

Data Lakes are not a platform to be explored by everyone. They present an unrefined view of data that usually only the most highly skilled analysts can explore and refine independently of a formal system of record such as a data warehouse.

+

Not everyone in an organization is data-literate enough to derive value from large amounts of raw or uncurated data. In reality, only a handful of staff are skilled enough to navigate a Data Lake. Since Data Lakes store raw data, their business value is entirely determined by the skills of their users. These skills are often lacking in an organization.

+

Data Security

+

Data in a Data Lake lacks the standard security protections of a relational database management system or an enterprise database. In practice, this means that the data is unencrypted and lacks access control. Security is not a binary matter. There are varying degrees of security (unclassified, secret, top secret, etc.), each of which requires a different approach. This will inevitably present challenges for the successful use of data from Data Lakes. To combat this, organizations will have to embrace a new security framework that is compatible with Data Lakes and Data Scientists.
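The idea that security comes in degrees rather than as a single on/off switch can be sketched as a clearance check. This is an illustration only: the `Level` ordering, the `can_read` rule (a user may read data at or below their own clearance), and the dataset names are assumptions, not a description of any actual security framework.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical ordered classification levels; higher means more sensitive."""
    UNCLASSIFIED = 0
    SECRET = 1
    TOP_SECRET = 2


def can_read(user_clearance, dataset_level):
    """Assumed rule: access is allowed at or below the user's clearance."""
    return user_clearance >= dataset_level


def readable(datasets, clearance):
    """Return the sorted names of datasets the given clearance may read."""
    return sorted(name for name, level in datasets.items() if can_read(clearance, level))


# Made-up datasets tagged with a classification level.
datasets = {
    "open_data/weather": Level.UNCLASSIFIED,
    "hr/security_files": Level.SECRET,
    "ops/plans": Level.TOP_SECRET,
}

print(readable(datasets, Level.UNCLASSIFIED))
print(readable(datasets, Level.SECRET))
```

A check like this only works if every dataset in the lake carries a classification label, which again depends on metadata being captured when data is stored.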

<h4>Considerations</h4>

Strategic Resourcing and Network Planning

Kpere060

105

edits
