Changes

Technology Trends/Datalakes

Revision as of 14:50, 18 July 2019

1,862 bytes removed, 14:50, 18 July 2019

no edit summary

Line 143: Line 143:

Data quality and trust in the data are perennial issues for many organizations. Although data discovery tools can apply Machine Learning across related datasets from multiple data sources to identify anomalies (incorrect values, missing values, duplicates and outdated data), the quality and trustworthiness of data continue to be an issue for Data Lakes, which can easily become data dumping grounds. Some data is more accurate than other data. This can present a real problem for anyone using multiple data sets and making decisions based upon analysis conducted with data of varying degrees of quality.
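The anomaly categories named above (missing values, duplicates and outdated data) can be illustrated with a minimal, rule-based sketch. It is not the Machine Learning approach that data discovery tools use, and the function name, field names and the one-year age limit are assumptions made purely for illustration.

```python
from collections import Counter
from datetime import date


def profile_records(records, required_fields, key_field, date_field,
                    max_age_days, today):
    """Count missing values, duplicate keys, and outdated records.

    All parameter names are hypothetical; a real tool would be configured
    per dataset and would likely use statistical or ML-based detection.
    """
    # A record counts as "missing" if any required field is None or empty.
    missing = sum(
        1 for r in records
        if any(r.get(f) in (None, "") for f in required_fields)
    )
    # Every occurrence of a key beyond the first counts as a duplicate.
    key_counts = Counter(r.get(key_field) for r in records)
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    # A record is "outdated" if its date is older than max_age_days.
    outdated = sum(
        1 for r in records
        if r.get(date_field) is not None
        and (today - r[date_field]).days > max_age_days
    )
    return {
        "total": len(records),
        "missing": missing,
        "duplicates": duplicates,
        "outdated": outdated,
    }


# Illustrative data only.
sample = [
    {"id": 1, "name": "A", "updated": date(2019, 7, 1)},
    {"id": 2, "name": "", "updated": date(2019, 6, 1)},
    {"id": 2, "name": "B", "updated": date(2015, 1, 1)},
]
report = profile_records(
    sample,
    required_fields=["id", "name"],
    key_field="id",
    date_field="updated",
    max_age_days=365,
    today=date(2019, 7, 18),
)
print(report)
```

A report like this gives analysts a rough signal of how much a dataset can be trusted before it is combined with others.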

Data Swamps, Performance, and Flexibility Challenges

−

Data stored in Data Lakes can sometimes become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. . A Data Lake is not optimized for a high number of users or diverse and simultaneous workloads due to intensive query tasks. This can result in performance degradation and failures are common when running extractions, transformations, and loading tasks all at the same time. On-premises Data Lakes face other performance challenges in that they have a static configuration.

+

Data stored in Data Lakes can sometimes become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. A Data Lake is not optimized for a high number of users or diverse and simultaneous workloads due to intensive query tasks. This can result in performance degradation, and failures are common when extraction, transformation, and loading tasks all run at the same time. On-premises Data Lakes face other performance challenges in that they have a static configuration.

Data Hoarding and Storage Capacity

Data stored in Data Lakes may never actually be used in production and may stay unused indefinitely. By storing massive amounts of historical data, the infinite Data Lake may skew analysis with data that is no longer relevant to the priorities of the business. To keep the historical data, the metadata describing it must be understood as well. This decreases the performance of the Data Lake by increasing the overall workload of employees, who must clean datasets that are no longer in use for analysis.
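One way to make hoarded data visible is to flag datasets that have not been read for a long time. The sketch below assumes a hypothetical catalogue that maps each dataset name to its last access date (or `None` if it was never accessed); the one-year idle limit is likewise an assumption, not a recommendation from this page.

```python
from datetime import date, timedelta


def find_stale(catalog, today, max_idle_days=365):
    """Return names of datasets never accessed or idle past the limit."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name for name, last_access in catalog.items()
        if last_access is None or last_access < cutoff
    )


# Illustrative catalogue only.
catalog = {
    "sales_2012": date(2013, 1, 5),
    "clients": date(2019, 7, 1),
    "raw_dump": None,
}
stale = find_stale(catalog, today=date(2019, 7, 18))
print(stale)
```

Datasets flagged this way are candidates for review, archiving or deletion rather than automatic removal.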

Line 154: Line 154:

Data in a Data Lake lacks the standard security protection of a relational database management system or an enterprise database. In practice, this means that the data is unencrypted and lacks access control. Security is not just a binary solution. There are varying degrees of security (unclassified, secret, top secret, etc.), all of which require different approaches. This will inevitably present challenges with the successful use of data from Data Lakes. To combat this, organizations will have to embrace a new security framework that is compatible with Data Lakes and Data Scientists.
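The idea that security is graded rather than binary can be sketched as an ordered set of levels with a simple clearance check. The level names come from the paragraph above; the ordering rule and the function name are assumptions for illustration, and a real classification scheme involves far more than a single comparison.

```python
from enum import IntEnum


class Level(IntEnum):
    """Security levels in ascending order of sensitivity (illustrative)."""
    UNCLASSIFIED = 0
    SECRET = 1
    TOP_SECRET = 2


def can_read(user_clearance: Level, dataset_level: Level) -> bool:
    """Allow access only when clearance meets or exceeds the dataset level."""
    return user_clearance >= dataset_level


print(can_read(Level.SECRET, Level.UNCLASSIFIED))
print(can_read(Level.SECRET, Level.TOP_SECRET))
```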

<h4>Considerations</h4>

−

~~Strategic Resourcing and Network Planning~~

+

Shared Services Canada (SSC) has an excellent opportunity to capitalize on its mandate of providing data storage services to other GC departments. SSC, as the GC’s Service Provider, could potentially provide a centralized GC Data Lake and allow GC Data Scientists to access this central data through a single unified Data Lake interface. However, this project should be implemented only after cloud has been adopted and the migration to enterprise data centers is complete, in order to provide adequate infrastructure and scaling.

−

~~A strategic approach~~ to ~~Kubernetes investments will need~~ to ~~be developed to ensure opportunities are properly leveraged~~. ~~The~~ GC ~~invests~~ a ~~significant portion of its annual budget on IT and supporting infrastructure~~. ~~Without strategic Kubernetes direction the fragmented approaches to IT investments~~, ~~coupled with rapid developing technology~~ and ~~disjointed business practices, can undermine effective~~ and ~~efficient delivery of GC programs and services~~<~~ref~~>~~Treasury Board of Canada Secretariat. December 3, 2018. Directive on Management of Information Technology. Treasury Board of Canada Secretariat. Government of Canada. Retrieved 27-Dec-2018 from:~~ ~~[https://www~~.~~tbs-sct.gc.ca/pol/doc-eng~~.~~aspx?id=15249 ]~~<~~/ref~~>. ~~A clear vision and mandate for how Kubernetes will transform services~~, ~~and~~ what ~~the end-state Kubernetes initiative is~~ supposed to ~~look like, is a prominent consideration~~.

+

Data Lakes should not be confused with conventional databases, although they both store information. A Data Lake will always underperform when tasked with the jobs of a conventional database. To combat this, SSC must create data architectures that define the proper application of Data Lakes. Too often, Data Lakes suffer from a lack of foresight about what they are supposed to achieve. Creating a Data Lake becomes the goal rather than achieving a strategic objective.

−

~~SSC should consider defining~~ a ~~network strategy for Kubernetes adoption. Multiple factors should be taken into account, including~~ the amount of resources, funding, and expertise that will be required for the development and experimentation with Kubernetes technologies. Calculation of resource requirements including CPU, memory, storage, etc. at the start of Kubernetes projects is imperative. Considerations include whether or not an in-house Kubernetes solution is required or if a ~~solution can be procured. Other strategy considerations include analyzing different orchestration approaches for different application use cases~~.

+

Shared Services Canada (SSC) should consider designing Data Lake infrastructure around Service-Level Agreements (SLA) to keep Data Lake efforts on track. This includes ensuring that SSC has established clear goals for Data Lakes prior to deployment.

−

~~Complexity and Skills Gap~~

+

SSC should also consider building a specialized expert group focussed on advanced analytics and experimental data trend discovery in Data Lakes. While the fundamental assumption behind the Data Lake concept is that everyone accessing a Data Lake is moderately to highly skilled at data manipulation and analysis, the reality is that most are not. SSC should consider significant investment in training employees in the necessary skills, such as Data Science, Artificial Intelligence, Machine Learning, or Data Engineering.

−

Kubernetes is a good technology and the de facto standard for orchestrating containers, and containers are the future of modern software delivery. But it is notoriously complex to manage for enterprise workloads, where Service Level Agreements (~~SLAs~~) ~~are critical. The operational pain of managing production-grade Kubernetes is further complicated by the industry-wide talent scarcity and skills gap. Most organizations today struggle~~ to ~~hire Kubernetes experts, and even these “experts” lack advanced Kubernetes experience to ensure smooth operations at scale~~. SSC ~~will need~~ to ~~be cautious in implementing Kubernetes and having the right staff experienced and comfortable in its use~~.

+

SSC should be cognisant that expectations surrounding Data Lakes are significantly inflated. Inflated expectations lead to vague and ambiguous use cases and increase the chances of catastrophic failure. As a Service Provider, SSC must be strict in establishing clear goals for Data Lake provision efforts before deployment. SSC should be wary of attempts to replace strategy development with infrastructure. A Data Lake can be a technology component that supports a data and analytics strategy, but it cannot replace that strategy.

−

~~Customization~~ and ~~Integration Still Required~~

+

SSC should be concerned with the provision and running of the infrastructure; the departments themselves are responsible for the data they put in the Data Lake. However, as a Service Provider, SSC should monitor the Data Lake with regard to data governance, data lifecycle for data hygiene, and what is happening in the Data Lake overall. Depending on the technology, SSC will need to be very clear on how to monitor activities in the Data Lakes it provides to the GC.

−

~~Kubernetes technology and ecosystem~~ are ~~evolving rapidly, because of its relatively new state, it is hard~~ to ~~find packaged solutions with complete out-~~of~~-the-box support for complex, large-scale enterprise scenarios~~. As a ~~large and sophisticated enterprise organization~~, SSC ~~will need~~ to ~~devote significant resources on customization and training~~. ~~Enterprise Architecture pros will need to focus on the whole architecture of cloud-native applications as well as keep~~ a ~~close watch on~~ technology ~~evolution~~ and ~~industry~~.

+

SSC should consider a Data Lake implementation project as a way to introduce or reinvigorate a data management program by positioning data management capabilities as a prerequisite for a

−

~~Implementation usually takes longer than expected, however~~ the ~~consensus in~~ the ~~New Stack’s Kubernetes User Experience Survey is that Kubernetes reduces code deployment times~~, ~~and increases~~ the ~~frequency of those deployments<ref>Williams, Alex, et al. The State of~~ the ~~Kubernetes Ecosystem. The New Stack. thenewstack.io~~. ~~Retrieved 15-May-2019 from: [https://thenewstack.io/ebooks/kubernetes/state-of-kubernetes-ecosystem/ ]</ref>.~~ However, ~~in the short run~~, the ~~implementation phase does consume more human resources. Additionally~~, ~~implementation takes longer than expected. The consensus is that Kubernetes reduces code deployment times~~, and ~~increases~~ the ~~frequency of those deployments~~. ~~However~~, in the ~~short run,~~ the ~~implementation phase does consume more human resources~~.

+

successful Data Lake. Data will need to be qualified before it hits the Data Lake; this can and should be done in a system of record first. In this way the data can be organized to fit into the Data Lake implementation.

−

~~Pilot Small and Scale Success~~

+

−

SSC ~~may wish to~~ consider ~~evaluating the current Service Catalogue in order~~ to ~~determine where Kubernetes can be leveraged first to improve efficiencies, reduce costs, and reduce administrative burdens of existing services~~ as ~~well as how~~ a ~~new Kubernetes service could be delivered on~~ a ~~consistent basis~~. ~~Any new procurements of devices or platforms should have high market value and can~~ be ~~on-boarded easily onto~~ the ~~GC network. SSC should avoid applying in-house Kubernetes for production mission-critical apps. Failure of in-house deployments is high~~ and ~~thus~~ should be ~~avoided. SSC should pilot and establish~~ a ~~Kubernetes test cluster~~. ~~With all new cloud-based technologies, piloting is preferred. Focus should first~~ be ~~on a narrow set of objectives and a single application scenario to stand up a test cluster~~.

+

SSC should create policies on how data is managed and cleaned in the Data Lake, and should use data management tools to enforce them. Automated data governance technologies should be added to support advanced analytics. Which type of governance tool to standardize on is an issue that must be resolved. Additionally, planning for effective metadata management is vital, including metadata discovery, cataloguing and enterprise metadata management applied to the Data Lake implementation. Rigorous application of data discipline and data hygiene is needed. The majority of Data Lake analysts will prefer to work with clean, enriched, and trusted data. However, data quality is relative to the task at hand. Low-quality data may be acceptable for low-impact analysis or distant forecasting, but unacceptable for tactical or high-impact analysis. SSC assessments should take this into account.
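The point that data quality is relative to the task can be expressed as a fitness check with a different minimum quality score per impact level. This is a minimal sketch: the score scale (0 to 1), the threshold values and all names are hypothetical and would have to be set by policy.

```python
# Hypothetical minimum quality scores (0.0 to 1.0) per impact level.
MIN_QUALITY = {"low": 0.5, "high": 0.95}


def fit_for_purpose(quality_score: float, impact: str) -> bool:
    """Return True if a dataset's score meets the bar for the given impact."""
    if impact not in MIN_QUALITY:
        raise ValueError(f"unknown impact level: {impact!r}")
    return quality_score >= MIN_QUALITY[impact]


# The same dataset can pass for low-impact work and fail for high-impact work.
print(fit_for_purpose(0.7, "low"))
print(fit_for_purpose(0.7, "high"))
```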

−

~~Implement Robust Monitoring, Logging, and Audit Practices and Tools~~

+

Design Data Lakes with the elements necessary to deliver reliable analytical results to a variety of data consumers. The goal is to increase cross-business usage in order to deliver advanced analytical insights. Build Data Lakes for specific business units or analytics applications, rather than trying to implement some vague notion of a single enterprise Data Lake. However, alternative architectures, such as data hubs, are often a better fit for sharing data within an organization.

−

~~Monitoring provides visibility and detailed metrics of Kubernetes infrastructure. This includes granular metrics~~ on ~~usage and performance across all cloud providers or private~~ data ~~centers, regions, servers, networks, storage,~~ and ~~individual VMs or containers~~. ~~Improving~~ data ~~center efficiency and utilization~~ on ~~both on-premises and public cloud resources~~ is ~~the goal~~. Additionally, ~~logging is a complementary function and required capability~~ for effective ~~monitoring~~ is ~~also a goal~~. ~~Logging ensures that logs at every layer~~ of ~~the architecture are all captured for analysis, troubleshooting~~ and ~~diagnosis~~. ~~Centralized~~, ~~distributed, log~~ management and ~~visualization~~ is ~~a key capability<ref>Chemitiganti, Vamsi,~~ and ~~Fray, Peter. (February 20th, 2019). 7 Key Considerations for Kubernetes~~ in ~~Production~~. The ~~New Stack. 2019. Retrieved 16-May-2019 from: [https://thenewstack.io/7-key-considerations-for-kubernetes-in-production/]</ref>. Lastly~~, ~~routine auditing~~, ~~no matter the checks~~ and ~~balances put in place, will cover topics that normal monitoring will not cover~~. ~~Traditionally~~, ~~auditing~~ is ~~as a manual process~~, but ~~the automated tooling in the Kubernetes space is quickly improving~~.

+

<h2>References</h2>

−

~~Security~~

−

~~Security is~~ a ~~critical part~~ of ~~cloud native applications and Kubernetes~~ is ~~no exception~~. ~~Security is a constant throughout the container lifecycle and it is required throughout the design, development, DevOps, and infrastructure choices~~ for ~~container-based~~ applications~~. A range of technology choices are available~~ to ~~cover various areas such as application-level security and the security~~ of ~~the container and infrastructure itself~~. ~~Different tools that provide certification and security for what goes inside the container itself (such as image registry~~, ~~image signing~~, ~~packaging)~~, ~~Common Vulnerability Exposures/Enumeration (CVE) scans, and more<ref>Chemitiganti, Vamsi, and Fray, Peter. (February 20th, 2019). 7 Key Considerations~~ for Kubernetes in Production. The New Stack. 2019. Retrieved 16-May-2019 from: [https://thenewstack.io/7-key-considerations-for-kubernetes-in-production/]</ref>.. SSC will need to ensure appropriate security measures are used with any new Kubernetes initiatives, including the contents of the containers being orchestrated.

−

<h2>References</h2>

</div>

Kpere060
