Line 143: |
Line 143: |
| <p class="expand mw-collapsible-content">Data quality and trust in the data is a perennial issue for many organizations. Although data discovery tools can apply Machine Learning across related datasets from multiple data sources to identify anomalies (incorrect values, missing values, duplicates and outdated data), quality and trustworthiness of data continue to be an issue for Data Lakes, which can easily become data dumping grounds. Some data is more accurate than others. This can present a real problem for anyone using multiple data sets and making decisions based upon analysis conducted with data of varying degrees of quality.</p> | | <p class="expand mw-collapsible-content">Data quality and trust in the data is a perennial issue for many organizations. Although data discovery tools can apply Machine Learning across related datasets from multiple data sources to identify anomalies (incorrect values, missing values, duplicates and outdated data), quality and trustworthiness of data continue to be an issue for Data Lakes, which can easily become data dumping grounds. Some data is more accurate than others. This can present a real problem for anyone using multiple data sets and making decisions based upon analysis conducted with data of varying degrees of quality.</p> |
| <b>Data Swamps, Performance, and Flexibility Challenges</b> | | <b>Data Swamps, Performance, and Flexibility Challenges</b> |
− | <p class="expand inline mw-collapsible-content">Data stored in Data Lakes can sometimes become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. </p><p class="inline">. A Data Lake is not optimized for a high number of users or diverse and simultaneous workloads due to intensive query tasks. This can result in performance degradation and failures are common when running extractions, transformations, and loading tasks all at the same time. On-premises Data Lakes face other performance challenges in that they have a static configuration. </p> | + | <p class="expand inline mw-collapsible-content">Data stored in Data Lakes can sometimes become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. </p><p class="inline">A Data Lake is not optimized for a high number of users or diverse and simultaneous workloads due to intensive query tasks. This can result in performance degradation and failures are common when running extractions, transformations, and loading tasks all at the same time. On-premises Data Lakes face other performance challenges in that they have a static configuration. </p> |
| <b class="expand mw-collapsible-content">Data Hoarding and Storage Capacity</b> | | <b class="expand mw-collapsible-content">Data Hoarding and Storage Capacity</b> |
| <p class="expand mw-collapsible-content">Data stored in Data Lakes may actually never be used in production and stay unused indefinitely in the Data Lake. By storing massive amounts of historical data, the infinite Data Lake may skew analysis with data that is no longer relevant to the priorities of the business. In keeping the historical data the metadata describing it must be understood as well. This decreases the performance of the Data Lake by increasing the overall workload of employees to clean the datasets no longer in use for analysis.</p> | | <p class="expand mw-collapsible-content">Data stored in Data Lakes may actually never be used in production and stay unused indefinitely in the Data Lake. By storing massive amounts of historical data, the infinite Data Lake may skew analysis with data that is no longer relevant to the priorities of the business. In keeping the historical data the metadata describing it must be understood as well. This decreases the performance of the Data Lake by increasing the overall workload of employees to clean the datasets no longer in use for analysis.</p> |
Line 154: |
Line 154: |
| <p class="inline"></p>Data in a Data Lake lacks the standard security protection of a relational database management system or an enterprise database. In practice, this means that the data is unencrypted and lacks access control. <p class="expand inline mw-collapsible-content">Security is not just a binary solution. We have varying degrees of security (unclassified, secret, top secret, etc.), all of which require different approaches. This will inevitably present challenges with the successful use of data from Data Lakes. To combat this, organizations will have to embrace a new security framework that is compatible with Data Lakes and Data Scientists.</p> | | <p class="inline"></p>Data in a Data Lake lacks the standard security protection of a relational database management system or an enterprise database. In practice, this means that the data is unencrypted and lacks access control. <p class="expand inline mw-collapsible-content">Security is not just a binary solution. We have varying degrees of security (unclassified, secret, top secret, etc.), all of which require different approaches. This will inevitably present challenges with the successful use of data from Data Lakes. To combat this, organizations will have to embrace a new security framework that is compatible with Data Lakes and Data Scientists.</p> |
| <h4>Considerations</h4> | | <h4>Considerations</h4> |
− | <b>Strategic Resourcing and Network Planning</b>
| + | <p class="expand mw-collapsible-content">Shared Services Canada (SSC) has an excellent opportunity to capitalize on its mandate of providing data storage service to GC’s other departments. SSC, as the GC’s Service Provider, could potentially provide a centralized GC Data Lake and allow GC Data Scientists access to this central data using a single unified Data Lake interface. However, this is a project which should be implemented only after cloud has been adopted and the migration to enterprise data centers is complete, so that adequate infrastructure and scaling are available.</p> |
− | <p>A strategic approach to Kubernetes investments will need to be developed to ensure opportunities are properly leveraged. The GC invests a significant portion of its annual budget on IT and supporting infrastructure. Without strategic Kubernetes direction the fragmented approaches to IT investments, coupled with rapid developing technology and disjointed business practices, can undermine effective and efficient delivery of GC programs and services<ref>Treasury Board of Canada Secretariat. December 3, 2018. Directive on Management of Information Technology. Treasury Board of Canada Secretariat. Government of Canada. Retrieved 27-Dec-2018 from: <i>[https://www.tbs-sct.gc.ca/pol/doc-eng.aspx?id=15249 ]</i></ref>. A clear vision and mandate for how Kubernetes will transform services, and what the end-state Kubernetes initiative is supposed to look like, is a prominent consideration. </p> | + | <p class="inline">Data Lakes should not be confused with conventional databases, although both store information. A Data Lake will always underperform when tasked with the jobs of a conventional database. </p><p class="expand inline mw-collapsible-content">To combat this, SSC must create data architectures that define the proper application of Data Lakes. Too often, Data Lakes suffer from a lack of foresight about what they are supposed to achieve. </p><p class="inline">Creating a Data Lake becomes the goal rather than achieving a strategic objective. </p><p class="expand inline mw-collapsible-content"></p> |
− | <p>SSC should consider defining a network strategy for Kubernetes adoption. Multiple factors should be taken into account, including the amount of resources, funding, and expertise that will be required for the development and experimentation with Kubernetes technologies. Calculation of resource requirements including CPU, memory, storage, etc. at the start of Kubernetes projects is imperative. Considerations include whether or not an in-house Kubernetes solution is required or if a solution can be procured. Other strategy considerations include analyzing different orchestration approaches for different application use cases.</p> | + | <p class="expand mw-collapsible-content">Shared Services Canada (SSC) should consider designing Data Lake infrastructure around Service-Level Agreements (SLA) to keep Data Lake efforts on track. This includes ensuring that SSC has established clear goals for Data Lakes prior to deployment. </p> |
− | <b>Complexity and Skills Gap</b> | + | <p class="expand mw-collapsible-content">SSC should also consider building a specialized expert group focussed on advanced analytics and experimental data trend discovery in Data Lakes. While the fundamental assumption behind the Data Lake concept is that everyone accessing a Data Lake is moderately to highly skilled at data manipulation and analysis, the reality is that most are not. SSC should consider significant investment in training employees in the necessary skills, such as Data Science, Artificial Intelligence, Machine Learning, or Data Engineering.</p> |
− | <p>Kubernetes is a good technology and the de facto standard for orchestrating containers, and containers are the future of modern software delivery. But it is notoriously complex to manage for enterprise workloads, where Service Level Agreements (SLAs) are critical. The operational pain of managing production-grade Kubernetes is further complicated by the industry-wide talent scarcity and skills gap. Most organizations today struggle to hire Kubernetes experts, and even these “experts” lack advanced Kubernetes experience to ensure smooth operations at scale. SSC will need to be cautious in implementing Kubernetes and having the right staff experienced and comfortable in its use.</p> | + | <p>SSC should be cognisant that expectations surrounding Data Lakes are significantly inflated. Inflated expectations lead to vague and ambiguous use cases and increased chances of catastrophic failures. As a Service Provider, SSC must be strict in establishing clear goals for Data Lake provision efforts before deployment. SSC should be wary of attempts to replace strategy development with infrastructure. A Data Lake can be a technology component that supports a data and analytics strategy, but it cannot replace that strategy.</p> |
− | <b>Customization and Integration Still Required</b> | + | <p class="expand mw-collapsible-content">SSC should be concerned with the provision and running of the infrastructure; the departments themselves are responsible for the data they put in the Data Lake. However, as a Service Provider, SSC should monitor the Data Lake with regard to data governance, data lifecycle for data hygiene, and what is happening in the Data Lake overall. Depending on technology, SSC will need to be very clear on how to monitor activities in the Data Lakes it provides to the GC. </p> |
− | <p>Kubernetes technology and ecosystem are evolving rapidly, because of its relatively new state, it is hard to find packaged solutions with complete out-of-the-box support for complex, large-scale enterprise scenarios. As a large and sophisticated enterprise organization, SSC will need to devote significant resources on customization and training. Enterprise Architecture pros will need to focus on the whole architecture of cloud-native applications as well as keep a close watch on technology evolution and industry. </p> | + | <p class="expand mw-collapsible-content">SSC should consider a Data Lake implementation project as a way to introduce or reinvigorate a data management program by positioning data management capabilities as a prerequisite for a |
− | <p>Implementation usually takes longer than expected, however the consensus in the New Stack’s Kubernetes User Experience Survey is that Kubernetes reduces code deployment times, and increases the frequency of those deployments<ref>Williams, Alex, et al. The State of the Kubernetes Ecosystem. The New Stack. thenewstack.io. Retrieved 15-May-2019 from: <i>[https://thenewstack.io/ebooks/kubernetes/state-of-kubernetes-ecosystem/ ]</i></ref>. However, in the short run, the implementation phase does consume more human resources. Additionally, implementation takes longer than expected. The consensus is that Kubernetes reduces code deployment times, and increases the frequency of those deployments. However, in the short run, the implementation phase does consume more human resources.</p> | + | successful Data Lake. Data will need to be qualified before it reaches the Data Lake; this can and should be done in a system of record first. In this way the data can be organized to fit into the Data Lake implementation.
− | <b>Pilot Small and Scale Success</b>
| + | </p> |
− | <p>SSC may wish to consider evaluating the current Service Catalogue in order to determine where Kubernetes can be leveraged first to improve efficiencies, reduce costs, and reduce administrative burdens of existing services as well as how a new Kubernetes service could be delivered on a consistent basis. Any new procurements of devices or platforms should have high market value and can be on-boarded easily onto the GC network. SSC should avoid applying in-house Kubernetes for production mission-critical apps. Failure of in-house deployments is high and thus should be avoided. SSC should pilot and establish a Kubernetes test cluster. With all new cloud-based technologies, piloting is preferred. Focus should first be on a narrow set of objectives and a single application scenario to stand up a test cluster.</p> | + | <p class="expand mw-collapsible-content">SSC should create policies on how data is managed and cleaned in the Data Lake. Automated data governance technologies should be added to support advanced analytics. Standardizing on a specific type of governance tool is an issue which must be resolved. Additionally, planning for effective metadata management, considering metadata discovery, cataloguing and enterprise metadata management applied to Data Lake implementation, is vital. Rigorous application of data discipline and data hygiene is needed. To achieve this, SSC should use data management tools alongside these policies. The majority of Data Lake analysts will prefer to work with clean, enriched, and trusted data. However, data quality is relative to the task at hand. Low-quality data may be acceptable for low-impact analysis or distant forecasting, but unacceptable for tactical or high-impact analysis. SSC assessments should take this into account.</p> |
− | <b>Implement Robust Monitoring, Logging, and Audit Practices and Tools</b>
| + | <p>Design Data Lakes with the elements necessary to deliver reliable analytical results to a variety of data consumers. The goal is to increase cross-business usage in order to deliver advanced analytical insights. Build Data Lakes for specific business units or analytics applications, rather than try to implement some vague notion of a single enterprise Data Lake. However, alternative architectures, like data hubs, are often better fits for sharing data within an organization.</p> |
− | <p>Monitoring provides visibility and detailed metrics of Kubernetes infrastructure. This includes granular metrics on usage and performance across all cloud providers or private data centers, regions, servers, networks, storage, and individual VMs or containers. Improving data center efficiency and utilization on both on-premises and public cloud resources is the goal. Additionally, logging is a complementary function and required capability for effective monitoring is also a goal. Logging ensures that logs at every layer of the architecture are all captured for analysis, troubleshooting and diagnosis. Centralized, distributed, log management and visualization is a key capability<ref>Chemitiganti, Vamsi, and Fray, Peter. (February 20th, 2019). 7 Key Considerations for Kubernetes in Production. The New Stack. 2019. Retrieved 16-May-2019 from: <i>[https://thenewstack.io/7-key-considerations-for-kubernetes-in-production/]</i></ref>. Lastly, routine auditing, no matter the checks and balances put in place, will cover topics that normal monitoring will not cover. Traditionally, auditing is as a manual process, but the automated tooling in the Kubernetes space is quickly improving.</p> | + | <h2>References</h2> |
− | <b>Security</b>
| |
− | <p>Security is a critical part of cloud native applications and Kubernetes is no exception. Security is a constant throughout the container lifecycle and it is required throughout the design, development, DevOps, and infrastructure choices for container-based applications. A range of technology choices are available to cover various areas such as application-level security and the security of the container and infrastructure itself. Different tools that provide certification and security for what goes inside the container itself (such as image registry, image signing, packaging), Common Vulnerability Exposures/Enumeration (CVE) scans, and more<ref>Chemitiganti, Vamsi, and Fray, Peter. (February 20th, 2019). 7 Key Considerations for Kubernetes in Production. The New Stack. 2019. Retrieved 16-May-2019 from: <i>[https://thenewstack.io/7-key-considerations-for-kubernetes-in-production/]</i></ref>.. SSC will need to ensure appropriate security measures are used with any new Kubernetes initiatives, including the contents of the containers being orchestrated.</p> | |
− | | |
− | <h2>References</h2>
| |
| | | |
| </div> | | </div> |