Difference between revisions of "Technology Trends/Datalakes"
Revision as of 13:35, 18 July 2019
| | |
|---|---|
| Status | Translation |
| Initial release | May 5, 2019 |
| Latest version | July 18, 2019 |
| Official publication | Kubernetes.pdf |
A datalake is a central system or repository of data that is stored in its natural/raw format. A datalake acts as a single store for all enterprise data. Data is transformed using machine learning, advanced analytics, and visualization. Several forms of data can be hosted in a datalake. These include structured data from relational databases, unstructured data, semi-structured data, and binary data.
Business Brief
In an increasingly hyperconnected world, corporations and businesses are struggling with the storage, management, and quick availability of raw data. To break these data challenges down further:
- Data comes in many different structures.
  - Unstructured
  - Semi-Structured
  - Structured
- Data comes from many disparate sources.
  - Enterprise Applications
  - Raw Files
  - Operation and Security Logs
  - Financial Transactions
  - Internet of Things (IoT) Devices and Network Sensors
  - Websites
  - Scientific Research
- Data sources are often geographically distributed across multiple locations.
  - Datacenters
  - Remote Offices
  - Mobile Devices
To resolve these data challenges, data-oriented companies invented a new data storage mechanism called a Data Lake.
Data Lakes are essentially a technology platform for holding data. Their value to the business is only realized when applying data science skills to the lake.
To summarize, use cases for Data Lakes are still being discovered. Cloud providers are making it easier to procure Data Lakes, and today Data Lakes are primarily used by Research Institutions, Financial Services, Telecom, Media, Retail, Manufacturing, Healthcare, Pharma, Oil & Gas, and Governments.
Technology Brief
A Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
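Storing data as-is and structuring it only when it is queried is often called schema-on-read. The following is a minimal sketch of that idea, not a reference implementation; the event format and the field names (`device`, `temp_c`) are assumptions made purely for illustration.

```python
import json

# Raw events are kept exactly as they arrived. The field names used here
# ("device", "temp_c") are hypothetical and only serve the illustration.
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2019-07-18T13:35:00"}',
    '{"device": "sensor-2", "humidity": 40}',
    'not valid json at all',
]


def read_temperatures(raw_lines):
    """Apply a schema at read time: keep only records with a usable temp_c."""
    readings = []
    for line in raw_lines:
        try:
            record = json.loads(line)
            temperature = float(record["temp_c"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # The raw line stays in the lake; this particular query skips it.
            continue
        readings.append((record.get("device"), temperature))
    return readings


print(read_temperatures(RAW_EVENTS))
```

Nothing is rejected when the data is written; each query decides which records fit its needs, which is where both the flexibility and the governance burden of a Data Lake come from.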
The DAaaS model enables users to self-serve their data and analytic needs. Users browse the lake’s data catalog (a Datapedia) to find and select the available data and fill a metaphorical “shopping cart” (effectively an analytics sandbox) with data to work with. Once access is provisioned, users can use the analytics tools of their choice to develop models and gain insights. Subsequently, users can publish analytical models or push refined or transformed data back into the Data Lake to share with the larger community.
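The catalog and "shopping cart" workflow described above can be sketched in a few lines. This is an illustrative model only: the `CatalogEntry` and `Sandbox` classes, the `search` function, and the sample dataset names are all hypothetical and do not correspond to any particular product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogEntry:
    """One dataset listed in the lake's data catalog (hypothetical shape)."""
    name: str
    description: str
    tags: frozenset


@dataclass
class Sandbox:
    """The metaphorical shopping cart: datasets a user has selected."""
    owner: str
    datasets: list = field(default_factory=list)

    def add(self, entry):
        # Avoid adding the same dataset twice.
        if entry.name not in self.datasets:
            self.datasets.append(entry.name)


def search(catalog, keyword):
    """Return entries whose description or tags mention the keyword."""
    keyword = keyword.lower()
    return [
        entry for entry in catalog
        if keyword in entry.description.lower() or keyword in entry.tags
    ]


catalog = [
    CatalogEntry("net_logs", "Network sensor telemetry", frozenset({"iot", "logs"})),
    CatalogEntry("fin_tx", "Financial transactions", frozenset({"finance"})),
]
cart = Sandbox(owner="analyst")
for hit in search(catalog, "telemetry"):
    cart.add(hit)
print(cart.datasets)
```

In a real service, provisioning access and publishing refined data back to the lake would sit behind steps like `add`; they are omitted here to keep the sketch short.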
Industry Usage
There are a variety of ways Data Lakes are being used in the industry:
- Ingestion of semi-structured and unstructured data sources (aka big data) such as equipment readings, telemetry data, logs, streaming data, and so forth. A Data Lake is a great solution for storing IoT (Internet of Things) data, which has traditionally been more difficult to store, and it can support near real-time analysis. Optionally, you can also add structured data (i.e., data extracted from a relational data source) to a Data Lake if your objective is to make a single repository of all data available via the lake.
- Experimental analysis of data before its value or purpose has been fully defined. Agility is important for every business these days, so a Data Lake can play an important role in "proof of value" situations because of its "ELT" (extract, load, transform) approach, in which data is loaded first and transformed only when needed.
- Advanced analytics support. A Data Lake is useful for data scientists and analysts to provision and experiment with data.
- Archival and historical data storage. Sometimes data is used infrequently but does need to be available for analysis. A Data Lake can be very valuable in supporting an active archive strategy.
- Distributed processing capabilities associated with a logical data warehouse.
Canadian Government Use
In 2019, the Treasury Board of Canada Secretariat (TBS) partnered with Shared Services Canada (SSC) and other departments to identify a business lead to develop a Data Lake (a repository of raw data) service strategy so that the Government of Canada (GC) can take advantage of big data and market innovation to foster better analytics and promote horizontal data-sharing.
Data Lake development in the GC is a more recent initiative. This is mainly due to the GC focussing resources on the implementation of cloud initiatives. However, some GC departments are developing Data Lake environments in tandem with cloud initiatives.
Implications for Government Agencies
Value Proposition
Data Lakes can provide data to consumers more quickly by offering it in a more raw and easily accessible form. Data is stored in its native form with little to no processing, and the lake is optimized to store vast amounts of data in their native formats. By allowing the data to remain in its native format, a much timelier stream of data is available for unlimited queries and analysis. A Data Lake can help data consumers bypass the strict data retrieval and data structuring requirements of applications such as a data warehouse and/or data mart. This has the effect of improving a business’ data flexibility.
A Data Lake is not a single source of truth. A Data Lake is a central location in which data converges from all data sources and is stored, regardless of the data formatting.
The Data Lake architectural approach can store large volumes of data. This allows cross-cutting teams to pool their data in a central location and to complement their systems of record with systems of insight.
Combining structured, unstructured, and semi-structured data, often in the same data set, can yield business, predictive, and prescriptive insights that were not previously possible from a structured platform such as a data warehouse or data mart.
Challenges
Although Data Lake technology has many benefits for organizations dealing with big data, it has its own challenges. For example:
Data Governance and Semantic Issues
The biggest challenge for Data Lakes is to resolve assorted data governance requirements in a single centralized data platform. Data Lakes fail mostly when they lack governance, self-disciplined users, and a rational data flow. Often, Data Lake implementations are focused on storing data instead of managing the data. Data Lakes are not optimized for semantic enforcement or consistency. They are made for semantic flexibility, to allow anyone to provide context to data if they have the skills to do so.
Putting data in the same place does not remove its ambiguity or clarify its meaning. Data Lakes provide an unconstrained, “no compromises” storage environment without the data governance assurances common to data warehouses or data marts. Proper metadata is essential for a Data Lake; without appropriate metadata, the Data Lake will not work as intended. It is beneficial to think of metadata as the fish finder in the Data Lake.
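One way to make the metadata requirement concrete is to check every dataset for a minimum set of descriptive fields before it is accepted into the lake. The sketch below assumes a hypothetical list of required fields (`owner`, `source`, `description`, `ingested_on`); an actual governance policy would define its own.

```python
# Hypothetical minimum metadata; a real governance policy defines its own list.
REQUIRED_FIELDS = ("owner", "source", "description", "ingested_on")


def missing_metadata(metadata):
    """Return the required fields that are absent or blank."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = metadata.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


candidate = {"owner": "Finance", "source": "ERP export", "description": " "}
problems = missing_metadata(candidate)
if problems:
    print("Rejected, missing metadata:", ", ".join(problems))
```

A check like this does not guarantee that the metadata is accurate, only that it exists, so it complements rather than replaces governance by people.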
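Lack of Quality and Trust in Data
Data quality and trust in the data are perennial issues for many organizations. Although data discovery tools can apply Machine Learning across related datasets from multiple data sources to identify anomalies (incorrect values, missing values, duplicates, and outdated data), the quality and trustworthiness of data continue to be an issue for Data Lakes, which can easily become data dumping grounds. Some data is more accurate than other data. This can present a real problem for anyone using multiple data sets and making decisions based upon analysis conducted with data of varying degrees of quality.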
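Typical quality problems in a Data Lake include missing values, duplicates, and outdated records. The sketch below counts these three with simple rules; it is a rule-based illustration rather than the Machine Learning approach that data discovery tools may use, and the record layout (`id`, `value`, `updated`) is an assumption.

```python
from datetime import date


def quality_report(records, key, date_field, cutoff):
    """Count records with missing values, repeated keys, or stale dates."""
    report = {"missing": 0, "duplicates": 0, "outdated": 0}
    seen_keys = set()
    for record in records:
        # A record counts as "missing" if any field is None or empty.
        if any(value is None or value == "" for value in record.values()):
            report["missing"] += 1
        # A record counts as a duplicate if its key was already seen.
        record_key = record.get(key)
        if record_key in seen_keys:
            report["duplicates"] += 1
        seen_keys.add(record_key)
        # A record counts as outdated if its date is before the cutoff.
        updated = record.get(date_field)
        if isinstance(updated, date) and updated < cutoff:
            report["outdated"] += 1
    return report


sample = [
    {"id": 1, "value": 10, "updated": date(2019, 7, 1)},
    {"id": 1, "value": 10, "updated": date(2019, 7, 1)},
    {"id": 2, "value": None, "updated": date(2015, 1, 1)},
]
print(quality_report(sample, key="id", date_field="updated", cutoff=date(2018, 1, 1)))
```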
Data Swamps, Performance, and Flexibility Challenges
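Data stored in Data Lakes can sometimes become muddy when good data is mixed with bad data. Data Lake infrastructure is meant to store and process large amounts of data, usually in massive data files. A Data Lake is not optimized for a high number of users or for diverse and simultaneous workloads, because its query tasks are intensive. This can result in performance degradation, and failures are common when running extraction, transformation, and loading tasks all at the same time. On-premises Data Lakes face other performance challenges in that they have a static configuration.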
Data Hoarding and Storage Capacity
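Data stored in Data Lakes may never be used in production and may stay unused indefinitely. By storing massive amounts of historical data, the infinite Data Lake may skew analysis with data that is no longer relevant to the priorities of the business. When historical data is kept, the metadata describing it must be understood as well. This decreases the performance of the Data Lake by increasing the overall workload of employees, who must clean datasets that are no longer in use for analysis.
Storing increasingly massive amounts of data for an unlimited time will also lead to scalability and cost challenges. Scalability challenges are less of a risk in public cloud environments, but cost remains a factor. On-premises Data Lakes are more susceptible to cost challenges because their cluster nodes require all three dimensions of computing (storage, memory, and processing). Organizations of all kinds generate massive amounts of data (including metadata), and the volume is increasing exponentially.
The storage capacity of all this data (and future data) will be an ongoing challenge and one that will require constant management. While Data Lakes can and will be stored on the cloud, SSC as cloud broker for the GC will need to provide the appropriate infrastructure and scalability to clients.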
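Managing storage capacity usually starts with knowing which datasets are no longer being used. The following sketch flags datasets that have not been accessed within a chosen window and totals the space they occupy; the inventory fields (`name`, `size_gb`, `last_accessed`) and the 365-day threshold are assumptions for illustration.

```python
from datetime import date


def stale_datasets(datasets, today, max_idle_days):
    """Return datasets not accessed within the last max_idle_days days."""
    return [
        d for d in datasets
        if (today - d["last_accessed"]).days > max_idle_days
    ]


def reclaimable_gb(datasets, today, max_idle_days):
    """Total size, in GB, of the datasets flagged as stale."""
    return sum(d["size_gb"] for d in stale_datasets(datasets, today, max_idle_days))


inventory = [
    {"name": "logs_2015", "size_gb": 800, "last_accessed": date(2016, 1, 10)},
    {"name": "sales_2019", "size_gb": 120, "last_accessed": date(2019, 7, 1)},
]
today = date(2019, 7, 18)
print([d["name"] for d in stale_datasets(inventory, today, 365)])
print(reclaimable_gb(inventory, today, 365))
```

Flagged datasets are candidates for review, archiving to cheaper storage, or deletion under the organization's retention rules; the sketch deliberately stops short of deleting anything.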
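Advanced Users Required
Data Lakes are not a platform to be explored by everyone. Data Lakes present an unrefined view of data that usually only the most highly skilled analysts are able to explore and refine independently of a formal system of record such as a data warehouse.
Not everyone in an organization is data-literate enough to derive value from large amounts of raw or uncurated data. In reality, only a handful of staff are skilled enough to navigate a Data Lake. Since Data Lakes store raw data, their business value is entirely determined by the skills of Data Lake users. These skills are often lacking in an organization.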
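Data Security
Data in a Data Lake lacks the standard security protections of a relational database management system or an enterprise database. In practice, this often means that the data is unencrypted and lacks access control. Security is not just a binary solution. There are varying degrees of security (unclassified, secret, top secret, etc.), all of which require different approaches. This will inevitably present challenges with the successful use of data from Data Lakes. To combat this, organizations will have to embrace a new security framework that is compatible with Data Lakes and Data Scientists.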
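A basic form of access control is to label each dataset with a classification level and compare it with the clearance of the user requesting it. The sketch below uses three illustrative levels (unclassified, secret, top secret); it is an assumption-laden example and not a complete classification scheme or security framework.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Illustrative levels only; a real scheme may define more."""
    UNCLASSIFIED = 0
    SECRET = 1
    TOP_SECRET = 2


def can_read(user_clearance, dataset_classification):
    """A user may read a dataset at or below their own clearance level."""
    return user_clearance >= dataset_classification


def readable(datasets, user_clearance):
    """Filter a mapping of dataset name -> classification for one user."""
    return sorted(
        name for name, level in datasets.items()
        if can_read(user_clearance, level)
    )


lake = {
    "public_stats": Classification.UNCLASSIFIED,
    "incident_logs": Classification.SECRET,
    "threat_intel": Classification.TOP_SECRET,
}
print(readable(lake, Classification.SECRET))
```

A check like this only decides who may read what; encryption at rest and in transit, auditing, and need-to-know rules are separate concerns that it does not address.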
Considerations
Strategic Resourcing and Network Planning
A strategic approach to Kubernetes investments will need to be developed to ensure opportunities are properly leveraged. The GC invests a significant portion of its annual budget in IT and supporting infrastructure. Without strategic Kubernetes direction, fragmented approaches to IT investments, coupled with rapidly developing technology and disjointed business practices, can undermine the effective and efficient delivery of GC programs and services[1]. A clear vision and mandate for how Kubernetes will transform services, and what the end-state Kubernetes initiative is supposed to look like, is a prominent consideration.
SSC should consider defining a network strategy for Kubernetes adoption. Multiple factors should be taken into account, including the amount of resources, funding, and expertise that will be required for the development and experimentation with Kubernetes technologies. Calculation of resource requirements including CPU, memory, storage, etc. at the start of Kubernetes projects is imperative. Considerations include whether or not an in-house Kubernetes solution is required or if a solution can be procured. Other strategy considerations include analyzing different orchestration approaches for different application use cases.
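An up-front resource calculation can be as simple as summing the CPU and memory each workload requests across its replicas and adding headroom. The sketch below assumes a hypothetical workload list and a 20 percent headroom figure; real sizing would also account for storage, system overhead, and peak load.

```python
def cluster_requirements(workloads, headroom_percent=20):
    """Sum per-replica requests and add integer-percent headroom, rounding up."""
    cpu = sum(w["cpu_millicores"] * w["replicas"] for w in workloads)
    memory = sum(w["memory_mib"] * w["replicas"] for w in workloads)
    factor = 100 + headroom_percent
    return {
        # -(-x // 100) is ceiling division, kept in integers to avoid float error.
        "cpu_millicores": -(-cpu * factor // 100),
        "memory_mib": -(-memory * factor // 100),
    }


# Hypothetical workloads used only to illustrate the arithmetic.
planned = [
    {"name": "web", "cpu_millicores": 500, "memory_mib": 256, "replicas": 3},
    {"name": "worker", "cpu_millicores": 250, "memory_mib": 512, "replicas": 2},
]
print(cluster_requirements(planned))
```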
Complexity and Skills Gap
Kubernetes is a good technology and the de facto standard for orchestrating containers, and containers are the future of modern software delivery. But it is notoriously complex to manage for enterprise workloads, where Service Level Agreements (SLAs) are critical. The operational pain of managing production-grade Kubernetes is further complicated by the industry-wide talent scarcity and skills gap. Most organizations today struggle to hire Kubernetes experts, and even these “experts” lack the advanced Kubernetes experience needed to ensure smooth operations at scale. SSC will need to be cautious in implementing Kubernetes and will need staff who are experienced and comfortable in its use.
Customization and Integration Still Required
The Kubernetes technology and ecosystem are evolving rapidly. Because the technology is relatively new, it is hard to find packaged solutions with complete out-of-the-box support for complex, large-scale enterprise scenarios. As a large and sophisticated enterprise organization, SSC will need to devote significant resources to customization and training. Enterprise Architecture professionals will need to focus on the whole architecture of cloud-native applications as well as keep a close watch on the evolution of the technology and the industry.
Implementation usually takes longer than expected. However, the consensus in the New Stack’s Kubernetes User Experience Survey is that Kubernetes reduces code deployment times and increases the frequency of those deployments[2]. In the short run, though, the implementation phase does consume more human resources.
Pilot Small and Scale Success
SSC may wish to consider evaluating the current Service Catalogue in order to determine where Kubernetes can be leveraged first to improve efficiencies, reduce costs, and reduce the administrative burdens of existing services, as well as how a new Kubernetes service could be delivered on a consistent basis. Any new procurements of devices or platforms should have high market value and be easy to onboard onto the GC network. SSC should avoid using in-house Kubernetes for production mission-critical apps, because the failure rate of in-house deployments is high. SSC should pilot and establish a Kubernetes test cluster. As with all new cloud-based technologies, piloting is preferred. The focus should first be on a narrow set of objectives and a single application scenario to stand up a test cluster.
Implement Robust Monitoring, Logging, and Audit Practices and Tools
Monitoring provides visibility and detailed metrics of Kubernetes infrastructure. This includes granular metrics on usage and performance across all cloud providers or private data centers, regions, servers, networks, storage, and individual VMs or containers. The goal is to improve data center efficiency and utilization for both on-premises and public cloud resources. Logging is a complementary function and a required capability for effective monitoring. Logging ensures that logs at every layer of the architecture are captured for analysis, troubleshooting, and diagnosis. Centralized, distributed log management and visualization is a key capability[3]. Lastly, routine auditing, no matter the checks and balances put in place, will cover topics that normal monitoring will not. Traditionally, auditing has been a manual process, but the automated tooling in the Kubernetes space is quickly improving.
Security
Security is a critical part of cloud native applications and Kubernetes is no exception. Security is a constant throughout the container lifecycle, and it is required throughout the design, development, DevOps, and infrastructure choices for container-based applications. A range of technology choices are available to cover various areas such as application-level security and the security of the container and infrastructure itself. Different tools provide certification and security for what goes inside the container itself (such as image registry, image signing, and packaging), Common Vulnerabilities and Exposures (CVE) scans, and more[4]. SSC will need to ensure appropriate security measures are used with any new Kubernetes initiatives, including for the contents of the containers being orchestrated.
References
- [1] Treasury Board of Canada Secretariat. December 3, 2018. Directive on Management of Information Technology. Treasury Board of Canada Secretariat. Government of Canada. Retrieved 27-Dec-2018 from: [2]
- [2] Williams, Alex, et al. The State of the Kubernetes Ecosystem. The New Stack. thenewstack.io. Retrieved 15-May-2019 from: [3]
- [3] Chemitiganti, Vamsi, and Fray, Peter. (February 20th, 2019). 7 Key Considerations for Kubernetes in Production. The New Stack. 2019. Retrieved 16-May-2019 from: [4]
- [4] Chemitiganti, Vamsi, and Fray, Peter. (February 20th, 2019). 7 Key Considerations for Kubernetes in Production. The New Stack. 2019. Retrieved 16-May-2019 from: [5]