The Cloudera foundation is built upon the Apache Hadoop framework and employs the largest group of committers under one roof. Cloudera enables organizations to capture, store, analyze and act on any data at massive speed and scale in a single data solution using Hadoop platforms.
Cloudera is agnostic to hardware, and our solutions can be optimized for both cloud and on-premises environments. As a result, Cloudera has a vast partner ecosystem, and we pride ourselves on our solutions being highly compatible with our customers’ existing environments and service providers. This allows our solution to be molded to each environment for a custom experience, rather than wasting time and resources on solutions that are incompatible with the hardware, environment or service providers already in place and that deplete the budget before the proposed solution is even installed.
Your goals of modernizing legacy systems and better harnessing your data are the mission we at Cloudera share. We strive to bring a comprehensive set of data analytics solutions to data anywhere the enterprise needs to work, from the Edge to AI.
By implementing an open source data platform supported by Cloudera on your own infrastructure, in the cloud or a hybrid of both, we expect you can achieve the following core benefits as we enable your Data Lake:
- New Efficiencies for data architecture through a significantly lower cost storage platform by leveraging the industry’s only secure enterprise-ready open source Hadoop distribution. A modern data architecture will allow you to integrate, store and process all enterprise data regardless of source, format, and type at a fraction of the cost of proprietary solutions.
- Capture Data in Motion in a secure, traceable way to tap the potential of streaming data analytics, data routing and seamless data ingestion from Dubai Municipality-owned or public data sources.
- New Opportunities, Innovation & Insights by providing data scientists, business analysts, and data developers with the ability to easily access and query all enterprise data within one environment from batch to real time using the tools they are most familiar with.
Building Futuristic Platform (AI-ML, Big data and BI)
Cloudera offers an Enterprise-Supported Full Data Lifecycle
From autonomous vehicles, to surgical robots to churn prevention and fraud detection, enterprises rely on data to uncover new insights and power world-changing solutions. It all starts with a data platform that enables you to say “yes”.
- Yes, to the analytics your people want to use.
- Yes, to operating on any cloud your business requires.
- Yes, to the future with a cloud-native platform that flexes to meet your needs today and tomorrow. And we have delivered.
Cloudera Data Platform is the industry’s first enterprise data cloud:
- Multi-function analytics on a unified platform that eliminates silos and speeds the discovery of data-driven insights.
- A shared data experience that applies consistent security, governance, and metadata.
- True hybrid capability with support for public cloud, multi-cloud, private cloud, and on-premises deployments.
Cloudera Shared Data Experience (SDX)
SDX is the security and governance fabric that binds the enterprise data cloud. SDX enables data and metadata security and governance policies to be set once and automatically enforced across data analytics in hybrid and multi-cloud environments. Unlike standalone analytics software solutions or cloud services, Cloudera Data Platform with SDX delivers powerful enterprise-wide controls over data and metadata, anywhere, for ultimate infrastructure and business flexibility.
Cloudera Data Platform (CDP)
CDP is an easy, fast, and secure enterprise analytics and management platform with the following capabilities:
- Enables ingesting, managing, and delivering of any analytics workload from Edge to AI.
- Provides enterprise grade security and governance.
- Provides self-service access to integrated, multi-function analytics on centrally managed and secured business data.
- Provides a consistent experience on Public Cloud, Multi-Cloud, and Private Cloud deployments
CDP powers data-driven decision making by easily, quickly, and safely connecting and securing the entire data lifecycle. For this, data moves through a lifecycle in five distinct phases.
CDP gives you complete visibility into all your data with no blind spots. The CDP control plane allows you to manage the data, infrastructure, analytics, and analytic workloads across hybrid and multi-cloud environments, with Cloudera Shared Data Experience (SDX) providing consistent security and governance across the entire data lifecycle. You can manage and secure the data lifecycle in any cloud and data center with CDP.
CDP enables you to:
- Automatically spin up workloads when needed and suspend their operation when complete thereby controlling the cloud costs.
- Optimize workloads based on analytics and machine learning.
- View data lineage across any cloud and transient clusters
- Use a single pane of glass across hybrid and multi-clouds.
- Scale to petabytes of data and 1,000s of diverse users
- Centrally control customer and operational data across multi-cloud and hybrid environment
Cloudera CDP provides a unified platform to cost-effectively collect, store and manage unlimited volumes of any structured, semi-structured and unstructured data.
Cloudera’s Enterprise Data Hub (EDH) consists of
- CDP (Cloudera’s Distribution including Hadoop)
- Cloudera’s Enterprise Management, Governance and Security layer.
- Cloudera’s DataFlow (CDF)
The above diagram explains how we deliver the risk management operating model, ensuring that the risk profiling process moves from the federal level to the local level and vice versa without disrupting trade in the local customs departments.
Data Lake as the single point of truth
CDP is 100% Apache-licensed open source and offers unified batch processing, interactive SQL, and interactive search, and role-based access controls. More enterprises have downloaded CDP than all other such distributions combined. CDP includes the core elements of Apache Hadoop plus several additional key open-source projects that, when coupled with customer support, management, and governance through a Cloudera Enterprise subscription, can deliver an enterprise data hub.
- Flexible – Store any type of data and process it with an array of different computation frameworks including batch processing, interactive SQL, free text search, machine learning & statistical computation.
- Integrated – Get up and running quickly on a complete, packaged, Hadoop platform.
- Secure – Process and control sensitive data and facilitate multi-tenancy.
- Scalable & Extensible – Enable a broad range of applications and scale them with your business.
- Highly Available – Run mission-critical workloads with confidence.
- Compatible – Extend and leverage existing IT investments.
A unified control plane manages infrastructure, data, and analytic workloads across a hybrid environment that already spans data and compute in on-premises HDFS (and public clouds if needed). It also allows FCA to implement future use cases flexibly and easily, so that delivering the services that enable those use cases to the business is as smooth as possible and takes a matter of minutes.
Consistent data security, governance and control that safeguards data privacy, regulatory compliance, and prevents cybersecurity threats across environments.
CDP is the industry’s first enterprise data platform. CDP delivers powerful self-service analytics across hybrid, on-premise and multi-cloud environments, along with sophisticated and granular security & governance policies that data leaders demand.
Delivered as a private cloud service, CDP includes Machine Learning services as well as a Data Hub service for building custom business applications, powered by our new Cloudera Runtime open-source distribution. The services (Machine Learning here) all run on the Red Hat OpenShift containerization and Kubernetes platform, which provides the elasticity to run hundreds of workloads, each enabled for automatic workload performance adaptation and in-place auto-scaling.
We are pleased to submit the following information to the FCA. Our solution foundation is built upon the Apache Hadoop framework and employs the largest group of committers under one roof. We enable organizations like FCA to capture, store, analyze and act on any data at massive speed and scale in a single data solution using Hadoop platforms. We pride ourselves on being agnostic to hardware, and our solutions can be optimized for both cloud and on-premises environments. As a result, we have a vast partner ecosystem, and our solutions are highly compatible with our customers’ existing environments and service providers. This allows our solution to be molded to your environment for a custom experience, rather than FCA wasting time and resources on solutions that are incompatible with the hardware, environment or service providers already in place and that deplete the budget before the proposed solution is even installed. Your goals of modernizing legacy systems and better harnessing your data are the mission we at Palmira share. We strive to bring a comprehensive set of data analytics solutions to data anywhere the enterprise needs to work, from the Edge to AI.
By implementing an open-source data platform supported by Cloudera on your own infrastructure, in the cloud or a hybrid of both, we expect FCA can achieve the following core benefits as we enable your Data Lake:
- New Efficiencies for data architecture through a significantly lower cost storage platform by leveraging the industry’s only secure enterprise-ready open-source Hadoop distribution. A modern data architecture will allow you to integrate, store and process all enterprise data regardless of source, format, and type at a fraction of the cost of proprietary solutions.
- Capture Data in Motion in a secure, traceable way to tap the potential of streaming data analytics, data routing and seamless data ingestion from [insert Client] owned, or public data sources.
- New Opportunities, Innovation & Insights by providing data scientists, business analysts, and data developers with the ability to easily access and query all enterprise data within one environment, from batch to real time, using the tools they are most familiar with.
After thoroughly studying FCA’s requirements, we recommend utilizing Cloudera for Big data and analytics and Tableau for BI requirements.
CDSW (Cloudera Data Science Workbench):
CDSW, a web-based collaborative development tool for data scientists, is proposed as a tightly integrated component of the Cloudera platform. CDSW provides a secure environment for data scientists that integrates with the Cloudera platform and gives them access to the full data available on the Hadoop cluster. To avoid reinventing the wheel, all security policies attached to Hadoop users can be inherited in CDSW environments with a few clicks, providing a seamless experience for data scientists. CDSW offers the flexibility to develop in multiple languages (Python, Scala and R), with different versions supported for each. A single user can have multiple projects across different languages, with different permissions and privileges assigned to user lists inherited from the LDAP directory already integrated with Hadoop.
Cloudera Data Science Workbench is a web application that allows data scientists to use their favorite open-source libraries and languages — including R, Python, and Scala — directly in secure environments, accelerating analytics projects from exploration to production.
Built using container technology, Cloudera Data Science Workbench offers data science teams per project isolation and reproducibility, in addition to easier collaboration. It supports full authentication and access controls against data in the cluster, including complete, zero-effort Kerberos integration which means full, tight and seamless integration of Cloudera EDH users and security configuration. Add it to an existing cluster, and it just works. With Cloudera Data Science Workbench, data scientists can:
- Use R, Python, or Scala on the cluster from a web browser, with no desktop footprint.
- Install any library or framework within isolated project environments.
- Directly access data in secure clusters with Spark and Impala.
- Share insights with their team for reproducible, collaborative research.
- Automate and monitor data pipelines using built-in job scheduling.
Cloudera Data Flow:
Cloudera DataFlow (CDF) is a scalable, real-time streaming data platform that collects, curates, and analyzes data so customers gain key insights for immediately actionable intelligence. It meets the challenges faced with data-in-motion, such as real-time stream processing, data provenance, and data ingestion from IoT devices and other streaming sources. Built on 100% open-source technology, CDF helps you deliver a better customer experience, boost your operational efficiency and stay ahead of the competition across all your strategic digital initiatives. CDF is very similar to HDF, and the foundations are the same. CDF is available on EDH while HDF is only available for HDP.
Recently, the rhythm and nature of organizations’ needs have shifted, with requirements that go beyond traditional and legacy systems. Use cases are becoming more complex, systems are siloed, and data sources are becoming unpredictable in both nature and data format. Organizations still need batch ingestion to acquire data from transactional and operational systems, but they now need streaming and real-time data more than ever before. Cloudera has been a pioneer in the streaming area, adopting and integrating the latest and most sophisticated streaming technologies with Cloudera Data Platform.
Cloudera Data Flow (CDF) is a complete portfolio that provides best of breed technologies in data in motion, streaming, IoT and data ingestion field. Based on FCA requirements, we are proposing data messaging and streaming technologies as part of this proposal. However, for any future use cases that might have IoT requirements; it is easy to procure and integrate any of the CDF portfolio to achieve the required functionalities.
Cloudera Flow Management (Apache NiFi)
Cloudera Flow Management (CFM) is a no-code data ingestion, movement and management solution powered by Apache NiFi. With NiFi’s intuitive graphical interface and 300+ processors, CFM delivers highly scalable data movement, transformation and management capabilities to the enterprise.
Apache NiFi is meant for large-scale, high-velocity enterprise data ingestion use cases. Although primarily meant for real-time streaming sources such as clickstreams, social streams and log data, Apache NiFi can handle all types of data across any type of data source. NiFi Registry, which augments NiFi, supports DevOps teams with versioning, deployment and development of flow applications.
Only red-rectangled components are proposed per CLIENT NAME requirements
Apache NiFi has an intuitive user interface for designing data flow orchestrations that acquire, process and route data from any source to any target. This is accomplished with a no-code approach: flows are designed by dragging and dropping pre-built processors onto the canvas and connecting them.
Apache NiFi provides the following unique features and capabilities:
- Intuitive visual design tool
- Flow templates
- Guaranteed delivery.
- Prioritized queuing
- Flow Specific QoS (latency v throughput, loss tolerance, etc.)
- Data Provenance
- Comprehensive security (Authentication and Authorization)
- Extensible architecture
- Site-to-site communication protocol
- Flexible scaling model
- Parametrization for seamless deployment of flows
- Stateless NiFi execution mode for extremely high performance
Cloudera Stream Processing (Apache Kafka) is a streaming platform that provides the following capabilities:
- High throughput and low latency: Kafka supports hundreds of thousands of messages per second, with latencies as low as a few milliseconds.
- Scalability: A Kafka cluster can be elastically and transparently expanded without
- Durability and reliability: Messages are persisted on disk and replicated within the cluster to prevent data loss.
- Fault tolerance: The platform is immune to machine failure in the Kafka cluster.
- High concurrency: Ability to simultaneously handle thousands of diverse clients, simultaneously writing to and reading from Kafka.
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Process streams of records as they occur.
Through its integration with CDF and hence CDP, you can build complete workloads within a single platform. Only Cloudera provides simple deployment and robust troubleshooting and monitoring of Kafka, as well as shared compliance-ready security, governance, lineage, and control in one simple application across multiple on-premises, hybrid, private, public, or multi-cloud environments.
Kafka Streams is the built-in stream processing library of the Apache Kafka project and provides real-time stream processing and analytics with high throughput and very low latency. It is a good fit if you are developing solely within a Kafka-to-Kafka pipeline, you don’t need or want another cluster for stream processing and analytics in the future, and operational and resilience requirements are simple or handled elsewhere. Kafka Streams enables you to perform common stream processing functions like filtering, joins, aggregations, and enrichments on the data stream. Good use cases include building lightweight microservices, straightforward ETL jobs, and simple stream analytics apps. It is mainly used for building real-time streaming data pipelines and real-time streaming applications that transform or react to streams of data.
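The stream functions named above (filtering, enrichment, aggregation) can be illustrated with a small conceptual sketch. Kafka Streams itself is a Java library; the Python below does not use its API and only mimics the shape of such a pipeline over an in-memory list. The record fields, the `REGIONS` lookup table and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator

# Hypothetical lookup table used for enrichment (not part of any Cloudera or Kafka API).
REGIONS: Dict[str, str] = {"u1": "north", "u2": "south"}


def filter_valid(records: Iterable[dict]) -> Iterator[dict]:
    """Filtering: drop records with a non-positive amount."""
    return (r for r in records if r["amount"] > 0)


def enrich(records: Iterable[dict]) -> Iterator[dict]:
    """Enrichment: attach a region looked up from the user id."""
    for r in records:
        yield {**r, "region": REGIONS.get(r["user"], "unknown")}


def total_by_region(records: Iterable[dict]) -> Dict[str, int]:
    """Aggregation: sum of amounts per region."""
    totals: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)


events = [
    {"user": "u1", "amount": 10},
    {"user": "u2", "amount": -5},  # removed by the filter step
    {"user": "u1", "amount": 7},
    {"user": "u3", "amount": 3},   # no region known for this user
]
totals = total_by_region(enrich(filter_valid(events)))
```

In a real deployment each step would read from and write to Kafka topics rather than Python lists; the sketch only shows how the three operations compose.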
Kafka Cruise Control
Kafka Cruise Control enables you to manage and load balance large Kafka installations. It is the solution for platform teams that need first-class management services that address hard problems such as frequent hardware/virtual machine failures, cluster expansion/reduction, and load skew among brokers. It solves these challenges by balancing the cluster intelligently and with automated anomaly detection and remediation.
While it automatically balances partitions based on user-defined goals, Kafka Cruise Control also detects and actively addresses anomalies. For example, if there is a broker failure, Kafka Cruise Control will fix the cluster by removing the failed brokers. In the case of disk failure, all the offline replicas will be moved to healthy brokers. Kafka Cruise Control is a very important component of CDF since it provides the foundation for first class Kafka Cloud Workload
Schema Registry is an important component of the Cloudera Kafka ecosystem because it enables your teams to safely mitigate interruptions that occur due to schema mismatches. It manages, shares, and supports the evolution of all producer and consumer schemas across the Kafka landscape. You can also avoid having to attach a schema to every piece of data.
As part of CDF’s streams messaging capabilities, Schema Registry provides a shared repository of schemas that allows applications to flexibly interact with each other across the Kafka landscape by using the same schemas from end-to-end. This is particularly useful for managing data flows with schema-based routing. For example, parsing a syslog event to extract the event type, and then based on that type, route it to a downstream Kafka topic.
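The syslog routing example above can be sketched as follows. This is a minimal illustration, not Schema Registry or NiFi code: it assumes RFC 3164-style syslog lines, treats the syslog tag (program name) as the "event type", and uses hypothetical topic names. It returns the name of the downstream topic rather than publishing to Kafka.

```python
import re

# RFC 3164-style line: "<PRI>Mmm dd hh:mm:ss host tag[pid]: message"
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<tag>[^:\[\s]+)(?:\[\d+\])?: "
    r"(?P<message>.*)$"
)

# Hypothetical mapping from event type (here: the syslog tag) to a topic name.
TOPIC_BY_EVENT_TYPE = {
    "sshd": "security-events",
    "su": "security-events",
    "cron": "scheduler-events",
}
DEFAULT_TOPIC = "unclassified-events"
UNPARSEABLE_TOPIC = "unparseable-events"


def route(line: str) -> str:
    """Return the downstream topic name for one syslog line."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return UNPARSEABLE_TOPIC
    return TOPIC_BY_EVENT_TYPE.get(match.group("tag"), DEFAULT_TOPIC)
```

For example, `route("<13>Feb  5 17:32:18 host1 cron[1234]: job started")` selects the scheduler topic, while a line that does not match the pattern is sent to a separate topic instead of being dropped.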
The screenshot below shows how you would use the Schema Registry UI to create schema groups, schema metadata, and add schema versions.
Figure: Cloudera Schema Registry GUI
Apache Kafka is formed of multiple components that together make it the highest-throughput messaging platform:
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka’s performance is effectively constant with respect to data size so storing data for a long time is not a problem.
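The offset and retention behavior described above can be modeled with a toy in-memory log. This is a simplified sketch and not how Kafka is implemented: the class and method names are hypothetical, timestamps are passed in explicitly, and records are expired one by one.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartitionLog:
    """Toy append-only partition: offsets are sequential and never reused."""

    retention_seconds: float
    _records: List[Tuple[int, float, str]] = field(default_factory=list)  # (offset, timestamp, value)
    _next_offset: int = 0

    def append(self, value: str, now: float) -> int:
        """Append a record and return the offset assigned to it."""
        offset = self._next_offset
        self._records.append((offset, now, value))
        self._next_offset += 1
        return offset

    def expire(self, now: float) -> None:
        """Discard records older than the retention period, consumed or not."""
        cutoff = now - self.retention_seconds
        self._records = [r for r in self._records if r[1] >= cutoff]

    def read_from(self, offset: int) -> List[Tuple[int, str]]:
        """Return (offset, value) pairs at or after the given offset."""
        return [(o, v) for o, _, v in self._records if o >= offset]


TWO_DAYS = 2 * 24 * 3600
log = PartitionLog(retention_seconds=TWO_DAYS)
first = log.append("a", now=0)
second = log.append("b", now=100_000)
log.expire(now=TWO_DAYS + 1)  # "a" is now older than two days and is discarded
third = log.append("c", now=TWO_DAYS + 2)
```

Note that the offset of "c" continues the sequence even though "a" was discarded, which is what lets a consumer track its position with a single number per partition.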
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say based on some key in the record).
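The two partitioning strategies mentioned above can be sketched in a few lines. This is an illustration under stated assumptions: the partition count is arbitrary, `choose_partition` is a hypothetical helper, and CRC32 is a stand-in hash chosen because it is in the Python standard library, not the hash the Kafka clients use.

```python
import itertools
import zlib
from typing import Optional

NUM_PARTITIONS = 3  # arbitrary value for the sketch
_round_robin = itertools.cycle(range(NUM_PARTITIONS))


def choose_partition(key: Optional[bytes]) -> int:
    """Pick a partition: round-robin without a key, hash of the key otherwise."""
    if key is None:
        return next(_round_robin)
    return zlib.crc32(key) % NUM_PARTITIONS


# Keyless records are spread evenly across partitions.
keyless = [choose_partition(None) for _ in range(4)]

# Records sharing a key always land in the same partition,
# which preserves their relative order.
keyed = [choose_partition(b"order-42") for _ in range(3)]
```

Key-based partitioning is the usual choice when all records for one entity (an order, a device, a customer) must be read in the order they were written.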
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
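The difference between the two cases above can be shown with a small simulation. It is a sketch only: `deliver` is a hypothetical function, and it hands out individual records round-robin within a group, whereas Kafka assigns whole partitions to the instances of a group.

```python
from collections import defaultdict
from typing import Dict, List


def deliver(records: List[str], subscriptions: Dict[str, str]) -> Dict[str, List[str]]:
    """Simulate delivery of records to consumer instances.

    `subscriptions` maps a consumer instance name to its consumer group name.
    Each record goes to exactly one instance within each group.
    """
    members: Dict[str, List[str]] = defaultdict(list)
    for instance, group in subscriptions.items():
        members[group].append(instance)

    received: Dict[str, List[str]] = {instance: [] for instance in subscriptions}
    for index, record in enumerate(records):
        for group_members in members.values():
            # Per-record round-robin stands in for Kafka's partition assignment.
            chosen = group_members[index % len(group_members)]
            received[chosen].append(record)
    return received


records = ["r0", "r1", "r2", "r3"]
same_group = deliver(records, {"c1": "g", "c2": "g"})             # load balanced
different_groups = deliver(records, {"c1": "g1", "c2": "g2"})     # broadcast
```

With one shared group the two instances split the records between them; with two groups each instance receives every record.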
- TIP: Cloudera Stream Processing Microservices
Cloudera Data Platform has supported microservices since its early days, without requiring the data platform to be present, as these microservices are highly decoupled and independently scalable. Please refer to the following blog for full details on microservices:
Open-Source Dedication and Innovation
Cloudera is dedicated to the Kafka ecosystem and continues to be actively involved with the Kafka open-source community through deep engineering relationships with other Kafka committers. This relationship has led to critical innovations and product improvements, many of which have been described here. SMM (Streams Messaging Manager) was developed because Kafka does not inherently have a user interface, and so IT teams across the enterprise struggled to understand what went on within their Kafka clusters. We created this unified toolset as a response to what our Kafka customers most needed.
Streams Replication Manager, which is directly incorporated into Streams Messaging Manager, is another example of best-in-class engineering and innovation. We improved upon the original Kafka open-source messaging replication tool by infusing the concepts of clusters, global configuration, and global management APIs. The result is a comprehensive Kafka replication platform that not only guarantees high availability and durability across large Kafka architectures but also enables a number of other business critical use cases such as geo proximity and cloud migrations.
Figure: Cloudera Shared Data Experience
Multi Cloud and Hybrid Cloud Support
CDP is the world’s first enterprise data cloud and thus we are able to help our customers support streaming architectures that must retain an on-premises footprint for sensitive applications but, for the rest, need to leverage the cost efficiencies of public cloud providers. All of the Kafka ecosystem components can be instantly provisioned into your on-premise, private or favorite public cloud while leveraging the unified data security, governance, lineage, and control provided through CDP’s Shared Data Experience.
Security, Data Governance, and Data Lineage are First Class Citizens
The most important part of a large Kafka ecosystem is how it is pulled together, and CDP’s Shared Data Experience (SDX) is by far the biggest differentiator when compared to other Kafka platform providers. This is because data security, control policies, governance, and lineage are set once and automatically enforced on every data platform and across all components of your streaming architecture.
High Level Platform Abstraction
Another important aspect to close with is that the integration of CDF and CDP provides a unified platform that handles the complexities of connecting, managing, and integrating the tenets of flow management, streams messaging, and stream processing and analytics through a high level of abstraction. This empowers the business teams to fulfill their share of the responsibility of delivering, in near real-time, the innovative products and services that their customers, employees, and regulators expect. Cloudera is superior because we provide a whole Kafka ecosystem that is greater than the sum of its parts.
Streams Messaging Manager (SMM)
Probably the most striking component of the Cloudera Kafka ecosystem is Cloudera Streams Messaging Manager (SMM) because it provides so much power across so many teams. SMM is a single monitoring/management dashboard that provides end-to-end visibility into how data moves across Kafka clusters between producers, brokers, topics, and consumers. It is a complete Kafka toolset that addresses the unique needs of DevOps, application development, platform operations, governance, and security teams.
Platform Operations and DevOps teams need the ability to create alerts to manage the service level agreement (SLA) of their applications. SMM provides rich alert management features for the critical components of a Kafka cluster including brokers, topics, consumers, and producers by making use of two key constructs.
- Alert Notifier: An alert notifier tells SMM what to do when a configured alert is triggered. Out-of-the-box notifiers include sending alerts to a configured email inbox, an HTTP endpoint, or a Kafka topic to integrate alerts with other systems used across the enterprise (e.g., ticketing or case creation systems). The user is also able to configure custom alert notifiers.
- Alert Policy: An alert can be defined for any Kafka entity: cluster, broker, topic, producer, or consumer. A set of metrics can be selected to define a series of simple alerts, while conditional operators can be used to compose complex alerts that monitor a variety of metrics across a number of entities. The alert policy is also configured with the notifier (described above) to invoke when the alert fires.
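The relationship between the two constructs above can be sketched as a small data model. This is an illustration of the idea only and is not SMM's API or configuration format: the class names, metric names, thresholds and the list-based notifier are all hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List

OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Condition:
    """One simple alert condition on a single metric."""

    metric: str
    op: str
    threshold: float

    def holds(self, metrics: Dict[str, float]) -> bool:
        return OPERATORS[self.op](metrics[self.metric], self.threshold)


@dataclass
class AlertPolicy:
    """A policy combines conditions (logical AND) and names the notifier to call."""

    name: str
    conditions: List[Condition]
    notifier: Callable[[str], None]

    def evaluate(self, metrics: Dict[str, float]) -> bool:
        fired = all(c.holds(metrics) for c in self.conditions)
        if fired:
            self.notifier(f"ALERT {self.name}: {metrics}")
        return fired


sent: List[str] = []  # stands in for an email inbox, HTTP endpoint or Kafka topic
policy = AlertPolicy(
    name="topic-lagging",
    conditions=[Condition("consumer_lag", ">", 1000), Condition("bytes_in_per_sec", "<", 10)],
    notifier=sent.append,
)
quiet = policy.evaluate({"consumer_lag": 50, "bytes_in_per_sec": 500})
fired = policy.evaluate({"consumer_lag": 5000, "bytes_in_per_sec": 2})
```

Keeping the notifier separate from the policy means the same delivery channel can be reused by many policies, which matches the split between the two constructs described above.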
As an example, the image below shows interactive visualizations that enable you to fully understand how data flows across Kafka clusters.
End-to-End Kafka Visualization
Figure: Cloudera Streams Messaging Manager Interface
Cloudera Streams Messaging
Cloudera delivers the most comprehensive streams messaging and management capabilities in the industry. It includes:
- Latest certified, secure, and governed Apache Kafka that provides the messaging backbone.
- Schema Registry for centralized schema management
- Kafka Streams for real-time analytics
- Kafka Connect for native connectivity with key data sources.
- Cruise Control for cluster management and monitoring
- Apache Ranger for rich access control and security
- Streams Messaging Manager for monitoring and management of enterprise Kafka
- Streams Replication Manager for disaster recovery and replication of enterprise Kafka clusters
- Process millions of messages per second with Apache Kafka.
- Adopt a hybrid cloud architecture for your streaming needs across any public cloud.
- Re-use schemas, define relationships between schemas, and manage schema versions with Schema Registry.
- Leverage the integration of Schema Registry across Kafka and Apache NiFi by using the same schemas from end-to-end.
- Optimize and auto-scale your clusters with Cruise Control.
- Cure “Kafka Blindness” by getting visibility into all your Kafka clusters with Streams Messaging Manager.
- Manage enterprise Kafka data effectively for active-active cluster replication and disaster recovery use cases.
Extend Monitoring/Management Capabilities with REST
The user interface is powered by first class REST services and all SMM capabilities are exposed as REST endpoints, making the product completely extensible. This is a developer and DevOps friendly way to integrate with other enterprise tools such as application performance monitoring and case/ticketing systems.
Track Data Lineage and Governance from Edge-To-Enterprise
Like other integrated components of the Cloudera DataFlow platform, SMM enjoys SDX’s unified data security and governance from edge environments across to your enterprise’s data center and cloud platforms. This includes Ranger for security and Apache Atlas for end-to-end data governance. With that, you have access to the metadata and metrics about every Kafka topic and can produce complete data lineage and audit trails, even across multiple Kafka hops.
The example below shows how a user can drill down from an edge sensor consumer (1) and launch a data lineage diagram (2) to directly see related flows across Kafka topics (3).
Figure: Kafka topics to Atlas Lineage
Integration with Schema Registry
Schema Registry, another key component of CDP, has been integrated with SMM, providing the ability to view, create and modify the schema associated with any given Kafka topic. It allows the user to define schemas for a given Kafka topic and provides the following key benefits:
- Data Governance:
Provide reusable schema (centralized registry), define relationships between schemas (version management), and enable generic format conversion and generic routing (schema validation).
- Operational Efficiency:
Avoid attaching schemas to every piece of data (centralized registry), enable consumers and producers to evolve at different rates (version management), and ensure data quality (schema validation).
- Topic Lifecycle Management
SMM enables users to create, update and delete topics directly through the user interface as well as via REST services. Topics can be created as a function of availability characteristics (replication factor, minimum in-sync replicas, etc.) or with custom settings. These operations are fully integrated with Kafka Ranger policies such that only authorized users can perform these topic lifecycle management actions.
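The availability characteristics mentioned above (replication factor and minimum in-sync replicas) interact in a simple way that is worth checking before a topic is created. The sketch below is a hypothetical helper, not SMM or Kafka code; the validation rules are sanity checks chosen for this illustration, and the topic name and sizes are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSpec:
    """Hypothetical description of a topic to be created."""

    name: str
    partitions: int
    replication_factor: int
    min_insync_replicas: int

    def validate(self, broker_count: int) -> None:
        """Raise ValueError if the settings cannot work on a cluster of this size."""
        if self.partitions < 1:
            raise ValueError("a topic needs at least one partition")
        if not 1 <= self.replication_factor <= broker_count:
            raise ValueError("replication factor must be between 1 and the broker count")
        if not 1 <= self.min_insync_replicas <= self.replication_factor:
            raise ValueError("min in-sync replicas must be between 1 and the replication factor")

    def tolerated_replica_losses(self) -> int:
        """Replicas that can be lost while fully acknowledged writes are still accepted."""
        return self.replication_factor - self.min_insync_replicas


orders = TopicSpec(name="orders", partitions=6, replication_factor=3, min_insync_replicas=2)
orders.validate(broker_count=3)
tolerated = orders.tolerated_replica_losses()
```

With a replication factor of 3 and a minimum of 2 in-sync replicas, one replica can be unavailable before producers that require full acknowledgement start to be rejected.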
Cloudera Streaming Analytics (CSA) offers real-time stream processing and streaming analytics powered by Apache Flink. Flink implemented on CDP provides a flexible streaming solution with low latency that can scale to large throughput and state. In addition to Flink, CSA includes SQL Stream Builder, which offers a data analytics experience using SQL queries on your data streams.
Key features of Cloudera Streaming Analytics:
- SQL Stream Builder
- SQL Stream Builder is a job management interface to compose and execute Streaming SQL on streams, as well as to create durable data APIs for the results.
Implementing Flink on the Cloudera Platform allows you to easily integrate with Runtime components and have all the advantages of cluster and service management with Cloudera Manager.
For streaming analytics, CSA fits into a complete streaming platform augmented by Apache Kafka, Schema Registry, and Streams Messaging Manager in the Cloudera Runtime stack.
CSA offers Kafka, HBase, HDFS, Kudu and Hive as connectors to choose from based on the requirements of your application deployment.
Within CSA, Kafka Metrics Reporter, Streams Messaging Manager and the reworked Flink Dashboard help you monitor and troubleshoot your Flink applications.
The log aggregation framework and job tester framework in CSA also enable you to create more reliable Flink applications for production.
Cloudera Machine Learning Overview
Machine learning has become one of the most critical capabilities for modern businesses to grow and stay competitive today. From automating internal processes to optimizing the design, creation, and marketing processes behind virtually every product consumed, ML models have permeated almost every aspect of our work and personal lives. Cloudera Machine Learning (CML) is Cloudera’s new cloud-native machine learning service, built for CDP. The CML service provisions clusters, also known as ML workspaces, that run natively on Kubernetes. Each ML workspace enables teams of data scientists to develop, test, train, and ultimately deploy machine learning models for building predictive applications, all on the data under management within the enterprise data cloud. ML workspaces are ephemeral, allowing you to create and delete them on demand. ML workspaces support fully containerized execution of Python, R, Scala, and Spark workloads through flexible and extensible engines.
- Seamless portability across private cloud, public cloud, and hybrid cloud powered by Kubernetes.
- Rapid cloud provisioning and autoscaling.
- Fully containerized workloads, including Python, R, and Spark-on-Kubernetes, for scale-out data engineering and machine learning with seamless distributed dependency management.
- High performance deep learning with distributed GPU scheduling and training.
- Secure data access across HDFS, cloud object stores, and external databases.
The cloud offers many advantages for unpredictable and heterogeneous workloads, but there are two challenges:
- data is often spread across multiple clouds and on-premises systems, and
- existing products only cover parts of the machine learning lifecycle.
Cloudera Machine Learning directly addresses both these issues. It’s built for the agility and power of cloud computing, but isn’t limited to any one provider or data source. And it is a comprehensive platform to collaboratively build and deploy machine learning capabilities at scale. CML gives you the power to transform your business with machine learning and AI. CML users are:
- Data management and data science executives at large enterprises who want to empower teams to develop and deploy machine learning at scale.
- Data scientist developers (using open-source languages like Python, R and Scala) who want fast access to compute and corporate data, the ability to work collaboratively and share, and an agile path to production model deployment.
- IT architects and administrators who need a scalable platform to enable data scientists in the face of shifting cloud strategies while maintaining security, governance and compliance. They can easily provision environments and enable resource scaling so they, and the teams they support, can spend less time on infrastructure and more time on innovation.