Part I: Data processing pipelines with Spring Cloud Data Flow on Oracle Cloud

Abhishek Gupta · Oracle Developers · Jan 22, 2018


For a master table of contents of blog posts on microservices topics, please refer to https://medium.com/oracledevs/bunch-of-microservices-related-blogs-57b5f1f062e5

Asynchronous communication/messaging is a key pattern for building loosely coupled and scalable apps (including microservices) in the cloud. One of the major classes of problems this pattern solves is real-time data processing

What if you could have a platform/framework built specifically for this? And even better, what if you could deploy & operate it on the cloud?

This is the first of a two-part blog series on Spring Cloud Data Flow on Oracle Cloud

  • Part 1 will give you an introduction and demonstrate how to deploy the Spring Cloud Data Flow server on Oracle Application Container Cloud including other components
  • Part 2 will demonstrate how you can build stream processing pipelines using the foundation you set up in Part 1

To be specific, Part 1 will cover

  • Gentle introduction to Spring Cloud Data Flow
  • the secret sauce — Spring Cloud Data Flow on Oracle Application Container Cloud
  • Infrastructure setup: Oracle Event Hub Cloud (Kafka) is used as the messaging middleware, and Oracle MySQL Cloud serves as the persistent data store for the Spring Cloud Data Flow setup

Hello “Spring Cloud Data Flow”

Here is a quick intro — please refer to the documentation for details

TL;DR — It is a framework/toolkit for building data processing pipelines

  • Types of apps — long lived stream processing, data integration and short-lived tasks
  • Pipelines are nothing but Spring Boot apps which make use of Spring Cloud Stream or Spring Cloud Task

By the way, here is a blog on how to use messaging based microservices with vanilla Spring Cloud Stream & Kafka on Oracle Cloud

  • Event Bus/Middleware — Support for Kafka and RabbitMQ (we will use Kafka via Oracle Event Hub Cloud)
  • Infrastructure — the pipelines themselves can be deployed to a variety of runtimes (Kubernetes, Mesos etc.) whose implementations are pluggable

One such custom implementation for Oracle Application Container Cloud will be covered in the next section

  • Interfaces — you can use the dashboard (graphical editor), REST API or CLI to work with Spring Cloud Data Flow
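
For example, using the Data Flow shell (one of the CLI options), creating and deploying a simple stream comes down to a couple of commands. The sketch below is purely illustrative: "demo" is a made-up name, and it assumes the out-of-the-box http and log starter apps are already registered

    # illustrative commands in the Spring Cloud Data Flow shell
    dataflow:> stream create --name demo --definition "http | log"
    dataflow:> stream deploy --name demo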

Let’s try to understand the solution and its components

Spring Cloud Data Flow on Oracle Application Container Cloud

Here is a high level solution architecture which involves all the components. You will encounter each one of them as you read on…

Solution architecture

Oracle Application Container Cloud has a two-fold role in this case

  • It serves as the platform for running the Spring Cloud Data Flow server itself
  • It also doubles as a runtime for the pipelines which you build using Spring Cloud Data Flow — this is the interesting part!

Spring Cloud Data Flow Server

The server module is a Spring app with an embedded servlet container (e.g. Tomcat). As you will see in the upcoming sections, this can simply be run as a fat JAR on top of the Java SE runtime support in Oracle Application Container Cloud
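
To give an idea, running a fat JAR like this needs nothing more than a JVM; locally it boils down to something like the line below (the JAR name matches the artifact produced by the build covered later in this post)

    # run the Data Flow server fat JAR on a plain JVM
    java -jar spring-cloud-dataflow-server-accs-1.0.0-SNAPSHOT.jar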

Spring Cloud Data Flow pipelines

As mentioned above, the data processing pipelines created using Spring Cloud Data Flow are just Spring Boot apps which need to run somewhere. This runtime portion is abstracted in the form of the Spring Cloud Deployer SPI, which encapsulates the implementation for a specific runtime, e.g. the local JVM (of the Data Flow server), Kubernetes, Apache Mesos etc.

Here is a snippet from the Spring Cloud Data Flow documentation which illustrates this concept

Spring Cloud Deployer abstraction for various runtimes

There is a Spring Cloud Deployer implementation specific to Oracle Application Container Cloud as well

Although this is a work in progress and evolving, in its current state it can be used to operate Spring Cloud Data Flow pipelines — you will see it in action in Part 2 of this blog

Message broker: Oracle Event Hub Cloud

The individual pipelines need an underlying messaging layer for asynchronous communication — Spring Cloud Data Flow supports Apache Kafka and RabbitMQ

We will be using Oracle Event Hub Cloud (Managed Kafka) in this case

Persistent store: Oracle MySQL Cloud

By default, Spring Cloud Data Flow stores its state (e.g. stream and task definitions) in an in-memory database, but it supports other RDBMSes as well

We will leverage Oracle MySQL Cloud as the persistent data store for the Spring Cloud Data Flow server
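
As a rough sketch, pointing the server at MySQL is just the standard Spring Boot datasource configuration. The values below are placeholders (and the driver class is an assumption); on Oracle Application Container Cloud they are derived from the MySQL service binding, as you will see in the deployment.json section

    # placeholder values only; the real ones come from the MySQL Cloud service binding
    spring.datasource.url=jdbc:mysql://<mysql-host>:3306/<database>
    spring.datasource.username=<username>
    spring.datasource.password=<password>
    spring.datasource.driver-class-name=com.mysql.jdbc.Driver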

Maven repository

Although you will see this in action in Part 2 of the blog, for now it's sufficient to understand that Spring Cloud Data Flow uses Maven as one of its sources for the applications which need to be deployed as a part of the pipelines which you build — more details here and here
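
To make this concrete, registering an application with the Data Flow server by its Maven coordinates looks roughly like the command below; the coordinates refer to one of the pre-built Kafka stream app starters, and the exact artifact/version you use may differ

    # illustrative app registration (Data Flow shell) using Maven coordinates
    dataflow:> app register --name http --type source --uri maven://org.springframework.cloud.stream.app:http-source-kafka-10:1.3.1.RELEASE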

Infrastructure setup

This section provides a summary of how to set up the infrastructure components required for the Spring Cloud Data Flow setup

  • Oracle Event Hub Cloud instance
  • Oracle MySQL Cloud instance

Oracle Event Hub Cloud (Kafka broker)

The Kafka cluster topology used in this case is relatively simple, i.e. a single broker co-located with Zookeeper. You can opt for a topology specific to your needs, e.g. an HA deployment with a 5-node Kafka cluster and 3 Zookeeper nodes

Please refer to the documentation for further details on topology and the detailed installation process (hint: it's straightforward!)

Oracle Event Hub instance

Creating custom access rule

You would need to create a custom Access Rule to open port 2181 on the Kafka Server VM on Oracle Event Hub Cloud — details here

Oracle Application Container Cloud does not need port 6667 (Kafka broker) to be opened since the secure connectivity is taken care of by the service binding
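
Once the access rule is in place, a quick sanity check from a machine that can reach the VM can be done with the standard Kafka command line tools (the host name below is a placeholder)

    # list topics via Zookeeper to confirm connectivity on port 2181
    kafka-topics.sh --zookeeper <event-hub-vm-host>:2181 --list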

Oracle MySQL Cloud

Provision a MySQL database instance — you can refer to the detailed documentation here

Oracle MySQL Cloud instance

Now that we have the infrastructure foundation, it's time to deploy the Data Flow server on the cloud

Build & deployment

Build Spring Cloud Dataflow from source

This includes the SPI implementation specific to Oracle Application Container Cloud
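
The build itself is a regular Maven build; a sketch is shown below (the repository location is a placeholder for wherever the source containing the ACCS deployer lives)

    # clone the source that contains the ACCS-specific server module, then build
    git clone <repository-with-accs-deployer> spring-cloud-dataflow
    cd spring-cloud-dataflow
    mvn clean install -DskipTests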

This will create spring-cloud-dataflow-server-accs-1.0.0-SNAPSHOT.jar under the spring-cloud-dataflow-server-accs\target folder

Spring Cloud Data Flow server (fat) JAR

Zip it up:

    zip scdf.zip spring-cloud-dataflow-server-accs\target\spring-cloud-dataflow-server-accs-1.0.0-SNAPSHOT.jar

Edit metadata files

Configure the metadata files as per your setup

manifest.json

manifest.json for Data Flow server on ACCS

Here is a summary of the key attributes

  • maven.remote-repositories.repo1.url — Maven repository URL. Use maven.remote-repositories.repo1.auth.username and maven.remote-repositories.repo1.auth.password if applicable

We are using the Spring Maven repo in this case. More in part 2

  • spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.brokers — leave this unchanged as it will be picked up from the environment variable
  • spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.zkNode — Enter value for Event Hub Zookeeper host and port
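
For orientation, a manifest.json for this setup would be roughly along the lines of the snippet below. Treat it as a sketch rather than the exact file: the Maven repository URL is illustrative, the Zookeeper host is a placeholder, and the Kafka brokers property is deliberately omitted since it is resolved from the Event Hub service binding environment variable mentioned above

    {
      "runtime": {
        "majorVersion": "8"
      },
      "command": "java -jar spring-cloud-dataflow-server-accs-1.0.0-SNAPSHOT.jar --maven.remote-repositories.repo1.url=https://repo.spring.io/libs-release --spring.cloud.dataflow.applicationProperties.stream.spring.cloud.stream.kafka.binder.zkNode=<zookeeper-host>:2181",
      "notes": "Spring Cloud Data Flow server on Oracle Application Container Cloud"
    }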

deployment.json

deployment.json for Data Flow server on ACCS

Here is a summary of the key attributes

  • ACCS_URL — use emea instead of us in case of Europe data center
  • spring_datasource_username — leave this unchanged as it will be picked up from the environment variable
  • spring_datasource_password — leave this unchanged as it will be picked up from the environment variable
  • spring_datasource_driver-class-name — leave this unchanged
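
Similarly, here is a sketch of what deployment.json could look like, following the usual Oracle Application Container Cloud structure (memory, instances, environment variables and service bindings). The service binding entries and the MYSQLCS_* environment variable references are assumptions based on the attributes listed above; check the ACCS documentation for the exact fields

    {
      "memory": "2G",
      "instances": "1",
      "environment": {
        "ACCS_URL": "<REST endpoint for your Application Container Cloud region>",
        "spring_datasource_username": "$MYSQLCS_USER_NAME",
        "spring_datasource_password": "$MYSQLCS_USER_PASSWORD",
        "spring_datasource_driver-class-name": "com.mysql.jdbc.Driver"
      },
      "services": [
        {
          "identifier": "KafkaBinding",
          "type": "OEHCS",
          "name": "<your Event Hub Cloud instance>"
        },
        {
          "identifier": "MySQLBinding",
          "type": "MYSQLCS",
          "name": "<your MySQL Cloud instance>"
        }
      ]
    }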

Push to cloud

With Oracle Application Container Cloud, you have multiple options in terms of deploying your applications. This blog will leverage PSM CLI, which is a powerful command line interface for managing Oracle Cloud services

Other deployment options include the REST API, Oracle Developer Cloud and, of course, the console/UI

Download and setup PSM CLI on your machine (using psm setup) — details here

Deploy the Spring Cloud Dataflow server:

    psm accs push -n SpringCloudDataflowServer -r java -s hourly -m manifest.json -d deployment.json -p scdf.zip

Once executed, an asynchronous process is kicked off and the CLI returns its Job ID for you to track the application creation

Check your application

Access the application and check the instance and topology information

Application overview

Notice the Service Bindings for Event Hub and MySQL cloud

Service Bindings — Event Hub and MySQL

Test drive

Access the Spring Cloud Data Flow dashboard — navigate to the URL which you see on your application detail screen e.g. https://SpringCloudDataflowServer-mydomain.apaas.us2.oraclecloud.com/dashboard
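
The same base URL also exposes the server's REST API, so a quick sanity check from the command line could look like this (using the example URL above)

    # query the Data Flow server's "about" endpoint for version/feature info
    curl https://SpringCloudDataflowServer-mydomain.apaas.us2.oraclecloud.com/about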

Change the deployment topology

You can easily modify the topology of your Spring Cloud Data Flow setup

Change the topology as required
  • Scale up/down — Increase/decrease the memory (RAM) allocated to the Data Flow server
  • Scale in/out — increase/decrease the number of instances for HA and performance. This is easy since the persistent state is stored in MySQL and the app itself is stateless

Summary

That’s it for Part I where we covered

  • basic Spring Cloud Data Flow concepts
  • deployment of a Spring Cloud Data Flow server on Oracle Application Container Cloud along with its dependent components, which included
  • Oracle Event Hub Cloud as the Kafka based messaging layer, and
  • Oracle MySQL Cloud as the persistent RDBMS store

The next installment will demonstrate things in action as we build Kafka based stream processing pipelines using Spring Cloud Data Flow on Oracle Application Container Cloud.

Don’t forget to…

  • check out the tutorials for Oracle Application Container Cloud — there is something for every runtime!
  • check out other blogs on Application Container Cloud

Cheers!

The views expressed in this post are my own and do not necessarily reflect the views of Oracle.
