

If you haven't heard of WePay, we provide payment solutions for platform businesses through our API. For this talk, I'm going to be talking about database streaming. We live in a world where we expect everything to be streamed, like our music is streamed. I want to argue that the data in your data warehouse should not be treated as a second-class citizen. We should allow everything to be streamed in real time so we can access the data as soon as it arrives in the database. This talk is about our journey at WePay going from an ETL data pipeline to a streaming-based, real-time pipeline.

The talk is going to be broken down into three sections. We're first going to go over what our previous ETL process looked like and some of the pain points we were running into. We're also going to introduce change data capture, which is the mechanism we use to stream data out of our database. Next, we're going to take a look at a real-world example, which is how we're actually streaming data from MySQL into our data warehouse. And finally, we're going to go a little experimental and take a look at some of the ongoing work we're doing on streaming Cassandra into BigQuery, which, as I mentioned, is our data warehouse.

BigQuery is basically Google's cloud data warehouse; for those of you coming from the AWS side, it's the equivalent of Redshift. It uses ANSI-compliant SQL as its core language, which makes it really easy for developers and engineers to pick up. It supports nested and repeated data structures for things like lists or structs, and even geospatial data types, which is actually very useful for CDC, as you will see later on. It also has a virtual view feature that lets you create views on top of the base tables. Because these views are not materialized, when you query a view you're essentially querying the underlying table, which allows you to access real-time data even through views. That's another feature we're leveraging very heavily at WePay for our streaming pipeline, and we'll go into it later on.
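
To make the virtual view idea a bit more concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, and table names, and the "latest row per key" pattern itself, are hypothetical illustrations rather than our actual schema; the point is only that the view is not materialized, so querying it always reads the current contents of the base table.

```python
# Minimal sketch: creating a virtual (non-materialized) view in BigQuery.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Because the view is not materialized, querying it re-reads the base table,
# so it always reflects the most recently streamed rows.
client.query(
    """
    CREATE VIEW IF NOT EXISTS analytics.payments_latest AS
    SELECT * EXCEPT (row_rank)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_time DESC) AS row_rank
      FROM analytics.payments_changelog
    )
    WHERE row_rank = 1
    """
).result()
```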

So at WePay, we use a microservice architecture. Most of our microservices are stateful, and the state is typically stored in a MySQL database. We use Airflow as the tool to orchestrate our data pipelines. For anyone who hasn't heard of Airflow, you can think of it as cron on steroids, designed for data pipelines and complex workflows. The way we're using Airflow is basically to periodically poll the MySQL database for changes. We detect these changes by looking at the modified-time column in each table, and if the modified time has changed within the most recent interval, we upload those rows into BigQuery.
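
As a rough illustration of that polling approach, the sketch below shows what such a job could look like as an Airflow DAG. It is a minimal, hypothetical example (the connection ID, table, and columns are made up), not our actual pipeline code: on each 15-minute run it selects the rows whose modified time falls in the last interval and appends them to a BigQuery table.

```python
# A minimal, Airflow 1.x-style sketch of the polling job described above.
# Connection IDs, table names, and the column list are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator
from google.cloud import bigquery

COLUMNS = ["id", "amount", "status", "modified_time"]  # hypothetical schema


def upload_changed_rows(**context):
    """Pull rows modified during the last interval and append them to BigQuery."""
    window_end = context["execution_date"]
    window_start = window_end - timedelta(minutes=15)

    mysql = MySqlHook(mysql_conn_id="payments_db")  # hypothetical connection ID
    rows = mysql.get_records(
        "SELECT id, amount, status, modified_time FROM payments "
        "WHERE modified_time >= %s AND modified_time < %s",
        parameters=(window_start, window_end),
    )

    # Append the changed rows to the warehouse table (schema management omitted).
    payload = [
        {col: (val.isoformat() if hasattr(val, "isoformat") else val)
         for col, val in zip(COLUMNS, row)}
        for row in rows
    ]
    errors = bigquery.Client().insert_rows_json("my-project.analytics.payments", payload)
    if errors:
        raise RuntimeError(errors)


dag = DAG(
    dag_id="payments_mysql_to_bigquery",
    start_date=datetime(2018, 1, 1),
    schedule_interval=timedelta(minutes=15),  # the "push the limit" interval
    catchup=False,
)

PythonOperator(
    task_id="upload_changed_rows",
    python_callable=upload_changed_rows,
    provide_context=True,
    dag=dag,
)
```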

With this approach, though, we're starting to hit a lot of limitations and operational overhead. The first problem, which ties back to the introduction, is that it has very high latency: the data won't actually arrive in BigQuery until much later. For some of our jobs, we try to push the limit to once every 15 minutes, so the job runs in 15-minute intervals. But then we run into inconsistency: an analyst may be trying to do a join in BigQuery where one of the tables is being uploaded on an hourly or daily basis and another is being uploaded every 15 minutes, so the data becomes inconsistent, and the question becomes, why is it in this table but not in that one? The second problem is that, because of the way we use Airflow, we're creating one job for every single table. In Airflow, a job is called a DAG, or directed acyclic graph. That is a whole lot of operational configuration, as well as overhead when it comes to monitoring.
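
To see why one DAG per table adds up, here is a hypothetical sketch of that pattern; the table names and intervals are invented, but they show how each table ends up with its own independently scheduled pipeline to configure and monitor, and also how two tables an analyst wants to join can end up at different freshness.

```python
# Hypothetical illustration of the "one job per table" setup described above.
# Table names and intervals are made up for the example.
from datetime import datetime, timedelta

from airflow import DAG

TABLE_INTERVALS = {
    "payments": timedelta(minutes=15),  # pushed to the 15-minute limit
    "merchants": timedelta(hours=1),    # hourly
    "payouts": timedelta(days=1),       # daily
}

for table, interval in TABLE_INTERVALS.items():
    dag = DAG(
        dag_id=f"{table}_to_bigquery",
        start_date=datetime(2018, 1, 1),
        schedule_interval=interval,
        catchup=False,
    )
    # Each DAG still needs its own extract/upload tasks, retries, and alerting,
    # so every new table adds another pipeline to configure and monitor.
    globals()[dag.dag_id] = dag  # expose each DAG at module level so Airflow picks it up
```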
