CloudSQL to BigQuery Dataflow Pipeline in GCP

Introduction

Moving data between Cloud SQL and BigQuery is fairly straightforward with federated queries. However, federated queries are not available for Cloud SQL instances created with a private IP address, which might be the only option in many organisations due to security constraints. As an alternative, a Dataflow pipeline can be built to do the job. Moreover, there is a template readily available (JDBC to BigQuery) which, in an ideal world, would have made this approach easy as well. However, there are some bits and pieces which aren't quite obvious. At least, they weren't for me: I had to spend a few days building a working pipeline and in the end I had to ask a GCP expert for help. In this blog article I'm trying to address these issues to make life easier for other people facing the same challenge. In my example I'm using Postgres Cloud SQL, although I would expect the MySQL case to be very similar, if not identical.

Prerequisites

  • A Postgres Cloud SQL instance with private IP and a database in it (this might require enabling the Compute Engine API if it isn't enabled already)
  • A BigQuery dataset and a table in it (for simplicity, make it with just one column X of type NUMERIC)
  • Enabled Dataflow API
  • A storage bucket
  • Maven (either on your local machine or on a Compute Engine instance) for building the jars (example provisioning commands are sketched right after this list)
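
If you prefer doing the setup from the command line, the prerequisites can be provisioned along these lines. This is only a rough sketch: every resource name, region and size below is a placeholder, and the extra networking setup a private-IP instance needs (VPC peering / an allocated range for Service Networking) is omitted.

    # Enable the APIs used later on
    gcloud services enable compute.googleapis.com sqladmin.googleapis.com dataflow.googleapis.com

    # A private-IP Postgres instance and a database in it
    gcloud sql instances create my-pg-instance --database-version=POSTGRES_13 \
        --region=europe-west2 --cpu=2 --memory=8GB \
        --network=default --no-assign-ip
    gcloud sql databases create mydb --instance=my-pg-instance

    # A BigQuery dataset with a single-column table, plus a storage bucket
    bq mk mydataset
    bq mk --table mydataset.mytable x:NUMERIC
    gsutil mb gs://my-dataflow-bucket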

Jar files

The Dataflow template is "JDBC to BigQuery", not "CloudSQL to BigQuery", so you're going to need JDBC drivers to connect to Cloud SQL. The important part is that you need both the so-called "socket factory" JDBC driver and the actual JDBC driver (e.g. the Postgres one if that's what your Cloud SQL database is).

The "socket factory" jar can be built from the source code available here: https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory. The build instructions in the readme file are fairly trivial: you just need Maven installed, and then you copy and paste the build line from the readme into your terminal window. The result will appear in the target directory; be sure to pick the jar with the required dependencies (mine was called postgres-socket-factory-1.3.1-SNAPSHOT-jar-with-dependencies.jar). The appropriate JDBC driver jar for Postgres can be found at https://jdbc.postgresql.org/download.html.
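
For reference, the build boiled down to a single Maven command along the lines of the one below (the profile name is from memory and may have changed since, so check the current readme first):

    # Run inside the cloned repository (see the readme for the exact module/profile)
    mvn -P jar-with-dependencies clean package -DskipTests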

Once you have the jar files, you'll need to upload them to your storage bucket, using either gsutil or the web console as you prefer.
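
With gsutil it is a one-liner per jar (bucket path and driver version below are placeholders):

    gsutil cp target/postgres-socket-factory-1.3.1-SNAPSHOT-jar-with-dependencies.jar gs://my-dataflow-bucket/jars/
    gsutil cp postgresql-42.2.18.jar gs://my-dataflow-bucket/jars/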

Creating the job

At this point, you should be able to go to the Dataflow page in the Cloud console and create the job as shown in the screenshot below.

You might need to click on “Show optional parameters” to be able to specify the username and the password for the database connection.
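
If you'd rather submit the job from the command line than through the console form, the equivalent gcloud call looks roughly like the sketch below. The parameter names are those of the classic Jdbc_to_BigQuery template at the time of writing and may differ in newer versions, and every name, IP address and credential is a placeholder. Note the ^~^ prefix: it switches gcloud's list delimiter to "~" so that the comma separating the two driver jars is not mistaken for a parameter separator (see "gcloud topic escaping").

    PARAMS="^~^driverJars=gs://my-dataflow-bucket/jars/postgres-socket-factory-1.3.1-SNAPSHOT-jar-with-dependencies.jar,gs://my-dataflow-bucket/jars/postgresql-42.2.18.jar"
    PARAMS="$PARAMS~driverClassName=org.postgresql.Driver"
    PARAMS="$PARAMS~connectionURL=jdbc:postgresql://10.1.2.3:5432/mydb"
    PARAMS="$PARAMS~query=select x from my_table"
    PARAMS="$PARAMS~outputTable=my-project:mydataset.mytable"
    PARAMS="$PARAMS~bigQueryLoadingTemporaryDirectory=gs://my-dataflow-bucket/tmp"
    PARAMS="$PARAMS~username=dataflow_user"
    PARAMS="$PARAMS~password=changeme"

    gcloud dataflow jobs run cloudsql-to-bq \
        --region=europe-west2 \
        --gcs-location=gs://dataflow-templates/latest/Jdbc_to_BigQuery \
        --parameters="$PARAMS"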

Details to pay attention to

Coming from an Oracle database background, I sometimes tend to resolve problems by experimenting a little bit, letting error messages guide me when I'm doing something wrong. In the cloud world, error messages aren't always as helpful, so this approach isn't necessarily productive, and attention to detail can save you a lot of time. In particular, pay attention to the following:

  1. The JDBC url needs to be correctly formed. The format that worked for me was jdbc:postgresql://<database_instance_IP_address>:5432/<database_name>. Interestingly, the "Cloud SQL socket factory" documentation mentions two url formats, a brief one and a complete one, but neither one has the form above and neither worked for me
  2. With Postgres, creating an instance doesn’t mean you have a database — don’t get caught out by that. Create the database and use its name in the final portion of the JDBC url (not the instance name)
  3. Don't forget to include both jars, the socket factory one and the actual driver one. Use a comma to separate their paths, as the hint in the template suggests
  4. Be sure to include the dataset name with the BigQuery table name and use the correct format (see the screenshot)
  5. The query in the source database should produce the same schema as the target table in BigQuery. Basically, this means that the column names must match (a quick way to check both sides is sketched right after this list)
  6. Forgetting to specify the database username and password when submitting the job is likely to cause it to fail (although in this case at least the error message should be clear enough to make you realise that).
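
As a quick sanity check for point 5, you can compare the two schemas directly, assuming you can reach the instance's private IP from inside the VPC (for example from a Compute Engine VM); all names below are placeholders:

    # Source side: the columns your query will return
    psql "host=10.1.2.3 port=5432 dbname=mydb user=dataflow_user" -c '\d my_table'

    # Target side: the schema of the BigQuery table the template loads into
    bq show --schema --format=prettyjson mydataset.mytable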

Troubleshooting

If the steps above don't work as expected, try checking APIs and permissions first. If that's not it and the error message is not particularly clear, try looking in the temp/staging directories specified in the job parameters; you might find useful clues there.
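
The job's own diagnostics are worth a look too; something along these lines (region, job id and bucket are placeholders) shows the job status and whatever the workers left behind in the temp/staging area:

    gcloud dataflow jobs list --region=europe-west2
    gcloud dataflow jobs describe JOB_ID --region=europe-west2
    gsutil ls -r gs://my-dataflow-bucket/tmp/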

Performance monitoring and anomalies

Tech bloggers (myself included) tend to like writing about big issues, with massive impact, the company's reputation (or in extreme cases even its very existence) at stake, etc. This adds drama to the story (and, as an added bonus, helps us feel more important). I think it can also lead to a certain bias, making the IT community believe that only major issues matter. Whenever you spot a relatively harmless anomaly and try to find out more about it on a DBA forum, "is there an actual problem you're trying to solve" is a very likely response. And if you honestly answer "not really", you can expect a lecture on "obsessive tuning disorder" and a suggestion to stick to a simple principle: "if it ain't broke, don't fix it". I think this mentality started out as a sound reaction to some inquisitive minds trying to solve non-issues and occasionally creating issues out of nothing instead. When taken too far, however, this attitude becomes dangerous. Anomalies are important even without an immediate impact. Sometimes they are important even without any impact at all. In this post, I'd like to illustrate this with an example.

Continue reading “Performance monitoring and anomalies”

Memory fragmentation via inode cache growth

In my last blog post I covered some details of our recent battle with memory fragmentation problems on an OL6 server (an Exadata compute node). It was mostly focused on page cache growth, which was the main scenario. However, in addition to that, there was also a secondary scenario with a completely different mechanism, and I will describe it in this post.

Continue reading “Memory fragmentation via inode cache growth”

Memory fragmentation via buffered file I/O

Last year I spent quite some time tackling various memory fragmentation issues on an Exadata cluster (I've described some of my experiences here and here). In the end everything was resolved and the symptoms went away, only to come back in a different form a few months later. First we had a minor episode some months ago: some applications experienced timeouts when trying to connect, and there was a massive spike in the Load Average metric, with hundreds of processes in the "D" state stuck on rtnetlink_rcv and rtnl_lock. Then everything became stable again, until a few weeks ago, when the same symptoms came back, but the impact was much more severe, with node evictions, reboots and failovers causing serious disruption for the application.

Continue reading “Memory fragmentation via buffered file I/O”

Installing R shiny server on OCI compute instance (always free tier)

Introduction

Oracle Cloud Infrastructure (OCI) is the next-generation cloud platform run by Oracle. Like some of its competitors, Oracle offers an always free tier where you can access a range of cloud products and services without ever having to pay anything. And while you will need to provide credit card details to prove that you are not a bot, your card is not going to be charged unless you explicitly upgrade to a paid account.

Continue reading “Installing R shiny server on OCI compute instance (always free tier)”

Slow insert values with array binding after JDBC upgrade to 19c

A very brief note to alert the community of a nasty JDBC bug affecting INSERT performance. It was noticed by our Java developers after upgrading their JDBC driver from 11.2.0.3.0 to version 19.3.0.0.190416DBRU. They were inserting data in batches of 5,000 rows at a time, 250,000 rows in total, and the time to process the entire load went up from 16 to 102 minutes.

Continue reading “Slow insert values with array binding after JDBC upgrade to 19c”

Offloading regex parsing using external table preprocessors

Loading data from flat files into an Oracle database is a very common task. Oracle’s implementation of external tables is fantastic, and in many cases it simplifies the job to such a degree that the developer is left with very little to do: just write a “create table” statement with a few additional details about the file’s structure and that’s pretty much it. Even if the information in the file is not in a simple format (like comma-separated or tab-delimited), this doesn’t make things much more complicated, as you can e.g. load the raw text and then use regex functions to process it.

So I've been using this feature in a broad variety of situations (some of which I covered in this blog, e.g. here), and one problem that I occasionally run into is that performance isn't always great. For example, here is the DDL of what I use to parse listener log files:

Continue reading “Offloading regex parsing using external table preprocessors”