Swiftflow pipeline / Part 2 – Dataflow to BigQuery (tabular format)


Goal

As in the previous tutorial Swiftflow Pipeline / Part 1 – Dataflow To BigQuery (nested format), the goal here is to create and execute a real-time pipeline tailored for exploratory analytics on SWIFT MT messages, using Pub/Sub, Dataflow, Swiftflow and BigQuery. The difference is the format of the output: a tabular format (flattened and exploded), whereas the previous tutorial produced a nested format (JSON string).

At the end of this tutorial, you will run a streaming data pipeline that continuously writes data into BigQuery in a format ready for exploratory analytics with Standard SQL.

Content

This tutorial shows how to ingest into BigQuery a real-time feed of SWIFT MT messages parsed, enriched and structured via Swiftflow. It covers the following steps:

  • Read a real-time feed of SWIFT MT raw messages (string) from Pub/Sub
  • Transform those messages to enriched JSON objects via a Dataflow job, defined using the Apache Beam SDK in Python, and which sends requests to the Swiftflow API
  • Transform the returned JSON objects into exploded/flattened data and ingest them into BigQuery with dynamic schema evolution
  • Show some examples of SQL queries

Motivation

As discussed in the previous tutorial Swiftflow Pipeline / Part 1 – Dataflow To BigQuery (nested format), the nested format comes with a few limitations: it requires precise requirements up front, it leads to complex queries, it limits query performance and it gives only a partial view of the data.

The proposed approach is holistic. It transforms all the fields found in the message into dedicated columns of a table, while keeping full track of their initial context in the raw message (section, qualifier, tag, etc.) plus their position in case they appear as repeated elements (e.g. a list of multiple entities, accounts, etc.). The entire context information is stored in the column names.

This opens the way to data consumption in Standard SQL and related applications: Business Intelligence tools (usually natively tailored for SQL) or downstream Machine Learning applications. This works on all the message types and all the fields at the same time, which can enable the definition of cross-business transactions.

Pre-requisites

Please run the previous tutorial prior to this one. The following concentrates on the new features built on top of it.

Pipeline creation

Schema

The only difference is that no schema is required here: it is derived by BigQuery thanks to the schema auto-detection feature. The table width dynamically evolves over time as more messages and new fields are ingested by the pipeline, stabilizing once all the variants have been encountered. Therefore, no SCHEMA variable is needed.
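Since the schema evolves with the data, it can be handy to check how wide the table has become at any point in time. Below is a minimal sketch with the BigQuery Python client, assuming the table queried later in this tutorial (swiftflow-pipeline-poc.tutorial.mt_flat):

# Minimal sketch: inspect how wide the auto-detected schema has become.
# Assumes the table created by the pipeline and queried later in this tutorial.
from google.cloud import bigquery

client = bigquery.Client(project="swiftflow-pipeline-poc")
table = client.get_table("swiftflow-pipeline-poc.tutorial.mt_flat")

print(f"Current number of columns: {len(table.schema)}")
for field in table.schema[:10]:
    print(field.name, field.field_type)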

Swiftflow utils

We need two additional classes here. The class MtToDataframe handles the requests to the Swiftflow endpoint, plus the logic used to explode and flatten the resulting JSON into a pandas dataframe using the flatsplode package [1]. The second class, WriteDataframeToBQ, relies on an existing proposal [2] and takes care of the BigQuery client initialization and configuration, using the parameter autodetect=True for schema-free ingestion.

class MtToDataframe(beam.DoFn):

    def __init__(self, pipeline_options):
        self.execution = pipeline_options.swiftflow_execution
        self.endpoint = pipeline_options.swiftflow_endpoint

    # Comply with BQ column name requirements
    def clean_column_names(self, df):

        import re

        # Column names must not start with a digit
        df = df.rename(lambda x: '_' + x if re.match(r'([0-9])\w+', x) else x, axis=1)
        # Spaces and dots are not allowed in column names
        df = df.rename(lambda x: x.replace(" ", "_"), axis=1)
        df = df.rename(lambda x: x.replace(".", "_"), axis=1)

        return df

    def process(self, mtMessage):

        import json
        import requests as rq  # Deferred imports required to avoid variable conflicts on the Dataflow VM
        import pandas as pd
        from flatsplode import flatsplode

        # Real request
        if self.execution == "nested":

            url = self.endpoint + "/nested"
            response_str = rq.post(url, json.dumps({"data": mtMessage}))
            response_dict = json.loads(response_str.content)
            payload_str = response_dict['data']

        # Test request
        elif self.execution == "test":

            url = self.endpoint + "/example"
            response_str = rq.post(url, json.dumps({"data": "testing"}))
            response_dict = json.loads(response_str.content)
            payload_str = response_dict['data']

        # Swiftflow not available, fall back to a dummy JSON
        else:

            payload_str = '{"foo":{"bar":{"name1":"john","name2":"jane"}}}'

        # Flatsploding: flatten nested keys and explode repeated elements into rows
        payload_dict = json.loads(payload_str)
        df = pd.DataFrame(list(flatsplode(payload_dict)))
        df = self.clean_column_names(df)

        yield df

class WriteDataframeToBQ(beam.DoFn):

    def __init__(self, bq_dataset, bq_table, project_id):
        self.bq_dataset = bq_dataset
        self.bq_table = bq_table
        self.project_id = project_id

    def start_bundle(self):

        from google.cloud import bigquery
        self.client = bigquery.Client(project=self.project_id)

    def process(self, df):

        import logging
        from google.cloud import bigquery

        table_id = f"{self.project_id}.{self.bq_dataset}.{self.bq_table}"

        # Schema-free ingestion: autodetect plus the schema update options let
        # the table grow new columns as new fields appear in the messages
        job_config = bigquery.LoadJobConfig(
            autodetect=True,
            schema_update_options=[
                bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
                bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
            ],
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED
        )

        try:
            # load_table_from_dataframe serializes the dataframe itself,
            # so no source_format needs to be forced here
            load_job = self.client.load_table_from_dataframe(
                df,
                table_id,
                job_config=job_config
            )

            load_job.result()
            if load_job.errors:
                logging.info(f"result={load_job.error_result}")
                logging.info(f"errors = {load_job.errors}")
            else:
                logging.info(f'Loaded {len(df)} rows.')

        except Exception as error:
            logging.info(f'Error while loading dataframe: {error}')
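
Before wiring these DoFns into the pipeline, MtToDataframe can be sanity-checked locally. The sketch below relies on the dummy-JSON fallback branch, so no Swiftflow endpoint is reached; _StubOptions is a hypothetical stand-in for the real pipeline options object (the same Python packages still need to be installed locally):

# Quick local sanity check of MtToDataframe using the dummy-JSON fallback branch.
class _StubOptions:
    swiftflow_execution = "dummy"            # anything other than "nested" or "test"
    swiftflow_endpoint = "http://localhost"  # unused in the fallback branch

fn = MtToDataframe(_StubOptions())
for df in fn.process("dummy MT message"):    # message content is ignored in this branch
    print(df.columns.tolist())               # ['foo_bar_name1', 'foo_bar_name2']
    print(df.head())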

Pipeline

with beam.Pipeline(options=pipeline_options) as p:

    input_subscription = f"projects/{PROJECT}/subscriptions/{SUBSCRIPTION}"

    _ = (
        p
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
                subscription=input_subscription).with_output_types(bytes)
        | "Conversion UTF-8 bytes to string" >> beam.Map(lambda msg: msg.decode('utf-8'))
        | "Transformation MT to Dataframe" >> beam.ParDo(MtToDataframe(pipeline_options))
        | "Write Raw Data to Big Query" >> beam.ParDo(
                WriteDataframeToBQ(project_id=PROJECT, bq_dataset=DATASET, bq_table=TABLE))
    )

Pipeline execution

Please refer to the previous tutorial: Swiftflow Pipeline / Part 1 – Dataflow To BigQuery (nested format).

Analytics

Securities – MT541

Raw

This is the raw message as received from the SWIFT network:

MT_541 = "{1:F01ABNACHZ8XXXX2596106739}{2:O5411345160418ICBKCNBJXBJM00897254971604181345N}{3:{108:110110149709}}{4:\r\n:16R:GENL\r\n:20C::SEME//1234567890123456\r\n:23G:NEWM\r\n:98C::PREP//20181123165256\r\n:16S:GENL\r\n:16R:TRADDET\r\n:98A::TRAD//20181123\r\n:98A::SETT//20181127\r\n:35B:ISIN CH0012138530\r\nCREDIT SUISSE GROUP\r\n:16S:TRADDET\r\n:16R:FIAC\r\n:36B::SETT//UNIT/10,\r\n:97A::SAFE//0123-1234567-05-001\r\n:94F::SAFE//NCSD/INSECHZZXXX\r\n:16S:FIAC\r\n:16R:SETDET\r\n:22F::SETR//TRAD\r\n:16R:SETPRTY\r\n:95R::DEAG/SCOM/CH123456\r\n:16S:SETPRTY\r\n:16R:SETPRTY\r\n:95P::SELL//ABCDABABXXX\r\n:97A::SAFE//123456789\r\n:16S:SETPRTY\r\n:16R:SETPRTY\r\n:95P::PSET//INSECHZZ\r\n:16S:SETPRTY\r\n:16R:AMT\r\n:19A::SETT//CHF218,4\r\n:16S:AMT\r\n:16S:SETDET\r\n-}"
BigQuery/Table

The view below shows the message transformed by Swiftflow and ingested as a tabular structure into the table.

BigQuery/Extract

A powerful feature of the approach is that a single SELECT * statement gives you a full structured list of all fields found in that message.

SELECT * FROM `swiftflow-pipeline-poc.tutorial.mt_flat`

We can observe the effect of:

  • The flattening – expresses all fields in their context (section, tag, business name) as a flat list of columns
  • The explosion – creates as many records as there are combinations of repeated elements

The latter feature makes it possible to get all attributes belonging to a specific entity in separate columns. For example, because the context is preserved from the original message, it is possible to deduce that the value Account Number = 123456789 belongs to Entity 2 in the picture below.
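To make these two effects concrete, here is a minimal, self-contained sketch of how the flatsplode package [1] used in MtToDataframe flattens nested keys and explodes repeated elements. The nested payload is a made-up example, not actual Swiftflow output:

# Minimal illustration of flattening + explosion with flatsplode.
# The nested payload below is a made-up example, not real Swiftflow output.
from flatsplode import flatsplode
import pandas as pd

payload = {
    "reference": "1234567890123456",
    "parties": [                       # repeated element
        {"role": "SELL", "bic": "ABCDABABXXX"},
        {"role": "PSET", "bic": "INSECHZZ"},
    ],
}

df = pd.DataFrame(list(flatsplode(payload)))
print(df)
# Flattening: nested keys become dotted column names (parties.role, parties.bic)
# Explosion: one row per repeated element, so two rows here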

Payments – MT910

Similar comments apply to payment messages as well: all fields are extracted, and no up-front field definition is required before starting the extraction.

Next steps

Analytics

At this stage, data is ready for:

  • Direct analysis within BigQuery using standard SQL (see the sketch after this list)
  • Business intelligence (BI) with any third-party tool, since tabular format and SQL compatibility are the standard
  • Business intelligence (BI) and dashboarding via Data Studio (button “Explore data”). Note that Data Studio is a free tool.
  • File export (button “Save results”)
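
As a minimal sketch of the first and last points, assuming the table name used earlier in this tutorial, a query can also be run from Python with the BigQuery client and exported to a file:

# Run a Standard SQL query against the flat table from Python and pull the
# result into a pandas dataframe (requires pandas and db-dtypes locally).
from google.cloud import bigquery

client = bigquery.Client(project="swiftflow-pipeline-poc")

query = """
    SELECT *
    FROM `swiftflow-pipeline-poc.tutorial.mt_flat`
    LIMIT 100
"""

df = client.query(query).to_dataframe()
print(df.shape)
df.to_csv("mt_flat_extract.csv", index=False)  # simple file export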

Machine learning

Possible next steps are:

  • Use BigQuery ML, which lets you create and execute machine learning models using standard SQL queries
  • Develop machine learning applications using the Python client for BigQuery

Indeed, the holistic tabular format is the preferred approach for machine learning development, and potentially for its execution, as it makes all fields available for feature exploration and selection during the training phase, thereby avoiding data loss and the resulting performance limitations of the models. A minimal BigQuery ML sketch is shown below.
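As an illustration of the first option, here is a minimal, hypothetical BigQuery ML sketch run through the Python client; the columns amount, currency and is_flagged are placeholders, not actual columns produced by the pipeline:

# Hypothetical BigQuery ML sketch: train a logistic regression directly in SQL.
# amount, currency and is_flagged are placeholder column names.
from google.cloud import bigquery

client = bigquery.Client(project="swiftflow-pipeline-poc")

create_model = """
    CREATE OR REPLACE MODEL `swiftflow-pipeline-poc.tutorial.mt_model`
    OPTIONS(model_type = 'logistic_reg', input_label_cols = ['is_flagged']) AS
    SELECT amount, currency, is_flagged
    FROM `swiftflow-pipeline-poc.tutorial.mt_flat`
"""

client.query(create_model).result()  # waits for model training to finish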

Sources

  1. https://pypi.org/project/flatsplode/
  2. https://medium.com/inside-league/loading-complex-json-files-in-realtime-to-bigquery-from-pubsub-using-dataflow-and-updating-the-49b420838bac