Data Engineering for Streaming Data on GCP

Rosario0g3nio
2 min read · Sep 28, 2022

In this post I’ll give a very short explanation of what a streaming data pipeline looks like on GCP.

The main advantage of a streaming data pipeline is that you can analyze the data in real time, which is fundamental when data-driven decisions have to be made as soon as possible.

To build a real-time data solution on Google Cloud, there are three main steps to take into account, and the end-to-end architecture looks as described below:

1. Ingest streaming data using Pub/Sub (Pub/Sub stands for Publisher/Subscriber, and it handles distributed message-oriented architectures at scale). Pub/Sub’s main characteristics:

• Ensures at-least-once delivery;
• No provisioning is required;
• APIs are open;
• Global by default;
• Offers end-to-end encryption.

2. Process the data with Dataflow.

3. Visualize the results with Google Data Studio and Looker.

The data ingestion step is the earliest stage of a data pipeline.

Streaming data from all kinds of sources (e.g. mobile devices, autonomous cars, gaming consoles…) is ingested into Pub/Sub.

Pub/Sub reads, stores, and broadcasts to every subscriber of a topic that new messages are available.
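For instance, a producer can publish device events to a topic with the Pub/Sub client library. Here is a minimal publishing sketch in Python, assuming the google-cloud-pubsub package is installed; the project ID “my-project” and the topic “device-events” are hypothetical placeholders.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names; replace with your own resources.
topic_path = publisher.topic_path("my-project", "device-events")

# Pub/Sub messages carry raw bytes, so the event payload is JSON-encoded.
event = {"device_id": "car-42", "speed_kmh": 87, "ts": "2022-09-28T12:00:00Z"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))

# publish() is asynchronous; result() blocks until the server acknowledges
# the message and returns its message ID.
print(future.result())
```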

As a subscriber, Dataflow can ingest and transform those messages in an elastic streaming pipeline and output the results into an analytics data warehouse like BigQuery.
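As a sketch of what such a Dataflow job could look like, here is a minimal Apache Beam streaming pipeline in Python, assuming the apache-beam[gcp] package; the topic, project, and table names are hypothetical, and the BigQuery table is assumed to already exist.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming mode is required for unbounded sources like Pub/Sub.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Read raw message bytes from the (hypothetical) topic.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/device-events")
        # Decode and parse each message into a dict matching the table schema.
        | "ParseJson" >> beam.Map(lambda data: json.loads(data.decode("utf-8")))
        # Append rows to an existing BigQuery table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:telemetry.device_events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```

Run locally, this uses Beam’s direct runner; passing the standard Dataflow options (e.g. --runner=DataflowRunner plus a project and region) executes the same pipeline as a managed, autoscaling Dataflow job.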

And finally, you can connect a data visualization tool like Data Studio to visualize and monitor the results of the pipeline, or an ML tool to explore the data, extract useful insights, and build data-driven decisions or predictive models from it.
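Data Studio and Looker connect to BigQuery directly, but you can also query the results table yourself. Here is a minimal sketch using the google-cloud-bigquery client library, against the hypothetical table from the pipeline above; a dashboard would run similar SQL over the same table.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Average speed per device over the last hour of ingested events.
query = """
    SELECT device_id, AVG(speed_kmh) AS avg_speed
    FROM `my-project.telemetry.device_events`
    WHERE ts > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    GROUP BY device_id
"""

# query() submits the job; result() waits for it and yields the rows.
for row in client.query(query).result():
    print(row.device_id, row.avg_speed)
```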


Rosario0g3nio

Just exploring the world of ML and Deep Learning and sharing my journey! Might also write about startups, SaaS and SE in general.