Modern data teams don’t just need answers — they need answers as events happen.
In this guide, we walk through a scalable, production-ready pipeline for real-time and interactive analytics: Google Cloud Pub/Sub for real-time ingestion, Apache Beam on Dataflow for stream processing, and BigQuery for storage and querying. This architecture delivers streaming insights with sub-second availability. Flexibility is built in: for simpler use cases, teams can skip Dataflow entirely and stream data directly from Pub/Sub to BigQuery using native subscriptions, making events available for querying as soon as they land.
At a high level, this pipeline supports two common configurations:
Option 1: Standard Architecture (with Dataflow)
[ Event Producers ] → [ Pub/Sub ]
↓
[ Dataflow Pipeline ]
↓
[ BigQuery Tables ]
Option 2: Simplified (Serverless) Architecture
[ Event Producers ] → [ Pub/Sub ]
↓
[ BigQuery Tables ]
Step 1: Ingest Events with Pub/Sub
Pub/Sub serves as the real-time ingestion layer for event streams. It is a fully managed, serverless messaging service that provides high availability through synchronous, cross-zone message replication, ensuring reliable delivery at any scale. Built on the publish/subscribe pattern, it enables high-throughput, asynchronous data ingestion and decouples event producers from downstream processing. It supports at-least-once message delivery and offers optional per-key ordered delivery. Native integrations with other Google Cloud services make it an ideal entry point for real-time pipelines feeding Dataflow for stream processing or BigQuery for streaming ingestion. Security is built in, with fine-grained access controls and always-on encryption.
$ gcloud pubsub topics create events-stream
$ gcloud pubsub subscriptions create events-sub \
--topic=events-stream \
--enable-message-ordering
Best Practices
Use dead-letter topics for failed or malformed messages (see the example after this list)
Enable message ordering if sequence matters
Set message retention to 7 days for replay support
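As a sketch of the dead-letter and retention practices above, the commands below create a hypothetical events-dead-letter topic and attach it to the subscription from Step 1 with a seven-day retention window; the names and the delivery-attempt limit are illustrative assumptions, and note that the Pub/Sub service account also needs permission to publish to the dead-letter topic.
$ gcloud pubsub topics create events-dead-letter
$ gcloud pubsub subscriptions update events-sub \
--dead-letter-topic=events-dead-letter \
--max-delivery-attempts=5 \
--message-retention-duration=7d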
Step 2: Process Streams with Dataflow
For teams that need event enrichment, normalization, or windowed aggregations (using Beam's state, timer, and windowing features such as FixedWindows), Google Cloud Dataflow offers full control over these more complex stream processing tasks. Dataflow is a fully managed streaming platform built on the open source Apache Beam SDK, and it supports scalable ETL pipelines, real-time stream analytics, real-time ML, and complex data transformations using Beam's unified model for both batch and streaming processing.
Here's an example in Python:
import apache_beam as beam
import json

def transform_event(event):
    # Normalize the event type before it is written downstream.
    event['event_type'] = event['event_type'].lower()
    return event

# The pipeline options should enable streaming mode for the Pub/Sub source.
with beam.Pipeline(options=...) as p:
    (
        p
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(subscription='...')
        | 'ParseJSON' >> beam.Map(json.loads)
        | 'Window' >> beam.WindowInto(beam.window.FixedWindows(60))
        | 'Transform' >> beam.Map(transform_event)
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            table='project.dataset.events',
            schema='event_type:STRING,event_time:TIMESTAMP,...',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
        )
    )
Dataflow Tips
Use FixedWindows for real-time rollups
Set allowed_lateness to catch delayed events (see the sketch after these tips)
Leverage schema-aware PTransforms for clarity and validation
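As a sketch of the first two tips, the 'Window' step in the pipeline above could be replaced with an explicit trigger and lateness policy; the two-minute lateness and discarding mode below are illustrative assumptions, not recommendations.
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

# Drop-in replacement for the 'Window' step: 60-second fixed windows,
# results emitted at the watermark, late data accepted for up to 2 minutes.
| 'Window' >> beam.WindowInto(
    beam.window.FixedWindows(60),
    trigger=AfterWatermark(),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=120
)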
Step 3: Stream into and Query from BigQuery
Whether data is first processed by Dataflow or streamed directly from Pub/Sub using native BigQuery subscriptions, BigQuery supports streaming inserts with sub-second availability. Ingested data is immediately available for querying, which makes BigQuery well suited to powering live dashboards and real-time alerting systems that need insights from the freshest data.
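For the simplified architecture (Option 2), a native BigQuery subscription might be created roughly as follows; the subscription name and table are assumptions that mirror the earlier examples, the target table must already exist with a compatible schema, and the Pub/Sub service account needs write access to it.
$ gcloud pubsub subscriptions create events-bq-sub \
--topic=events-stream \
--bigquery-table=project:dataset.events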
SELECT
  event_type,
  COUNT(*) AS event_count,
  TIMESTAMP_TRUNC(event_time, MINUTE) AS bucket
FROM `project.dataset.events`
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY bucket, event_type
ORDER BY bucket DESC
Schema Best Practices
Partition by event_time
Cluster by user_id, event_type, or other high-cardinality fields (see the table sketch below)
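As a sketch of both practices, a partitioned and clustered events table might be declared as follows; the column list is an assumption based on the fields used earlier in this guide.
-- Daily partitions let queries prune by event_time; clustering co-locates
-- rows sharing user_id and event_type within each partition.
CREATE TABLE `project.dataset.events` (
  event_type STRING,
  user_id STRING,
  event_time TIMESTAMP
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type;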
Security Enhancements
Use least-privilege IAM roles and clearly scoped service accounts
Enable CMEK for encryption (a command sketch follows this list)
Use VPC-SC to restrict access perimeter
Turn on audit logs for full traceability
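As one example of the CMEK enhancement, the ingestion topic could be created with a customer-managed key; the key path below is a placeholder, and equivalent CMEK options exist for BigQuery and Dataflow.
$ gcloud pubsub topics create events-stream \
--topic-encryption-key=projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING/cryptoKeys/KEY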
Common Pitfalls to Watch
Hot Keys: Uneven key distribution can throttle Dataflow workers.
Backpressure: Always implement retries and backoff on the producer side (see the sketch below).
Late Events: Use allowed_lateness and choose deliberately between accumulating and discarding triggers.
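As a sketch of the backpressure advice, a producer can wrap publishes in a simple retry loop with exponential backoff; the google-cloud-pubsub client already retries many transient errors internally, so treat this as an illustrative pattern, and the project and topic names are assumptions carried over from Step 1.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project ID; the topic matches the one created in Step 1.
topic_path = publisher.topic_path('my-project', 'events-stream')

def publish_with_backoff(event, max_attempts=5):
    """Publish one event, backing off exponentially between failed attempts."""
    data = json.dumps(event).encode('utf-8')
    for attempt in range(max_attempts):
        try:
            future = publisher.publish(topic_path, data)
            future.result(timeout=30)  # raises if the publish ultimately failed
            return
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...

publish_with_backoff({'event_type': 'page_view', 'event_time': '2024-01-01T00:00:00Z'})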
Ultimately, this architectural choice, whether you use the full processing power of Google Cloud Dataflow for complex stream transformations or the simplicity of direct Pub/Sub-to-BigQuery streaming via native subscriptions, gives you the flexibility to align your real-time analytics system with your latency requirements, your processing complexity (from simple ingestion to intricate transformations), and your cost objectives, while scaling as data volumes grow.