Modern data teams don’t just need answers — they need answers as events happen.
In this guide, we walk through a scalable, production-ready pipeline for real-time and interactive analytics: Google Cloud Pub/Sub for real-time ingestion, Apache Beam on Dataflow for stream processing, and BigQuery for storage and querying. This architecture delivers streaming insights with sub-second availability. Flexibility is built in: for simpler use cases, teams can skip Dataflow entirely and stream data directly from Pub/Sub to BigQuery using native subscriptions, making events available for querying as soon as they land.
At a high level, this pipeline supports two common configurations:
Option 1: Standard Architecture (with Dataflow)
[ Event Producers ] → [ Pub/Sub ]
↓
[ Dataflow Pipeline ]
↓
[ BigQuery Tables ]
Option 2: Simplified (Serverless) Architecture
[ Event Producers ] → [ Pub/Sub ]
↓
[ BigQuery Tables ]
Step 1: Ingest Events with Pub/Sub
Pub/Sub serves as the real-time ingestion layer for event streams. It is a fully managed, serverless messaging service that provides high availability through synchronous, cross-zone message replication, ensuring reliable delivery at any scale. Built on the publish/subscribe pattern, it enables high-throughput, asynchronous data ingestion and decouples event producers from downstream processing. It supports at-least-once message delivery and offers optional per-key ordered delivery. Native integrations with other Google Cloud services make it an ideal entry point for real-time pipelines feeding Dataflow for stream processing or BigQuery for streaming ingestion. Security is built in, with fine-grained access controls and always-on encryption.
$ gcloud pubsub topics create events-stream
$ gcloud pubsub subscriptions create events-sub \
--topic=events-stream \
--enable-message-ordering
Best Practices
Use dead-letter topics for failed or malformed messages (see the example after this list)
Enable message ordering if sequence matters
Set message retention to 7 days for replay support
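As a sketch of the dead-letter and retention practices above, the commands below create a hypothetical events-dead-letter topic and attach it to the subscription from Step 1 with a seven-day retention window; the names and the delivery-attempt limit are illustrative assumptions, and note that the Pub/Sub service account also needs permission to publish to the dead-letter topic.
$ gcloud pubsub topics create events-dead-letter
$ gcloud pubsub subscriptions update events-sub \
--dead-letter-topic=events-dead-letter \
--max-delivery-attempts=5 \
--message-retention-duration=7d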
Step 2: Process Streams with Dataflow
For teams that need event enrichment, normalization, or windowed aggregations (using Beam's state, timer, and windowing features such as FixedWindows), Google Cloud Dataflow offers full control over these more complex stream processing tasks. Dataflow is a fully managed streaming platform built on the open source Apache Beam SDK, and it supports scalable ETL pipelines, real-time stream analytics, real-time ML, and complex data transformations using Beam's unified model for both batch and streaming processing.
Here's an example in Python:
import apache_beam as beam
import json

def transform_event(event):
    # Normalize the event type before it is written downstream.
    event['event_type'] = event['event_type'].lower()
    return event

# The pipeline options should enable streaming mode for the Pub/Sub source.
with beam.Pipeline(options=...) as p:
    (
        p
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(subscription='...')
        | 'ParseJSON' >> beam.Map(json.loads)
        | 'Window' >> beam.WindowInto(beam.window.FixedWindows(60))
        | 'Transform' >> beam.Map(transform_event)
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            table='project.dataset.events',
            schema='event_type:STRING,event_time:TIMESTAMP,...',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
        )
    )
Dataflow Tips
Use FixedWindows for real-time rollups
Set allowed_lateness to catch delayed events (see the sketch after these tips)
Leverage schema-aware PTransforms for clarity and validation
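As a sketch of the first two tips, the 'Window' step in the pipeline above could be replaced with an explicit trigger and lateness policy; the two-minute lateness and discarding mode below are illustrative assumptions, not recommendations.
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

# Drop-in replacement for the 'Window' step: 60-second fixed windows,
# results emitted at the watermark, late data accepted for up to 2 minutes.
| 'Window' >> beam.WindowInto(
    beam.window.FixedWindows(60),
    trigger=AfterWatermark(),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=120
)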
Step 3: Stream into and Query from BigQuery
Whether data is first processed by Dataflow or streamed directly from Pub/Sub using native BigQuery subscriptions, BigQuery supports streaming inserts with sub-second availability. Ingested data is immediately available for querying, which makes BigQuery well suited to powering live dashboards and real-time alerting systems that need insights from the freshest data.
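For the simplified architecture (Option 2), a native BigQuery subscription might be created roughly as follows; the subscription name and table are assumptions that mirror the earlier examples, the target table must already exist with a compatible schema, and the Pub/Sub service account needs write access to it.
$ gcloud pubsub subscriptions create events-bq-sub \
--topic=events-stream \
--bigquery-table=project:dataset.events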
SELECT
  event_type,
  COUNT(*) AS event_count,
  TIMESTAMP_TRUNC(event_time, MINUTE) AS bucket
FROM `project.dataset.events`
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY bucket, event_type
ORDER BY bucket DESC
Schema Best Practices
Partition by event_time
Cluster by user_id, event_type, or other high-cardinality fields (see the table sketch below)
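As a sketch of both practices, a partitioned and clustered events table might be declared as follows; the column list is an assumption based on the fields used earlier in this guide.
-- Daily partitions let queries prune by event_time; clustering co-locates
-- rows sharing user_id and event_type within each partition.
CREATE TABLE `project.dataset.events` (
  event_type STRING,
  user_id STRING,
  event_time TIMESTAMP
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type;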
Security Enhancements
Use least-privilege IAM roles and clearly scoped service accounts
Enable CMEK for encryption (a command sketch follows this list)
Use VPC-SC to restrict access perimeter
Turn on audit logs for full traceability
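As one example of the CMEK enhancement, the ingestion topic could be created with a customer-managed key; the key path below is a placeholder, and equivalent CMEK options exist for BigQuery and Dataflow.
$ gcloud pubsub topics create events-stream \
--topic-encryption-key=projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING/cryptoKeys/KEY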
Common Pitfalls to Watch
Hot Keys: Uneven key distribution can throttle Dataflow workers.
Backpressure: Always implement retries and backoff on the producer side (see the sketch below).
Late Events: Use allowed_lateness and choose deliberately between accumulating and discarding triggers.
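As a sketch of the backpressure advice, a producer can wrap publishes in a simple retry loop with exponential backoff; the google-cloud-pubsub client already retries many transient errors internally, so treat this as an illustrative pattern, and the project and topic names are assumptions carried over from Step 1.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project ID; the topic matches the one created in Step 1.
topic_path = publisher.topic_path('my-project', 'events-stream')

def publish_with_backoff(event, max_attempts=5):
    """Publish one event, backing off exponentially between failed attempts."""
    data = json.dumps(event).encode('utf-8')
    for attempt in range(max_attempts):
        try:
            future = publisher.publish(topic_path, data)
            future.result(timeout=30)  # raises if the publish ultimately failed
            return
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...

publish_with_backoff({'event_type': 'page_view', 'event_time': '2024-01-01T00:00:00Z'})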
Ultimately, this architectural choice, whether you use the full processing power of Google Cloud Dataflow for complex stream transformations or the simplicity of direct Pub/Sub-to-BigQuery streaming via native subscriptions, gives you the flexibility to align your real-time analytics system with your latency requirements, your processing complexity (from simple ingestion to intricate transformations), and your cost objectives, while scaling as data volumes grow.