⚡ Calmops


Data Engineering Hub

Practical guidance for building reliable, observable, and cost-effective data pipelines and platforms. This hub covers batch and streaming ETL/ELT, data lakehouse patterns, orchestration, real-time processing, data quality, governance, and the tools widely used in 2025–2026.


🚀 Getting started

New to data engineering? Start with the fundamentals material in the learning paths below.


📚 Main categories

🧱 Core Patterns & Architectures (10+ articles)

Design patterns and high-level architectures for reliable data platforms.

  • Data lake vs data warehouse vs lakehouse
  • ETL vs ELT patterns
  • Data mesh & domain-oriented ownership
  • Change data capture (CDC) and event-driven pipelines

⚙️ Orchestration & Workflow (8+ articles)

Tools and patterns for managing pipelines.

  • Apache Airflow: DAG design, scaling, sensors
  • Dagster, Prefect: modern orchestration alternatives
  • Scheduling strategies, retries, and backfills
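Retries with exponential backoff are the pattern every orchestrator above implements for you (Airflow exposes them as per-task settings such as `retries` and `retry_delay`). A minimal sketch of the core idea in plain Python, with illustrative function names that belong to no particular tool:

```python
import time

def run_with_retries(task, max_retries=3, base_delay=1.0):
    """Run a task callable, retrying on failure with exponential backoff.

    Orchestrators configure this per task; this is the underlying logic.
    """
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the failure to the scheduler
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Example: a flaky task that succeeds on its third attempt.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source error")
    return "extracted 42 rows"

print(run_with_retries(flaky_extract, base_delay=0.01))  # prints: extracted 42 rows
```

In a real DAG you would let the scheduler own this loop rather than sleeping inside a task; the sketch only shows why bounded, exponentially spaced retries absorb transient source errors without hammering the upstream system.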

⚡ Streaming & Real-time (12+ articles)

Low-latency data movement and processing.

  • Kafka fundamentals: topics, partitions, consumer groups
  • Stream processing: Flink, Kafka Streams, ksqlDB
  • Exactly-once semantics, windowing, watermarking
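Windowing and watermarks can be sketched without a streaming engine. Below, a simplified model of a tumbling-window count: the watermark trails the maximum observed event time by an allowed lateness, and events whose window has already closed are routed to a side output. Flink and Kafka Streams manage this state and timing for you; the constants and names here are illustrative only:

```python
from collections import defaultdict

WINDOW = 10           # tumbling window size, in event-time seconds
ALLOWED_LATENESS = 5  # watermark trails max observed event time by this much

def window_start(ts):
    return ts - (ts % WINDOW)

def process(events):
    """Count events per (window, key), discarding too-late arrivals.

    events: iterable of (event_time_seconds, key) pairs.
    Returns (counts dict, late-events list).
    """
    counts = defaultdict(int)
    max_event_time = 0
    late = []
    for ts, key in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - ALLOWED_LATENESS
        if window_start(ts) + WINDOW <= watermark:
            late.append((ts, key))  # window already closed: side output
            continue
        counts[(window_start(ts), key)] += 1
    return dict(counts), late

counts, late = process([(1, "a"), (4, "a"), (12, "a"), (25, "a"), (3, "a")])
print(counts)  # {(0, 'a'): 2, (10, 'a'): 1, (20, 'a'): 1}
print(late)    # [(3, 'a')] -- arrived after its window closed
```

The key intuition: the watermark is a claim that no earlier events will arrive; allowed lateness trades result latency against completeness.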

🧩 Data Transformation & Modeling (10+ articles)

Transformations, incremental pipelines, and analytics models.

  • dbt for analytics engineering and model testing
  • SQL-based transformations vs code-based transformations
  • Dimensional modeling, slowly changing dimensions (SCD)
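SCD Type 2 keeps history by closing the current dimension row and appending a new version. A minimal in-memory sketch of that upsert logic (dbt snapshots and warehouse MERGE statements implement the same idea; the column names here are made up for illustration):

```python
def scd2_apply(dim_rows, key, new_attrs, effective_date):
    """Apply an SCD Type 2 change: expire the current row, append a new one.

    dim_rows: list of dicts with 'key', attribute fields,
              'valid_from', and 'valid_to' (None marks the current row).
    """
    current = next(
        (r for r in dim_rows if r["key"] == key and r["valid_to"] is None),
        None,
    )
    if current is not None:
        if all(current.get(k) == v for k, v in new_attrs.items()):
            return dim_rows  # attributes unchanged: no new version
        current["valid_to"] = effective_date  # close out the old version
    dim_rows.append(
        {"key": key, **new_attrs, "valid_from": effective_date, "valid_to": None}
    )
    return dim_rows

dim = [{"key": "c1", "city": "Oslo", "valid_from": "2025-01-01", "valid_to": None}]
scd2_apply(dim, "c1", {"city": "Bergen"}, "2025-06-01")
# dim now holds two rows: the expired Oslo row and the current Bergen row.
```

Note the no-op branch: re-running the same change is idempotent, which matters when the upstream pipeline retries.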

๐Ÿ—„๏ธ Storage & Lakehouse (10+ articles)

Storage choices and analytical engines.

  • Object storage patterns (S3, MinIO) and partitioning
  • Lakehouse architectures: Delta Lake, Iceberg, Hudi
  • Analytics engines: ClickHouse, DuckDB, Snowflake
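Hive-style `col=value` partitioning is the object-store layout most engines (Spark, Trino, and table formats in their directory-based layouts) understand. A small helper that builds such keys; the bucket and column names are hypothetical:

```python
from datetime import date

def partition_path(prefix, table, row, partition_cols):
    """Build a Hive-style object key: prefix/table/col1=v1/col2=v2/..."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([prefix, table, *parts])

row = {"event_date": date(2026, 1, 15), "region": "eu", "payload": "..."}
key = partition_path("s3://my-lake/raw", "events", row, ["event_date", "region"])
print(key)  # s3://my-lake/raw/events/event_date=2026-01-15/region=eu
```

Partition columns should match the dominant query filters (most often date, then a low-cardinality dimension); over-partitioning produces many small objects, which is the usual object-store performance trap.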

🧪 Data Quality & Observability (6+ articles)

Ensuring data correctness and pipeline health.

  • Monitoring: metrics, logs, and tracing for data jobs
  • Data testing: assertions, schema checks, Great Expectations
  • Lineage and provenance for debugging and audits
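The assertion style popularized by Great Expectations and dbt tests can be hand-rolled for small pipelines. A sketch of per-batch schema and not-null checks that collects failure messages instead of raising on the first problem (the function and check names are illustrative, not from any library):

```python
def check_batch(rows, schema, not_null):
    """Validate a batch of records against expected types and null rules.

    schema: {column: expected_python_type}; not_null: required columns.
    Returns a list of human-readable failure messages (empty = pass).
    """
    failures = []
    for i, row in enumerate(rows):
        for col, expected_type in schema.items():
            if col not in row:
                failures.append(f"row {i}: missing column {col!r}")
            elif row[col] is not None and not isinstance(row[col], expected_type):
                failures.append(f"row {i}: {col!r} is not {expected_type.__name__}")
        for col in not_null:
            if row.get(col) is None:
                failures.append(f"row {i}: {col!r} is null")
    return failures

rows = [{"id": 1, "amount": 9.5}, {"id": None, "amount": "oops"}]
errs = check_batch(rows, schema={"id": int, "amount": float}, not_null=["id"])
# errs reports the non-float amount and the null id in row 1.
```

Collecting all failures per batch, rather than failing fast, is what makes the report useful for debugging an upstream producer.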

๐Ÿ” Governance & Security (6+ articles)

Policies, access control, and compliance.

  • Data cataloging and discovery
  • Masking, encryption, and PII handling
  • Data contracts and SLAs between producers and consumers
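Masking can be as simple as deterministic salted hashing of identifier fields: equal inputs map to equal tokens, so masked columns remain joinable while raw PII never reaches downstream consumers. A sketch with simplified salt handling and hypothetical field names (real deployments pull the salt from a secrets manager and rotate it under a documented policy):

```python
import hashlib

SALT = b"rotate-me-regularly"  # assumption: sourced from a secrets manager

def mask(value: str) -> str:
    """Deterministically pseudonymize a PII value."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

def mask_record(record, pii_fields):
    """Return a copy of the record with listed fields pseudonymized."""
    return {k: mask(v) if k in pii_fields else v for k, v in record.items()}

rec = {"email": "ada@example.com", "plan": "pro"}
masked = mask_record(rec, pii_fields={"email"})
# masked["email"] is a stable 16-char token; masked["plan"] is untouched.
```

Deterministic masking preserves join keys but is weaker than randomized tokenization against guessing attacks on low-entropy values, which is why the salt and its rotation policy matter.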

๐Ÿ› ๏ธ Tooling & Ecosystem (10+ articles)

Tool comparisons and integration patterns.

  • Kafka ecosystem: Confluent, Redpanda, Kafka Connect
  • CDC tooling: Debezium, Airbyte, Fivetran
  • Orchestration + transformation combos: Airflow + dbt, Dagster + SQL

🎯 Learning paths

Path 1: Data Engineering Fundamentals (4–8 weeks)

  1. Data Engineering Fundamentals – core concepts and vocabulary
  2. ETL vs ELT comparison – choose pipeline types
  3. Batch orchestration with Airflow – scheduling and retries
  4. Basic monitoring and alerts for pipelines
    Outcome: Build, schedule, and monitor a simple, reliable batch pipeline.

Path 2: Streaming Engineer (8–12 weeks)

  1. Kafka fundamentals – topics, partitioning, consumer patterns
  2. Stream processing with Flink or Kafka Streams – windowing & state
  3. Exactly-once and consistency patterns – transactional sinks, idempotency
  4. Observability for streaming pipelines – metrics and end-to-end tracing
    Outcome: Deliver a low-latency streaming pipeline with measurable SLAs.
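Step 3 of this path deserves a concrete picture: "exactly-once" in practice usually means at-least-once delivery combined with an idempotent sink. A sketch in which the sink deduplicates on event IDs so redelivered messages have no effect; the in-memory dicts stand in for a transactional store where the data and the seen-ID set would be committed atomically:

```python
class IdempotentSink:
    """At-least-once delivery + dedup on event_id = effectively-once writes."""

    def __init__(self):
        self.store = {}
        self.seen = set()  # in production: committed atomically with store

    def write(self, event_id, key, value):
        if event_id in self.seen:
            return False  # duplicate redelivery: ignore
        self.store[key] = value
        self.seen.add(event_id)
        return True

sink = IdempotentSink()
sink.write("evt-1", "user:7", {"balance": 100})
sink.write("evt-1", "user:7", {"balance": 100})  # redelivered: no double-apply
```

The same effect can come from naturally idempotent writes (keyed upserts), in which case no seen-set is needed; the dedup set is for sinks where re-applying would compound (counters, appends).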

Path 3: Analytics & Lakehouse (6–10 weeks)

  1. Storage fundamentals – object stores and partitioning strategies
  2. dbt for transformations – models, testing, and documentation
  3. Lakehouse technologies – Delta/Iceberg/Hudi + query engines (ClickHouse/Snowflake)
  4. Cost optimization and cluster sizing for analytics workloads
    Outcome: Deploy an end-to-end analytics pipeline delivering reliable BI datasets.

Path 4: Platform Engineer for Data (10–16 weeks)

  1. Design a data platform: multi-tenant, domain ownership, data mesh concepts
  2. Implement CI/CD for pipelines and dbt models
  3. Implement data governance, lineage, and RBAC
  4. Automate onboarding for new data producers and consumers
    Outcome: Operate and scale a data platform used by multiple teams.

📊 Key statistics (site snapshot)

  • Hub articles: 80+ (including orchestration, streaming, lakehouse, and tooling)
  • Common tools covered: Kafka, Airflow, dbt, Flink, Spark, ClickHouse, DuckDB, Debezium, Airbyte, Snowflake
  • Typical production targets: pipeline latency (batch: minutes to hours; streaming: sub-second to seconds), data freshness SLAs, and a percentage error budget for upstream failures

🔗 Quick reference

Orchestration comparison

| Tool    | Best for            | Strengths                              |
| ------- | ------------------- | -------------------------------------- |
| Airflow | Batch orchestration | Mature, extensible, large ecosystem    |
| Dagster | Testable pipelines  | Type-safe pipelines, good developer UX |
| Prefect | Hybrid scheduling   | Modern API, easier retries and flows   |

Streaming decision matrix

| Concern                | Kafka + Flink             | Kafka Streams | ksqlDB     |
| ---------------------- | ------------------------- | ------------- | ---------- |
| Stateful processing    | Excellent                 | Good          | Limited    |
| Operational complexity | Higher                    | Lower         | Low–medium |
| SQL-like transforms    | Needs code (or Flink SQL) | Java/Scala    | Native SQL |

CDC vs Batch

  • Use CDC when you need low-latency replication and event-driven systems.
  • Use batch for bulk transforms, heavy aggregations, or when source load must be minimized.
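Whatever CDC tool produces the stream (Debezium's envelope carries `op`, `before`, and `after` fields), consuming it reduces to replaying keyed insert/update/delete operations. A sketch of applying such a change stream to a local snapshot, using a deliberately simplified event shape rather than the exact Debezium format:

```python
def apply_cdc(snapshot, events):
    """Replay a CDC change stream onto a dict keyed by primary key.

    Each event: {"op": "c"|"u"|"d", "key": pk, "after": row or None},
    loosely mirroring Debezium's create/update/delete operations.
    """
    for ev in events:
        if ev["op"] in ("c", "u"):
            snapshot[ev["key"]] = ev["after"]  # upsert the new row image
        elif ev["op"] == "d":
            snapshot.pop(ev["key"], None)      # tombstone: remove the row
    return snapshot

table = {}
apply_cdc(table, [
    {"op": "c", "key": 1, "after": {"name": "alpha"}},
    {"op": "c", "key": 2, "after": {"name": "beta"}},
    {"op": "u", "key": 1, "after": {"name": "alpha-2"}},
    {"op": "d", "key": 2, "after": None},
])
# table converges to the source's current state: key 1 only.
```

Replaying from the start of the stream always converges to the source's current state, which is why CDC pairs naturally with an initial snapshot plus ordered, per-key change events.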

📚 Browse all articles

Click to expand complete article list (80+ articles)

(Full list preserved in repository folders; expand individual article pages for details.)


🎓 Who this hub is for

  • Data engineers building and operating ETL/ELT and streaming systems
  • Platform engineers creating self-service data platforms and pipelines
  • Analytics engineers using dbt and SQL to produce reliable BI datasets
  • SREs and operators responsible for data pipeline reliability and costs
  • Product engineers integrating real-time data into user-facing features

📖 External resources