Troubleshooting Library
Real errors. Real fixes. No Stack Overflow copypasta.
This is a living document. Community members contribute their war stories so the next person doesn’t lose 3 hours to the same bug.
Docker & Containers
Container exits immediately / Exited (0)
Symptom: docker-compose up runs but containers immediately stop.
Cause: No foreground process to keep the container alive.
Fix: Make sure your entrypoint/CMD runs in the foreground, not as a daemon. Example for Airflow: use airflow standalone, not airflow webserver -D.
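As an illustrative sketch (service name and image tag are assumptions, not from this guide), a docker-compose service that keeps the container alive looks like:

```yaml
services:
  airflow:
    image: apache/airflow:2.9.0   # illustrative tag
    # `airflow standalone` runs in the foreground, so PID 1 stays alive.
    # A daemonized command (`airflow webserver -D`) forks and exits,
    # and Docker reports Exited (0).
    command: airflow standalone
```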
“Permission denied” on volume mounts (Mac/Linux)
Symptom: Container writes fail with permission errors on mounted directories.
Fix: Set the correct owner in the Dockerfile:
```dockerfile
RUN chown -R 1000:1000 /opt/airflow/logs
```

Or use user: "${UID}:${GID}" in docker-compose.yml.
Postgres container keeps restarting
Cause: Usually corrupt data directory from a previous failed initialization.
Fix:
```sh
docker-compose down -v   # removes volumes too
docker-compose up
```

Warning: -v deletes all data.
Apache Airflow
DAG not appearing in UI
Causes & fixes:
- Syntax error — run python your_dag.py directly to see the error
- DAG is in the wrong folder — check dags_folder in airflow.cfg
- Scheduler hasn’t picked it up yet — wait 30s or restart the scheduler
- catchup=True with an old start_date — set catchup=False or use a recent start_date
Scheduler shows zombie tasks / tasks stuck in “running”
Cause: Workers died mid-task, leaving orphaned task instances.
Fix:
```sh
airflow tasks clear <dag_id> -s <start_date> -e <end_date> --yes
```

Or use the UI: Browse → Task Instances → filter by state “running” → Mark Failed.
ModuleNotFoundError inside Airflow tasks
Cause: Package installed in host env but not in Airflow’s Python env.
Fix: Install inside the container, or use a custom Docker image. For docker-compose:
```yaml
x-airflow-common: &airflow-common
  build:
    context: .
    dockerfile: Dockerfile.airflow
```

dbt
Database Error: column does not exist
Cause: Model references a column that doesn’t exist yet in the source or upstream model.
Fixes:
- Run dbt compile to see the full SQL before execution
- Check that your ref() points to the correct model
- Run dbt run --select <upstream_model>+ to build deps first
Tests taking forever (2+ hours)
Cause: Running all tests sequentially on large tables.
Fix:
- Use --threads 4 (or more) to parallelize
- Add WHERE filters to custom tests
- Use store_failures: true in dbt_project.yml to write failing rows to an audit table for faster debugging (note: it does not skip passing tests)
Compilation Error: depends on a node named 'X' which was not found
Cause: Model name mismatch between ref('model_name') and the actual filename.
Fix: dbt model names are case-sensitive. ref('Orders') ≠ ref('orders').
PySpark
Job runs slower than equivalent Python script
Cause: Usually too many small partitions, or data is not actually distributed.
Fix:
```python
# Check partition count
df.rdd.getNumPartitions()

# Repartition for better parallelism
df = df.repartition(200)  # rule of thumb: 2-3x num cores

# For joins, broadcast small tables
from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "key")
```

OutOfMemoryError: GC overhead limit exceeded
Cause: Executor memory too low, or data skew causing one partition to hold too much.
Fix:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()
```

Check for skew: df.groupBy("key").count().orderBy("count", ascending=False).show().
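If the skew check reveals one hot key, a common mitigation (not covered in the fix above, and sketched here in plain Python rather than Spark column expressions) is key salting: spread the hot key across several artificial shuffle keys, aggregate, then strip the salt.

```python
import random

def salt_key(key: str, n_salts: int = 8) -> str:
    # One hot key becomes up to n_salts distinct shuffle keys.
    return f"{key}_{random.randrange(n_salts)}"

def unsalt_key(salted: str) -> str:
    # Recover the original key after the salted pre-aggregation.
    return salted.rsplit("_", 1)[0]

# A skewed key now spreads across at most 8 partitions:
salted = {salt_key("hot_customer") for _ in range(1000)}
assert len(salted) <= 8
assert all(unsalt_key(s) == "hot_customer" for s in salted)
```

In Spark itself you would build the salted key with column expressions (e.g. F.concat plus F.rand) and aggregate twice: once on the salted key, then once on the original.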
PostgreSQL
Queries slow on large tables (45+ seconds)
Fix checklist:
- EXPLAIN ANALYZE your query — find the sequential scans
- Add indexes on columns in WHERE, JOIN, and ORDER BY
- Use VACUUM ANALYZE on the table
- For repeated aggregations: add a materialized view
```sql
-- Example: slow dashboard query fix
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders(created_at DESC);
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id);

-- Materialized view for expensive aggregation
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT DATE(created_at) AS day, SUM(amount) AS revenue
FROM orders GROUP BY 1;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX idx_mv_daily_revenue_day ON mv_daily_revenue (day);

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue;
```

FATAL: too many connections
Cause: Connection pool exhausted.
Fix: Use PgBouncer for connection pooling. For docker-compose setups, add pgbouncer as a service. Postgres’s default max_connections is 100, and each Airflow worker opens its own connections.
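A minimal sketch of that compose service, assuming the community edoburu/pgbouncer image and illustrative credentials (verify env var names against the image you actually pick):

```yaml
services:
  pgbouncer:
    image: edoburu/pgbouncer        # assumption: community-maintained image
    environment:
      DB_HOST: postgres
      DB_USER: airflow              # illustrative credentials
      DB_PASSWORD: airflow
      POOL_MODE: transaction        # reuse server connections aggressively
      MAX_CLIENT_CONN: 500
      DEFAULT_POOL_SIZE: 20
    ports:
      - "6432:5432"
```

Then point clients (e.g. Airflow’s sql_alchemy_conn) at port 6432 instead of Postgres directly.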
Kafka / Streaming
Consumer lag keeps growing
Causes:
- Consumer is slower than producer (most common)
- Too few partitions — consumers can’t parallelize
- Message processing is blocking
Fixes:
- Increase partition count: kafka-topics.sh --alter --topic your-topic --partitions 12
- Use batching in the consumer: max_poll_records=500
- Move slow processing to async workers
Messages processed twice / duplicate processing
Cause: Consumer crashes after processing but before committing offset.
Fix: Make processing idempotent. Use enable.auto.commit=false and commit only after successful processing + write.
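A minimal pure-Python sketch of that pattern (the broker is simulated; with a real client such as kafka-python you would set enable_auto_commit=False and call consumer.commit() only after the write succeeds):

```python
processed_ids = set()   # in production: a table or unique key in the sink DB
results = []

def process(message):
    """Handle at-least-once delivery safely, keyed by a stable message id."""
    if message["id"] in processed_ids:
        return                         # duplicate delivery: safe no-op
    results.append(message["payload"].upper())
    processed_ids.add(message["id"])   # record success together with the write

# A crash after processing "a" but before the offset commit redelivers it:
stream = [
    {"id": "a", "payload": "x"},
    {"id": "a", "payload": "x"},   # redelivered duplicate
    {"id": "b", "payload": "y"},
]
for msg in stream:
    process(msg)
    # commit the offset here, only after process() succeeded

assert results == ["X", "Y"]   # "a" was processed exactly once
```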
Terraform
State lock error: Error acquiring the state lock
Cause: Previous terraform apply was interrupted; DynamoDB lock not released.
Fix:
```sh
terraform force-unlock <LOCK_ID>
```

Get the lock ID from the error message. Only run this if you’re sure no other apply is running.
Error: Provider produced inconsistent result after apply
Cause: Provider bug or resource drift between plan and apply.
Fix: Run terraform refresh then terraform plan again. If it persists, check provider version and pin it.
S3 403 Forbidden despite correct IAM role
Section titled “S3 403 Forbidden despite correct IAM role”Checklist:
- Check the bucket policy — it may explicitly deny even with IAM allow
- Check if bucket has Block Public Access settings that override policy
- Verify the role is attached to the right resource (EC2 instance, Lambda, etc.)
- S3 uses account-level Block Public Access — check at account level too
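On the first point: an explicit Deny in a bucket policy always wins over an IAM Allow. A hypothetical policy like this 403s any non-TLS request, even from an admin role (bucket name and condition are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    }
  ]
}
```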
Lambda timeout on large data processing
Fix: For large files, use S3 Select to filter before reading, or trigger an ECS/Fargate task for heavy processing. Lambda max timeout is 15 minutes.
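When the work can’t be offloaded, another pattern (a sketch, not from this guide) is to budget against context.get_remaining_time_in_millis(), which is the real Lambda context API, and re-queue whatever doesn’t fit. FakeContext below only simulates the clock for local testing.

```python
SAFETY_MARGIN_MS = 30_000  # leave 30s to checkpoint and exit cleanly

def process_items(items, context):
    """Process as many items as the time budget allows; return leftovers."""
    done = []
    for i, item in enumerate(items):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return done, items[i:]   # re-queue the remainder (e.g. via SQS)
        done.append(item * 2)        # stand-in for real work
    return done, []

class FakeContext:
    """Local stand-in for the Lambda context object."""
    def __init__(self, budget_ms):
        self.budget_ms = budget_ms
    def get_remaining_time_in_millis(self):
        self.budget_ms -= 10_000     # pretend each iteration costs 10s
        return self.budget_ms

done, leftover = process_items([1, 2, 3, 4, 5], FakeContext(60_000))
assert done == [2, 4, 6]        # processed within budget
assert leftover == [4, 5]       # handed back before the timeout hits
```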
Contributing
See something that should be here? Drop it in #troubleshooting on Discord or open a PR.
Format: Symptom → Cause → Fix. Include the actual error message when possible.