Troubleshooting Library
Real errors. Real fixes. No Stack Overflow copypasta.
This is a living document. Community members contribute their war stories so the next person doesn’t lose 3 hours to the same bug.
Docker & Containers
Container exits immediately / Exited (0)
Symptom: docker-compose up runs but containers immediately stop.
Cause: No foreground process to keep the container alive.
Fix: Make sure your entrypoint/CMD runs in the foreground, not as a daemon. Example for Airflow: use airflow standalone, not airflow webserver -D.
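As an illustrative sketch (service name and image tag are assumptions, not from this guide), a docker-compose service that keeps the container alive looks like:

```yaml
services:
  airflow:
    image: apache/airflow:2.9.0   # illustrative tag
    # `airflow standalone` runs in the foreground, so PID 1 stays alive.
    # A daemonized command (`airflow webserver -D`) forks and exits,
    # and Docker reports Exited (0).
    command: airflow standalone
```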
“Permission denied” on volume mounts (Mac/Linux)
Symptom: Container writes fail with permission errors on mounted directories.
Fix: Set the correct owner in the Dockerfile:
```dockerfile
RUN chown -R 1000:1000 /opt/airflow/logs
```

Or use user: "${UID}:${GID}" in docker-compose.yml.
Postgres container keeps restarting
Cause: Usually corrupt data directory from a previous failed initialization.
Fix:
```sh
docker-compose down -v   # removes volumes too
docker-compose up
```

Warning: -v deletes all data.
Apache Airflow
DAG not appearing in UI
Causes & fixes:
- Syntax error — run python your_dag.py directly to see the error
- DAG is in the wrong folder — check dags_folder in airflow.cfg
- Scheduler hasn’t picked it up yet — wait 30s or restart the scheduler
- catchup=True with an old start_date — set catchup=False or use a recent start_date
Scheduler shows zombie tasks / tasks stuck in “running”
Cause: Workers died mid-task, leaving orphaned task instances.
Fix:
```sh
airflow tasks clear <dag_id> -s <start_date> -e <end_date> --yes
```

Or use the UI: Browse → Task Instances → filter by state “running” → Mark Failed.
ModuleNotFoundError inside Airflow tasks
Cause: Package installed in host env but not in Airflow’s Python env.
Fix: Install inside the container, or use a custom Docker image. For docker-compose:
```yaml
x-airflow-common: &airflow-common
  build:
    context: .
    dockerfile: Dockerfile.airflow
```

dbt
Database Error: column does not exist
Cause: Model references a column that doesn’t exist yet in the source or upstream model.
Fixes:
- Run dbt compile to see the full SQL before execution
- Check that your ref() points to the correct model
- Run dbt run --select <upstream_model>+ to build deps first
Tests taking forever (2+ hours)
Cause: Running all tests sequentially on large tables.
Fix:
- Use --threads 4 (or more) to parallelize
- Add WHERE filters to custom tests
- Use store_failures: true in dbt_project.yml to write failing rows to an audit table for faster debugging (note: it does not skip passing tests)
Compilation Error: depends on a node named 'X' which was not found
Cause: Model name mismatch between ref('model_name') and the actual filename.
Fix: dbt model names are case-sensitive. ref('Orders') ≠ ref('orders').
PySpark
Job runs slower than equivalent Python script
Cause: Usually too many small partitions, or data is not actually distributed.
Fix:
```python
# Check partition count
df.rdd.getNumPartitions()

# Repartition for better parallelism
df = df.repartition(200)  # rule of thumb: 2-3x num cores

# For joins, broadcast small tables
from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "key")
```

OutOfMemoryError: GC overhead limit exceeded
Cause: Executor memory too low, or data skew causing one partition to hold too much.
Fix:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()
```

Check for skew: df.groupBy("key").count().orderBy("count", ascending=False).show().
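If the skew check reveals one hot key, a common mitigation (not covered in the fix above, and sketched here in plain Python rather than Spark column expressions) is key salting: spread the hot key across several artificial shuffle keys, aggregate, then strip the salt.

```python
import random

def salt_key(key: str, n_salts: int = 8) -> str:
    # One hot key becomes up to n_salts distinct shuffle keys.
    return f"{key}_{random.randrange(n_salts)}"

def unsalt_key(salted: str) -> str:
    # Recover the original key after the salted pre-aggregation.
    return salted.rsplit("_", 1)[0]

# A skewed key now spreads across at most 8 partitions:
salted = {salt_key("hot_customer") for _ in range(1000)}
assert len(salted) <= 8
assert all(unsalt_key(s) == "hot_customer" for s in salted)
```

In Spark itself you would build the salted key with column expressions (e.g. F.concat plus F.rand) and aggregate twice: once on the salted key, then once on the original.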
PostgreSQL
Queries slow on large tables (45+ seconds)
Fix checklist:
- EXPLAIN ANALYZE your query — find the sequential scans
- Add indexes on columns in WHERE, JOIN, and ORDER BY
- Use VACUUM ANALYZE on the table
- For repeated aggregations: add a materialized view
```sql
-- Example: slow dashboard query fix
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders(created_at DESC);
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id);

-- Materialized view for expensive aggregation
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT DATE(created_at) AS day, SUM(amount) AS revenue
FROM orders GROUP BY 1;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX idx_mv_daily_revenue_day ON mv_daily_revenue (day);

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue;
```

FATAL: too many connections
Cause: Connection pool exhausted.
Fix: Use PgBouncer for connection pooling. For docker-compose setups, add pgbouncer as a service. Postgres’s default max_connections is 100, and each Airflow worker opens its own connections.
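A minimal sketch of that compose service, assuming the community edoburu/pgbouncer image and illustrative credentials (verify env var names against the image you actually pick):

```yaml
services:
  pgbouncer:
    image: edoburu/pgbouncer        # assumption: community-maintained image
    environment:
      DB_HOST: postgres
      DB_USER: airflow              # illustrative credentials
      DB_PASSWORD: airflow
      POOL_MODE: transaction        # reuse server connections aggressively
      MAX_CLIENT_CONN: 500
      DEFAULT_POOL_SIZE: 20
    ports:
      - "6432:5432"
```

Then point clients (e.g. Airflow’s sql_alchemy_conn) at port 6432 instead of Postgres directly.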
Kafka / Streaming
Consumer lag keeps growing
Causes:
- Consumer is slower than producer (most common)
- Too few partitions — consumers can’t parallelize
- Message processing is blocking
Fixes:
- Increase partition count: kafka-topics.sh --alter --topic your-topic --partitions 12
- Use batching in the consumer: max_poll_records=500
- Move slow processing to async workers
Messages processed twice / duplicate processing
Cause: Consumer crashes after processing but before committing offset.
Fix: Make processing idempotent. Use enable.auto.commit=false and commit only after successful processing + write.
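A minimal pure-Python sketch of that pattern (the broker is simulated; with a real client such as kafka-python you would set enable_auto_commit=False and call consumer.commit() only after the write succeeds):

```python
processed_ids = set()   # in production: a table or unique key in the sink DB
results = []

def process(message):
    """Handle at-least-once delivery safely, keyed by a stable message id."""
    if message["id"] in processed_ids:
        return                         # duplicate delivery: safe no-op
    results.append(message["payload"].upper())
    processed_ids.add(message["id"])   # record success together with the write

# A crash after processing "a" but before the offset commit redelivers it:
stream = [
    {"id": "a", "payload": "x"},
    {"id": "a", "payload": "x"},   # redelivered duplicate
    {"id": "b", "payload": "y"},
]
for msg in stream:
    process(msg)
    # commit the offset here, only after process() succeeded

assert results == ["X", "Y"]   # "a" was processed exactly once
```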
Terraform
State lock error: Error acquiring the state lock
Cause: Previous terraform apply was interrupted; DynamoDB lock not released.
Fix:
```sh
terraform force-unlock <LOCK_ID>
```

Get the lock ID from the error message. Only run this if you’re sure no other apply is running.
Error: Provider produced inconsistent result after apply
Cause: Provider bug or resource drift between plan and apply.
Fix: Run terraform refresh then terraform plan again. If it persists, check provider version and pin it.
S3 403 Forbidden despite correct IAM role
Section titled “S3 403 Forbidden despite correct IAM role”Checklist:
- Check the bucket policy — it may explicitly deny even with IAM allow
- Check if bucket has Block Public Access settings that override policy
- Verify the role is attached to the right resource (EC2 instance, Lambda, etc.)
- S3 uses account-level Block Public Access — check at account level too
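On the first point: an explicit Deny in a bucket policy always wins over an IAM Allow. A hypothetical policy like this 403s any non-TLS request, even from an admin role (bucket name and condition are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    }
  ]
}
```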
Lambda timeout on large data processing
Fix: For large files, use S3 Select to filter before reading, or trigger an ECS/Fargate task for heavy processing. Lambda max timeout is 15 minutes.
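When the work can’t be offloaded, another pattern (a sketch, not from this guide) is to budget against context.get_remaining_time_in_millis(), which is the real Lambda context API, and re-queue whatever doesn’t fit. FakeContext below only simulates the clock for local testing.

```python
SAFETY_MARGIN_MS = 30_000  # leave 30s to checkpoint and exit cleanly

def process_items(items, context):
    """Process as many items as the time budget allows; return leftovers."""
    done = []
    for i, item in enumerate(items):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return done, items[i:]   # re-queue the remainder (e.g. via SQS)
        done.append(item * 2)        # stand-in for real work
    return done, []

class FakeContext:
    """Local stand-in for the Lambda context object."""
    def __init__(self, budget_ms):
        self.budget_ms = budget_ms
    def get_remaining_time_in_millis(self):
        self.budget_ms -= 10_000     # pretend each iteration costs 10s
        return self.budget_ms

done, leftover = process_items([1, 2, 3, 4, 5], FakeContext(60_000))
assert done == [2, 4, 6]        # processed within budget
assert leftover == [4, 5]       # handed back before the timeout hits
```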
Contributing
See something that should be here? Drop it in #troubleshooting on Discord or open a PR.
Format: Symptom → Cause → Fix. Include the actual error message when possible.