DAGs go missing after a while

Open gabriel-attie opened this issue 1 year ago • 7 comments

Apache Airflow version

2.9.3

If "Other Airflow 2 version" selected, which one?

2.9.3

What happened?

I am running Airflow locally with a custom Dockerfile and the docker-compose file from the official URL, with some small customizations. My workflow is usually split into Extract, Transform and Load in separate DAGs, and the last task of the Extract and Transform DAGs triggers the next DAG in the flow.

My issue is that when I start developing new DAGs locally, random DAGs start to go missing from the Webserver UI. When I go into the container and run the command "airflow dags list", my DAGs are shown there (same with "airflow dags report"), but they are not present in the UI. If I run the command "airflow db init" or "airflow db migrate", the DAGs show up in the Webserver UI again for a short time (around 30 seconds) and then go missing again.
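
One way to narrow this down is to compare what the CLI parses from disk with what the REST API reports as active. The sketch below is not part of the original report. It assumes the output of `airflow dags list -o json` and the body of a `GET /api/v1/dags` response (reachable here through the basic_auth backend configured in the docker-compose file shared later in the thread) have been saved as strings. The function names are hypothetical, and the REST endpoint is paginated, so a large installation would need to collect every page first.

```python
import json


def dag_ids_from_cli(cli_json: str) -> set[str]:
    """Extract DAG ids from the JSON printed by `airflow dags list -o json`."""
    return {row["dag_id"] for row in json.loads(cli_json)}


def active_dag_ids_from_api(api_json: str) -> set[str]:
    """Extract ids of active DAGs from a `GET /api/v1/dags` response body."""
    payload = json.loads(api_json)
    # Treat a DAG without an explicit flag as active; the endpoint returns
    # only active DAGs by default, so this filter is just a safety net.
    return {d["dag_id"] for d in payload["dags"] if d.get("is_active", True)}


def missing_from_ui(cli_json: str, api_json: str) -> list[str]:
    """Return DAG ids that are parsed from disk but not reported as active."""
    return sorted(dag_ids_from_cli(cli_json) - active_dag_ids_from_api(api_json))
```

Any DAG id returned by `missing_from_ui` parses fine from disk but is not reported as active by the API, which would be consistent with the symptom described above.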

What you think should happen instead?

The DAGs should be showing in the Webserver UI.

How to reproduce

Honestly, I have no idea how to reproduce the errors, since I can't find anything in the logs.

Operating System

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)" NAME="Debian GNU/Linux" VERSION_ID="12" VERSION="12 (bookworm)" VERSION_CODENAME=bookworm ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else?

This problem seems to happen when I run "docker compose down && docker compose up -d" often while developing.

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

gabriel-attie avatar Aug 08 '24 17:08 gabriel-attie

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

boring-cyborg[bot] avatar Aug 08 '24 17:08 boring-cyborg[bot]

Thank you for reporting this issue. To help us diagnose and reproduce the problem, could you please provide:

  1. Example DAGs that you are using when the issue occurs.
  2. The custom Dockerfile you're using.
  3. The docker-compose.yml file with your customizations.
  4. Any specific steps or operations that lead to the issue. Screenshots would be helpful, if possible.

This information will help to better understand and address the problem. Thanks!

josix avatar Aug 09 '24 10:08 josix

I noticed a similar thing on my installation with version 2.9.2. It's possible that the problem has been present for some time. It definitely doesn't sound like expected behaviour. I suspect there's some kind of race condition due to very long parsing, as in my case I deal with a setup of over 2k DAGs. I'm unable to pin down the exact conditions that cause this.

The DAGs may suddenly reappear and then disappear again over the course of a day. It almost seems like the data gets removed for a short period instead of being updated in place, which creates a window in which the DAG isn't in the database, so the webserver doesn't retrieve it. From the user's side, on a setup with over 2k DAGs and a frequently running scheduler, if you spam F5 on the webserver dashboard, the number of visible DAGs changes each time.
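
To put numbers on the "changes on each refresh" observation, the visible DAG ids could be recorded at intervals and compared. The following is a minimal sketch under that assumption: each snapshot is a set of DAG ids collected by whatever means is convenient (for example, repeated REST API calls), and `flapping_dags` is a hypothetical helper, not an Airflow function.

```python
from collections import Counter


def flapping_dags(snapshots: list[set[str]]) -> dict[str, int]:
    """Count how often each DAG id appears or disappears between snapshots."""
    transitions = Counter()
    for before, after in zip(snapshots, snapshots[1:]):
        # The symmetric difference holds ids present in exactly one of the
        # two consecutive snapshots, i.e. ids that just appeared or vanished.
        for dag_id in before ^ after:
            transitions[dag_id] += 1
    return dict(transitions)
```

A DAG with a count of two or more has disappeared and come back (or the reverse), which points to flapping rather than a DAG that was simply removed once.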

dimon222 avatar Aug 10 '24 18:08 dimon222

@josix:

1 - The DAGs don't really matter, since they disappear randomly. But here is one example:

import logging

from airflow.decorators import dag, task
from airflow.models.param import Param
from airflow.operators.python import get_current_context

from common.tasks.general import trigger_another_dag
from common.tasks.teams import notify_failure
from common.services.etl import ETLService
from common.settings.car import *
from common.settings.dags import default_args
from common.settings.monitoring import UPDATE_RUNNING, WEEKLY
from common.settings.envs import START_DATE


logger = logging.getLogger("airflow.task")
etl_service = ETLService()


@dag(
    default_args=default_args,
    schedule_interval="@weekly",
    start_date=START_DATE,
    catchup=False,
    tags=["public"],
    params={
        "ignore_discrepancy": Param(False, type="boolean"),
        "emergency_mode": Param(None, type=["null", "string"]),
    },
)
def update_car_extract():
    @task(on_failure_callback=notify_failure)
    def main_run_attrs() -> dict:
        """Core function: create the monitoring instance.

        :return: dict with data to monitor this project
        """
        from common.models.car import DataModel

        context = get_current_context()

        return etl_service.start_monitoring(
            DataModel, context, UPDATE_RUNNING, WEEKLY
        )

    @task(on_failure_callback=notify_failure)
    def download_file_to_s3() -> str:
        """Core function to download files from the source directly to S3.
        It can be set to run in emergency mode by a DAG conf.

        :return: string with the S3 path to the raw data
        """
        from common.models.sema_mt_car import DataModel

        context = get_current_context()

        # Set the url to download
        source_urls = {"zip": SOURCE}
        logger.info(f"The Source URLS: {source_urls}")

        result_download = etl_service.download_file(
            context, DataModel, source_urls, use_raw=False
        )

        return result_download

    dag_conf = {
        "main_run_attrs": main_run_attrs(),
        "raw_data_zip_path": download_file_to_s3(),
    }
    trigger_another_dag(
        dag_conf,
        "trigger_transform_dag",
        "update_car_transform",
    )

update_car_extract()

2 - Dockerfile

FROM apache/airflow:2.9.3-python3.11
COPY requirements.txt /requirements.txt
RUN pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN pip install --no-cache-dir -r /requirements.txt --trusted-host pypi.org --trusted-host files.pythonhosted.org
USER root
RUN apt-get update && \
    apt-get install --allow-downgrades -y libpq5=15.6-0+deb12u1 libmariadb3=1:10.11.6-0+deb12u1
RUN apt-get install -y libgdal-dev \
    gdal-bin \
    gcc \
    g++
RUN sudo apt-get install unrar-free -y
RUN sudo pip install geopandas --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN sudo pip install --global-option=build_ext --global-option="-I/usr/include/gdal" GDAL==`gdal-config --version` --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN sudo pip install --no-cache-dir rasterio --trusted-host pypi.org --trusted-host files.pythonhosted.org
RUN apt-get clean
USER airflow

3 - docker-compose.yaml

x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: my-tag:latest
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    AIRFLOW__WEBSERVER__SHOW_TRIGGER_FORM_IF_NO_PARAMS: 'true'
    AIRFLOW__WEBSERVER__EXPOSE_CONFIG: 'true'
    AIRFLOW__CORE__DEFAULT_TIMEZONE: 'America/Sao_Paulo'
    AIRFLOW__WEBSERVER__DAG_ORIENTATION: 'TB'
    AIRFLOW__LOGGING__COLORED_CONSOLE_LOG: 'true'
    AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD: 600
    # yamllint disable rule:line-length
    # Use simple http server on scheduler for health checks
    # See https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#scheduler-health-check-server
    # yamllint enable rule:line-length
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
    # WARNING: Use _PIP_ADDITIONAL_REQUIREMENTS option ONLY for a quick checks
    # for other purpose (development, test and especially production usage) build/extend Airflow image.
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    # The following line can be used to set a custom config file, stored in the local config folder
    # If you want to use it, outcomment it and replace airflow.cfg with the name of your config file
    # AIRFLOW_CONFIG: '/opt/airflow/config/airflow.cfg'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
    - ${AIRFLOW_PROJ_DIR:-.}/common:/opt/airflow/plugins/common
    - $HOME/.aws:/home/airflow/.aws
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgis/postgis:13-3.4
    platform: linux/amd64
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins /sources/common
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins,common}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    volumes:
      - ${AIRFLOW_PROJ_DIR:-.}:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

  # You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up
  # or by explicitly targeted on the command line e.g. docker-compose up flower.
  # See: https://docs.docker.com/compose/profiles/

volumes:
  postgres-db-volume:

4 - There are no specific conditions under which the DAGs go missing. I do suspect, though, that it is related to running docker compose down and up too frequently.

At the moment I do not have screenshots showing how the DAGs go missing in the webserver. But they literally just go missing: from 16 DAGs, for example, I refresh the page (F5) and it now shows 14 DAGs.

For context: this does not occur in our dev and production environments, only in the local environment. Locally I usually have around 30 DAGs, and in production we have 300+, with code going up to 2k lines.

gabriel-attie avatar Aug 12 '24 12:08 gabriel-attie

Here I have the problem again. I have 19 DAGs in my local Airflow at the moment. 7 just went missing.

[image]

But if I run "airflow dags list" I can see all 19 DAGs: [image]

After running "airflow db migrate" the DAGs show up in the Webserver again: [image]

I have no idea how to reproduce it, but it seems it's always after I stop the containers and start them again.

gabriel-attie avatar Aug 12 '24 17:08 gabriel-attie

Hi @gabriel-attie, do you find any error in the docker container logs?

jsjasonseba avatar Aug 25 '24 18:08 jsjasonseba

Quoting @jsjasonseba: "Hi @gabriel-attie, do you find any error in the docker container logs?"

Nothing related to any missing DAGs. (2 import errors which I'm aware of - not an issue in local).
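
Whether those import errors really are unrelated can be checked mechanically. The sketch below assumes a `GET /api/v1/importErrors` response body already parsed into a dict, plus a mapping from each missing DAG id to its file location (for instance the `fileloc` field of the DAG in the REST API). The function name is hypothetical, and it only matches on exact file paths.

```python
def dags_hidden_by_import_errors(
    missing: dict[str, str], import_errors_payload: dict
) -> dict[str, str]:
    """Map each missing DAG id to the stack trace of an import error in its file."""
    # Index the reported import errors by the file they occurred in.
    errors_by_file = {
        err["filename"]: err["stack_trace"]
        for err in import_errors_payload.get("import_errors", [])
    }
    # Keep only the missing DAGs whose source file has an import error.
    return {
        dag_id: errors_by_file[fileloc]
        for dag_id, fileloc in missing.items()
        if fileloc in errors_by_file
    }
```

An empty result supports the statement that the known import errors do not explain the missing DAGs.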

gabriel-attie avatar Aug 26 '24 17:08 gabriel-attie

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions[bot] avatar Sep 10 '24 00:09 github-actions[bot]

This issue has been closed because it has not received response from the issue author.

github-actions[bot] avatar Sep 18 '24 00:09 github-actions[bot]