trigger.dev

Self-hosted runs not completing

rharkor opened this issue 11 months ago • 8 comments

Provide environment information

System:
  OS: Linux 6.8 Ubuntu 24.04.1 LTS (Noble Numbat)
  CPU: (8) x64 Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
  Memory: 14.35 GB / 31.29 GB
  Container: Yes
  Shell: 5.2.21 - /bin/bash
Binaries:
  Node: 18.19.1 - /usr/bin/node
  npm: 9.2.0 - /usr/bin/npm

Describe the bug

Some tasks get stuck indefinitely in the waiting state for no apparent reason. It seems random, because the same task won't always get stuck.

(Three screenshots showing runs stuck in the waiting state.)

Reproduction repo

No idea

To reproduce

No idea how to reproduce, but maybe I am using something wrong. Here is my code:

import { prisma } from "@dimension/core-lib/src/prisma"
import { Shop, ShopCookie } from "@dimension/database-main"
import { tiktokMessages } from "@dimension/tiktok-messages"
import { transformCookie } from "@dimension/tiktok-support-messages"
import { logger, schedules, task } from "@trigger.dev/sdk/v3"

import { handleError, handleMaxDuration, maxDurationPending } from "../lib/error"
import { getActiveShops, wrapCronJob } from "../lib/utils"

const isEnabled = true
const maxDurationWarning = 1000 * 60 * 20 // 20 minutes
const name = "Process shops messages campaigns"
const cron = "0 * * * *" // Every hour

export const processShopMessagesCampaigns = task({
  id: "process-shop-messages-campaigns",
  run: async ({ shop }: { shop: Shop & { cookie: ShopCookie | null } }) => {
    /* Some DB operations */
  },
})

export const processShopsMessagesCampaigns = schedules.task({
  id: "process-shops-messages-campaigns",
  cron: isEnabled ? cron : undefined,
  run: async () => {
    const now = new Date()

    const main = async () => {
      const shops = await getActiveShops(true)
      if (!shops.length) {
        logger.log("No shops found")
        return
      }

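      // batchTriggerAndWait suspends this scheduled run until every child run
      // completes, so the parent sits in the "waiting" state in the meantime.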
      await processShopMessagesCampaigns.batchTriggerAndWait(
        shops.map((shop) => ({
          payload: { shop },
          options: {
            tags: [shop.slug],
          },
        }))
      )
    }

    const checkDuration = maxDurationPending(name, maxDurationWarning)

    await main()
  },
})

Additional information

No response

rharkor · Feb 05 '25 19:02

This is the only problem I am encountering, but it is very problematic: since I enforce a policy of non-overlapping crons, a stuck run blocks the whole process for the subsequent crons.

rharkor · Feb 06 '25 08:02

This is a self-hosted deployment

matt-aitken · Feb 13 '25 13:02

For me the worker container can't connect to the coordinator over websocket, and because of that the run gets stuck. @rharkor could you please share your compose file?

murshudov · Feb 13 '25 18:02

services:
  trigger:
    image: ghcr.io/triggerdotdev/trigger.dev:v3
    environment:
      REMIX_APP_PORT: 3000
      NODE_ENV: production
      RUNTIME_PLATFORM: docker-compose
      V3_ENABLED: true
      TRIGGER_TELEMETRY_DISABLED: 1
      INTERNAL_OTEL_TRACE_DISABLED: 1
      INTERNAL_OTEL_TRACE_LOGGING_ENABLED: 0
      POSTGRES_USER: $POSTGRES_USER
      POSTGRES_PASSWORD: $POSTGRES_PASSWORD
      POSTGRES_DB: $POSTGRES_DB
      MAGIC_LINK_SECRET: $MAGIC_LINK_SECRET
      SESSION_SECRET: $SESSION_SECRET
      ENCRYPTION_KEY: $ENCRYPTION_KEY
      PROVIDER_SECRET: $PROVIDER_SECRET
      COORDINATOR_SECRET: $COORDINATOR_SECRET
      DATABASE_URL: 'postgres://$POSTGRES_USER:$POSTGRES_PASSWORD@postgresql:5432/$POSTGRES_DB?sslmode=disable'
      DIRECT_URL: 'postgres://$POSTGRES_USER:$POSTGRES_PASSWORD@postgresql:5432/$POSTGRES_DB?sslmode=disable'
      REDIS_HOST: redis
      REDIS_PORT: 6379
      REDIS_TLS_DISABLED: true
      COORDINATOR_HOST: 127.0.0.1
      COORDINATOR_PORT: 9020
      WHITELISTED_EMAILS: ''
      ADMIN_EMAILS: $ADMIN_EMAILS
      DEFAULT_ORG_EXECUTION_CONCURRENCY_LIMIT: 300
      DEFAULT_ENV_EXECUTION_CONCURRENCY_LIMIT: 100
      DEPLOY_REGISTRY_HOST: $DEPLOY_REGISTRY_HOST
      DEPLOY_REGISTRY_NAMESPACE: $DEPLOY_REGISTRY_NAMESPACE
      REGISTRY_HOST: $DEPLOY_REGISTRY_HOST
      REGISTRY_NAMESPACE: $DEPLOY_REGISTRY_NAMESPACE
      EMAIL_TRANSPORT: $EMAIL_TRANSPORT
      FROM_EMAIL: $FROM_EMAIL
      REPLY_TO_EMAIL: $REPLY_TO_EMAIL
      SMTP_HOST: $SMTP_HOST
      SMTP_PORT: $SMTP_PORT
      SMTP_SECURE: $SMTP_SECURE
      SMTP_USER: $SMTP_USER
      SMTP_PASSWORD: $SMTP_PASSWORD
      LOGIN_ORIGIN: ${SERVICE_FQDN_TRIGGER}
      APP_ORIGIN: ${SERVICE_FQDN_TRIGGER}
      DEV_OTEL_EXPORTER_OTLP_ENDPOINT: '$SERVICE_FQDN_TRIGGER/otel'
      ELECTRIC_ORIGIN: 'http://electric:3000'
    networks:
      - trigger
    depends_on:
      postgresql:
        condition: service_healthy
      redis:
        condition: service_healthy
      electric:
        condition: service_healthy
    healthcheck:
      test: "timeout 10s bash -c ':> /dev/tcp/127.0.0.1/3000' || exit 1"
      interval: 10s
      timeout: 5s
      retries: 5

  docker-provider:
    image: ghcr.io/triggerdotdev/provider/docker:v3
    platform: linux/amd64
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    user: root
    networks:
      - trigger
    depends_on:
      trigger:
        condition: service_healthy
    environment:
      HTTP_SERVER_PORT: 9020
      PLATFORM_HOST: trigger
      PLATFORM_WS_PORT: 3000
      PLATFORM_SECRET: $PROVIDER_SECRET
      SECURE_CONNECTION: false
      COORDINATOR_HOST: 127.0.0.1
      COORDINATOR_PORT: 9020
      DOCKER_NETWORK: trigger
      REGISTRY_HOST: $DEPLOY_REGISTRY_HOST
      REGISTRY_NAMESPACE: $DEPLOY_REGISTRY_NAMESPACE
      FORCE_CHECKPOINT_SIMULATION: 0
      ENFORCE_MACHINE_PRESETS: true
      OTEL_EXPORTER_OTLP_ENDPOINT: '$SERVICE_FQDN_TRIGGER/otel'
    healthcheck:
      test:
        - CMD
        - node
        - '-e'
        - "require('http').get('http://127.0.0.1:9020/health', (r) => {if (r.statusCode !== 200) process.exit(1); else process.exit(0); }).on('error', () => process.exit(1))"
      interval: 5s

  coordinator:
    image: ghcr.io/triggerdotdev/coordinator:v3
    platform: linux/amd64
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    user: root
    networks:
      - trigger
    ports:
      - '127.0.0.1:9020:9020'
    depends_on:
      trigger:
        condition: service_healthy
    environment:
      HTTP_SERVER_PORT: 9020
      PLATFORM_HOST: trigger
      PLATFORM_WS_PORT: 3000
      PLATFORM_SECRET: $PROVIDER_SECRET
      SECURE_CONNECTION: false
      COORDINATOR_HOST: 127.0.0.1
      COORDINATOR_PORT: 9020
      REGISTRY_HOST: $DEPLOY_REGISTRY_HOST
      REGISTRY_NAMESPACE: $DEPLOY_REGISTRY_NAMESPACE
      FORCE_CHECKPOINT_SIMULATION: 0
      OTEL_EXPORTER_OTLP_ENDPOINT: '$SERVICE_FQDN_TRIGGER/otel'
    healthcheck:
      test:
        - CMD
        - node
        - '-e'
        - "require('http').get('http://127.0.0.1:9020/health', (r) => {if (r.statusCode !== 200) process.exit(1); else process.exit(0); }).on('error', () => process.exit(1))"
      interval: 5s

  electric:
    image: electricsql/electric:latest
    environment:
      DATABASE_URL: 'postgres://$POSTGRES_USER:$POSTGRES_PASSWORD@postgresql:5432/$POSTGRES_DB?sslmode=disable'
    networks:
      - trigger
    depends_on:
      postgresql:
        condition: service_healthy
    healthcheck:
      test: 'curl --fail http://127.0.0.1:3000/v1/health || exit 1'
      interval: 10s
      retries: 5
      start_period: 10s
      timeout: 10s

  redis:
    image: redis:7
    networks:
      - trigger
    healthcheck:
      test:
        - CMD-SHELL
        - 'redis-cli ping | grep PONG'
      interval: 1s
      timeout: 3s
      retries: 5
    volumes:
      - redis-data:/data

  postgresql:
    image: postgres:16-alpine
    volumes:
      - postgresql-data:/var/lib/postgresql/data/
    networks:
      - trigger
    environment:
      POSTGRES_USER: $POSTGRES_USER
      POSTGRES_PASSWORD: $POSTGRES_PASSWORD
      POSTGRES_DB: $POSTGRES_DB
    command:
      - -c
      - wal_level=logical
    healthcheck:
      test:
        - CMD-SHELL
        - 'pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DB}'
      interval: 5s
      timeout: 20s
      retries: 10

volumes:
  postgresql-data:
  redis-data:

networks:
  trigger:
    name: trigger
    external: true

murshudov · Feb 13 '25 18:02

> This is a self-hosted deployment

Sorry, forgot to mention it in the title 🙏

rharkor · Feb 13 '25 18:02

> because of this run gets stuck

I am using the app and worker separately, so I don't really know which one you want. Also worth mentioning that everything works fine 99.5% of the time.

rharkor · Feb 13 '25 18:02

> because of this run gets stuck
>
> I am using the app and worker separately, so I don't really know which one you want. Also worth mentioning that everything works fine 99.5% of the time.

I am running them on the same server, but it would be great to see any working configuration. Pulling my hair out over this for the last 3 days :)

murshudov · Feb 13 '25 18:02

> because of this run gets stuck
>
> I am using the app and worker separately, so I don't really know which one you want. Also worth mentioning that everything works fine 99.5% of the time.
>
> I am running them on the same server, but it would be great to see any working configuration. Pulling my hair out over this for the last 3 days :)

Okay, so this is my config:

docker-compose.webapp.yml

x-env: &webapp-env
  LOGIN_ORIGIN: https://${TRIGGER_DOMAIN:?Please set this in your .env file}
  APP_ORIGIN: https://${TRIGGER_DOMAIN}
  DEV_OTEL_EXPORTER_OTLP_ENDPOINT: https://${TRIGGER_DOMAIN}/otel
  ELECTRIC_ORIGIN: http://electric:3000

volumes:
  postgres-data:
  redis-data:

networks:
  default:

services:
  webapp:
    image: ghcr.io/triggerdotdev/trigger.dev:${TRIGGER_IMAGE_TAG:-v3}
    restart: ${RESTART_POLICY:-unless-stopped}
    env_file:
      - .env
    environment:
      <<: *webapp-env
    ports:
      - ${WEBAPP_PUBLISH_IP:-127.0.0.1}:3040:3030
    depends_on:
      - postgres
      - redis
    networks:
      - default

  postgres:
    image: postgres:${POSTGRES_IMAGE_TAG:-16}
    restart: ${RESTART_POLICY:-unless-stopped}
    volumes:
      - postgres-data:/var/lib/postgresql/data/
    env_file:
      - .env
    networks:
      - default
    ports:
      - ${DOCKER_PUBLISH_IP:-127.0.0.1}:5433:5432
    command:
      - -c
      - wal_level=logical

  redis:
    image: redis:${REDIS_IMAGE_TAG:-7}
    restart: ${RESTART_POLICY:-unless-stopped}
    volumes:
      - redis-data:/data
    networks:
      - default
    ports:
      - ${DOCKER_PUBLISH_IP:-127.0.0.1}:6389:6379

  electric:
    image: electricsql/electric:${ELECTRIC_IMAGE_TAG:-latest}
    restart: ${RESTART_POLICY:-unless-stopped}
    environment:
      DATABASE_URL: $DATABASE_URL
    networks:
      - default
    depends_on:
      - postgres
    ports:
      - ${DOCKER_PUBLISH_IP:-127.0.0.1}:3061:3000

docker-compose.worker.yml

x-env: &worker-env
  PLATFORM_HOST: ${TRIGGER_DOMAIN:?Please set this in your .env file}
  PLATFORM_WS_PORT: 443
  SECURE_CONNECTION: "true"
  OTEL_EXPORTER_OTLP_ENDPOINT: https://${TRIGGER_DOMAIN}/otel

networks:
  default:

services:
  docker-provider:
    image: ghcr.io/triggerdotdev/provider/docker:${TRIGGER_IMAGE_TAG:-v3}
    restart: ${RESTART_POLICY:-unless-stopped}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    user: root
    networks:
      - default
    ports:
      - ${DOCKER_PUBLISH_IP:-127.0.0.1}:9021:9020
    env_file:
      - .env
    environment:
      <<: *worker-env
      PLATFORM_SECRET: $PROVIDER_SECRET

  coordinator:
    image: ghcr.io/triggerdotdev/coordinator:${TRIGGER_IMAGE_TAG:-v3}
    restart: ${RESTART_POLICY:-unless-stopped}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    user: root
    networks:
      - default
    ports:
      - ${DOCKER_PUBLISH_IP:-127.0.0.1}:9020:9020
    env_file:
      - .env
    environment:
      <<: *worker-env
      PLATFORM_SECRET: $COORDINATOR_SECRET

rharkor · Feb 13 '25 18:02

@murshudov did you manage to solve this? I'm also having the same issue.

eth0izzle avatar May 17 '25 09:05 eth0izzle

@eth0izzle Yes, it is working for us quite well. I forgot all steps required, but can share updated compose file if needed

murshudov avatar May 17 '25 09:05 murshudov

@eth0izzle Also our deployment is on Coolify

murshudov avatar May 17 '25 09:05 murshudov

@murshudov yes also using coolify which I think where the issue is but I'm tearing my hair out. Your compose file would be much appreciated!

eth0izzle avatar May 17 '25 09:05 eth0izzle

@eth0izzle I will try my best to explain.

docker-compose.yml

services:
  trigger:
    image: ghcr.io/triggerdotdev/trigger.dev:v3
    environment:
      SERVICE_FQDN_TRIGGER_3030:
      PORT: 3030
      REMIX_APP_PORT: 3030
      NODE_ENV: production
      RUNTIME_PLATFORM: docker-compose
      V3_ENABLED: true
      TRIGGER_TELEMETRY_DISABLED: 1
      INTERNAL_OTEL_TRACE_DISABLED: 1
      INTERNAL_OTEL_TRACE_LOGGING_ENABLED: 0
      POSTGRES_USER: $POSTGRES_USER
      POSTGRES_PASSWORD: $POSTGRES_PASSWORD
      POSTGRES_DB: $POSTGRES_DB
      MAGIC_LINK_SECRET: $MAGIC_LINK_SECRET
      SESSION_SECRET: $SESSION_SECRET
      ENCRYPTION_KEY: $ENCRYPTION_KEY
      PROVIDER_SECRET: $PROVIDER_SECRET
      COORDINATOR_SECRET: $COORDINATOR_SECRET
      DATABASE_URL: 'postgres://$POSTGRES_USER:$POSTGRES_PASSWORD@postgres:5432/$POSTGRES_DB?sslmode=disable'
      DIRECT_URL: 'postgres://$POSTGRES_USER:$POSTGRES_PASSWORD@postgres:5432/$POSTGRES_DB?sslmode=disable'
      REDIS_HOST: redis
      REDIS_PORT: 6379
      REDIS_TLS_DISABLED: true
      COORDINATOR_HOST: 127.0.0.1
      COORDINATOR_PORT: 9020
      WHITELISTED_EMAILS: ''
      ADMIN_EMAILS: $ADMIN_EMAILS
      DEFAULT_ORG_EXECUTION_CONCURRENCY_LIMIT: 300
      DEFAULT_ENV_EXECUTION_CONCURRENCY_LIMIT: 100
      DEPLOY_REGISTRY_HOST: $DEPLOY_REGISTRY_HOST
      DEPLOY_REGISTRY_NAMESPACE: $DEPLOY_REGISTRY_NAMESPACE
      EMAIL_TRANSPORT: $EMAIL_TRANSPORT
      FROM_EMAIL: $FROM_EMAIL
      REPLY_TO_EMAIL: $REPLY_TO_EMAIL
      SMTP_HOST: $SMTP_HOST
      SMTP_PORT: $SMTP_PORT
      SMTP_SECURE: $SMTP_SECURE
      SMTP_USER: $SMTP_USER
      SMTP_PASSWORD: $SMTP_PASSWORD
      LOGIN_ORIGIN: ${SERVICE_FQDN_TRIGGER}
      APP_ORIGIN: ${SERVICE_FQDN_TRIGGER}
      DEV_OTEL_EXPORTER_OTLP_ENDPOINT: '$SERVICE_FQDN_TRIGGER_3030/otel'
      ELECTRIC_ORIGIN: 'http://electric:3000'
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      electric:
        condition: service_healthy
    healthcheck:
      test: "timeout 10s bash -c ':> /dev/tcp/127.0.0.1/3030' || exit 1"
      interval: 10s
      timeout: 5s
      retries: 5

  docker-provider:
    image: ghcr.io/triggerdotdev/provider/docker:v3
    platform: linux/amd64
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    user: root
    depends_on:
      trigger:
        condition: service_healthy
    environment:
      HTTP_SERVER_PORT: 9020
      PLATFORM_HOST: trigger
      PLATFORM_WS_PORT: 3030
      PLATFORM_SECRET: $PROVIDER_SECRET
      SECURE_CONNECTION: false
      COORDINATOR_HOST: coordinator
      COORDINATOR_PORT: 9020
      FORCE_CHECKPOINT_SIMULATION: 0
      DOCKER_NETWORK: $DOCKER_NETWORK
      ENFORCE_MACHINE_PRESETS: true
      OTEL_EXPORTER_OTLP_ENDPOINT: '$SERVICE_FQDN_TRIGGER_3030/otel'
    healthcheck:
      test:
        - CMD
        - node
        - '-e'
        - "require('http').get('http://127.0.0.1:9020/health', (r) => {if (r.statusCode !== 200) process.exit(1); else process.exit(0); }).on('error', () => process.exit(1))"
      interval: 5s

  coordinator:
    image: ghcr.io/triggerdotdev/coordinator:v3
    platform: linux/amd64
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    user: root
    depends_on:
      trigger:
        condition: service_healthy
    environment:
      HTTP_SERVER_PORT: 9020
      PLATFORM_HOST: trigger
      PLATFORM_WS_PORT: 3030
      PLATFORM_SECRET: $COORDINATOR_SECRET
      SECURE_CONNECTION: false
      REGISTRY_HOST: $DEPLOY_REGISTRY_HOST
      REGISTRY_NAMESPACE: $DEPLOY_REGISTRY_NAMESPACE
      FORCE_CHECKPOINT_SIMULATION: 0
      OTEL_EXPORTER_OTLP_ENDPOINT: '$SERVICE_FQDN_TRIGGER_3030/otel'
    healthcheck:
      test:
        - CMD
        - node
        - '-e'
        - "require('http').get('http://127.0.0.1:9020/health', (r) => {if (r.statusCode !== 200) process.exit(1); else process.exit(0); }).on('error', () => process.exit(1))"
      interval: 5s

  electric:
    image: electricsql/electric:latest
    environment:
      DATABASE_URL: 'postgres://$POSTGRES_USER:$POSTGRES_PASSWORD@postgres:5432/$POSTGRES_DB?sslmode=disable'
      ELECTRIC_INSECURE: true
    depends_on:
      postgres:
        condition: service_healthy
    healthcheck:
      test: 'curl --fail http://127.0.0.1:3000/v1/health || exit 1'
      interval: 10s
      retries: 5
      start_period: 10s
      timeout: 10s

  redis:
    image: redis:7
    healthcheck:
      test:
        - CMD-SHELL
        - 'redis-cli ping | grep PONG'
      interval: 1s
      timeout: 3s
      retries: 5
    volumes:
      - redis-data:/data

  postgres:
    image: postgres:16-alpine
    volumes:
      - postgres-data:/var/lib/postgresql/data/
    environment:
      POSTGRES_USER: $POSTGRES_USER
      POSTGRES_PASSWORD: $POSTGRES_PASSWORD
      POSTGRES_DB: $POSTGRES_DB
    command:
      - -c
      - wal_level=logical
    healthcheck:
      test:
        - CMD-SHELL
        - 'pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DB}'
      interval: 5s
      timeout: 20s
      retries: 10

volumes:
  postgres-data:
  redis-data:

Env vars for Coolify:

SERVICE_FQDN_TRIGGER=https://trigger-dev.example.com
SERVICE_FQDN_TRIGGER_3030=https://trigger-dev.example.com

DEPLOY_REGISTRY_HOST=docker.io
DEPLOY_REGISTRY_NAMESPACE=your docker hub user name
DOCKER_NETWORK=rkscwg0c048gc4wcs8og48o4 # This is the network name created by Coolify. You can find it at the end of your Trigger service URL: /service/rkscwg0c048gc4wcs8og48o4

POSTGRES_DB=trigger
POSTGRES_PASSWORD=your postgres password
POSTGRES_USER=trigger

[email protected]
EMAIL_TRANSPORT=smtp
[email protected]
[email protected]
SMTP_HOST=email-smtp.eu-west-2.amazonaws.com
SMTP_PORT=465
SMTP_SECURE=true
SMTP_USER=AAAAZLNAJ5X6YVUHHPPP
SMTP_PASSWORD=your smtp password

# Lengths are important! 32 chars for the three keys below, 64 chars for the coordinator and provider secrets.
MAGIC_LINK_SECRET=cccce85a1f7b9dbbbbeeeefcb234aaaa
SESSION_SECRET=cccce85a1f7b9dbbbbeeeefcb234aaaa
ENCRYPTION_KEY=cccce85a1f7b9dbbbbeeeefcb234aaaa
COORDINATOR_SECRET=oG2iIa0pkqNIu2E7Dr0hLNa7i3OnXxBbUbawz3ZoG2iIa0pkqNIu2E7Dr0hLNa7i
PROVIDER_SECRET=oG2iIa0pkqNIu2E7Dr0hLNa7i3OnXxBbUbawz3ZoG2iIa0pkqNIu2E7Dr0hLNa7i

These are the working files for us in production. It is a single-server setup.

So the issue was the following:

docker-provider creates worker containers for your tasks. But those containers get attached to the host network by default, and the coordinator can't reach them because the coordinator is inside the network created by Coolify. By specifying DOCKER_NETWORK on the docker-provider, the worker containers are attached to the Coolify network instead, so everything on that network can talk to everything else.

I may be missing some details, but everything is created inside the Coolify network by default, except the dynamically created worker containers. With the change introduced by @Mortalife, these workers also get attached to the Coolify network and everyone is happy.
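
In short, the part of the docker-provider service above that matters here is this fragment (trimmed down, not the full service definition):

  docker-provider:
    image: ghcr.io/triggerdotdev/provider/docker:v3
    environment:
      # Attach dynamically created worker containers to the Coolify network,
      # so the coordinator (which lives on that network) can reach them.
      DOCKER_NETWORK: $DOCKER_NETWORK
      # Use the coordinator service name instead of 127.0.0.1 so it resolves
      # on the shared network.
      COORDINATOR_HOST: coordinator
      COORDINATOR_PORT: 9020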

Hope this helps.

murshudov · May 17 '25 10:05

I'm going to close this, as we're going to release v4 self-hosting in the next couple of weeks. Link to the excellent (unofficial) Coolify self-hosting docs: https://github.com/Mortalife/trigger-dev-coolify

nicktrn · May 19 '25 13:05