
Create article for "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory"

Open JoshRosen opened this issue 11 years ago • 37 comments

We should create an article for "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory", since that seems to be a common issue.

JoshRosen avatar Nov 20 '14 01:11 JoshRosen

One cause of this error is network connectivity issues between the master and driver, so maybe we should also add a note to https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/connectivity_issues.md

JoshRosen avatar Nov 20 '14 04:11 JoshRosen

Just got bitten by an EXECUTOR_MEMORY environment variable that another person had set to a large value. I was using spark-submit with --executor-memory 3G, but the env var took precedence.

Do you think the explicit argument should take precedence?

jkleckner avatar Jan 16 '15 03:01 jkleckner

Hi @jkleckner,

We've deprecated most environment variables in favor of the newer configuration mechanisms, so system properties and SparkSubmit / SparkConf settings are intended to take precedence over environment variables. Which version of Spark are you using? Do you have a simple reproduction for this issue? If so, do you mind filing a JIRA ticket and linking it here? https://issues.apache.org/jira/browse/SPARK

JoshRosen avatar Jan 18 '15 23:01 JoshRosen
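For reference, a minimal sketch of setting executor memory explicitly in code rather than relying on environment variables, assuming a hypothetical app name; the same property can also be passed to spark-submit as --executor-memory 3g or --conf spark.executor.memory=3g:

    import org.apache.spark.{SparkConf, SparkContext}

    object MemoryConfigExample {               // hypothetical app/object name
      def main(args: Array[String]): Unit = {
        // Values set on SparkConf (or via spark-submit flags) are intended to
        // take precedence over legacy environment variables such as
        // SPARK_EXECUTOR_MEMORY, per the comment above.
        val conf = new SparkConf()
          .setAppName("MemoryConfigExample")
          .set("spark.executor.memory", "3g")
        val sc = new SparkContext(conf)
        try {
          println(sc.parallelize(1 to 100).sum()) // trivial job to exercise the executors
        } finally {
          sc.stop()
        }
      }
    }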

Sorry, I found that someone else had explicitly programmed environment vars to override config values....

jkleckner avatar Jan 19 '15 03:01 jkleckner

Sorry, I found that someone else had explicitly programmed environment vars to override config values....

You mean in your own application / user-code, you have code that reads from the environment variable and uses it to set the corresponding SparkConf setting, or something like that?

JoshRosen avatar Jan 19 '15 19:01 JoshRosen

Yes, in our programming someone intentionally made it work that way. Obviously I will be changing that. So to quote Emily Litella, never mind...

jkleckner avatar Jan 19 '15 22:01 jkleckner

I have been facing this error for 4 days and no one seems to be able to figure out a fix for it. Could you please suggest something? I reduced my input data size from 1 TB to 1 GB to 10 simple records and I still get the same error, which makes me believe the error occurs at request time rather than at execution time.

deepujain avatar Mar 01 '15 13:03 deepujain

@deepujain, if you are using YARN, bring up the Applications page (states NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING) by browsing to port 9026 on the master node (the port AWS EMR uses, though it can vary), e.g. http://127.0.0.1:9026/cluster. Examine the nodes and the queues to see whether an old zombie application is still hanging around. If so, kill it with:

    yarn application -list
    yarn application -kill <application-id>

Some situations can lead to old jobs hanging around and using up resources.

jkleckner avatar Mar 01 '15 16:03 jkleckner

Thanks so much for opening this issue!

I was having trouble setting up a Spark-on-Mesos dev environment for the last few days and had made zero headway until I set spark.mesos.coarse to true and then lowered spark.executor.memory below the 512m default (running on m1.smalls on EC2 here). I couldn't even finish running bin/run-example SparkPi 10 and was ready to give up until I saw this.

rodriguezsergio avatar Mar 06 '15 22:03 rodriguezsergio
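For reference, a minimal sketch of the configuration described in the comment above, assuming a hypothetical Mesos master URL and app name; spark.executor.memory has to fit inside what each Mesos slave can actually offer, otherwise no resource offer is ever accepted and the warning repeats:

    import org.apache.spark.{SparkConf, SparkContext}

    object MesosSmallMemoryExample {                 // hypothetical app/object name
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("MesosSmallMemoryExample")
          .setMaster("mesos://master-host:5050")     // hypothetical Mesos master URL
          .set("spark.mesos.coarse", "true")         // coarse-grained mode: one long-lived executor per node
          .set("spark.executor.memory", "256m")      // below the old 512m default so it fits a small instance's offer
        val sc = new SparkContext(conf)
        try {
          println(sc.parallelize(1 to 1000).count()) // trivial job to exercise the executors
        } finally {
          sc.stop()
        }
      }
    }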

This is excellent. I actually had a zombie Mesos Spark app, killed that, and now I am back in business--well done, guys!

hokiegeek2 avatar Mar 21 '15 10:03 hokiegeek2

@hokiegeek2 glad you found it.

Recently I found that Spark jobs could hang because exceptions didn't propagate up to an exit, so I added the snippet below. Now the testing process doesn't leave bodies strewn across the cluster...

    try {
      Foo.runAnalysis(sc, debug = true)
    } catch {
      case e: Exception =>
        println(e)   // surface the failure instead of swallowing it
        sc.stop()    // release the application's resources on the cluster
        sys.exit(1)  // exit non-zero so the caller sees the failure
    }

jkleckner avatar Mar 22 '15 16:03 jkleckner

+1

gdubicki avatar Jun 08 '15 10:06 gdubicki

+1

zheolong avatar Dec 09 '15 14:12 zheolong

@rodriguezsergio I now have the same issue with Spark on Mesos. When I create a task I get:

16/01/22 15:31:25 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

My driver runs only on the master node, and so does the executor.

xuedihualu avatar Jan 22 '16 07:01 xuedihualu

I have a similar problem.

When I run the code in the spark-shell, it works just fine.

However, similar code written in Eclipse and then deployed to the Spark master fails (no resources are assigned).

I've posted a stackoverflow post about this.

Thanks

vinesinha avatar Feb 26 '16 22:02 vinesinha

Does it not allow multiple applications to run in parallel? After I exit one, the problem disappears.

ghost avatar May 18 '16 13:05 ghost

@deepujain, if you are using YARN, bring up the Applications page (states NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING) by browsing to port 9026 on the master node (the port AWS EMR uses, though it can vary), e.g. http://127.0.0.1:9026/cluster. Examine the nodes and the queues to see whether an old zombie application is still hanging around. If so, kill it with yarn application -list and yarn application -kill <application-id>. Some situations can lead to old jobs hanging around and using up resources.

I'm using YARN, and in my case the cluster is idle with all resources free to be assigned...

We use Azkaban to enqueue a long list of processes as EMR Steps (each with one spark-submit job). I launch my queue one day and, when I return to the office the next day, I find that some of the jobs have completed and one of them is stopped, having waited 15 hours to receive resources from YARN. There is no other YARN process going on at the moment. All jobs request the same amount of resources.

Then I kill the queue, relaunch it, and the same job that had been waiting runs without problems...

Any ideas?

ggalmazor avatar Jun 14 '16 05:06 ggalmazor

Check how many cores your cluster's worker nodes have; your application can't exceed that. For example, say you have two worker nodes with 4 cores each and 2 applications to run. You can give each application 4 cores so both can run. You can set this in code: SparkConf sparkConf = new SparkConf().setAppName("JianSheJieDuan").set("spark.cores.max", "4");

yomige avatar Jun 17 '16 03:06 yomige
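A short sketch of the core-capping suggestion above, in Scala; the app name is taken from the comment and the cluster sizes are its example numbers:

    import org.apache.spark.{SparkConf, SparkContext}

    object CoresCapExample {                      // hypothetical object name
      def main(args: Array[String]): Unit = {
        // With 2 workers x 4 cores, an application that grabs all 8 cores
        // starves any second application. Capping spark.cores.max at 4
        // leaves room for two applications to run side by side.
        val conf = new SparkConf()
          .setAppName("JianSheJieDuan")
          .set("spark.cores.max", "4")
        val sc = new SparkContext(conf)
        try {
          println(sc.parallelize(1 to 1000).count())
        } finally {
          sc.stop()
        }
      }
    }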

Thanks @iwwenbo. When this happens, there are enough memory and cores for the task. We have determined that the problem is triggered by an exception in the worker container that Spark is unable to recover from. This is the stack trace:

16/06/16 13:58:53 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://[email protected]:36230/user/CoarseGrainedScheduler
java.lang.NullPointerException
    at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273)
    at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273)
    at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313)
    at java.lang.String.valueOf(String.java:2994)
    at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
    at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125)
    at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127)
    at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

This error prevents the worker from registering with the driver, and everything stalls. We're running everything on a single node.

ggalmazor avatar Jun 20 '16 07:06 ggalmazor

@xuedihualu I have the same problem. Have you solved it?

ToniYang avatar Jun 26 '16 04:06 ToniYang

@ggalmazor I have the same problem. It seems just one slave machine out of 4 does the entire job, and I don't know where this happens. When I took that particular machine out (shut it down), I couldn't even run the shell (pyspark --master yarn-client). How did you fix this?

fchgithub avatar Aug 12 '16 22:08 fchgithub

@fchgithub we haven't solved it yet. We are currently running a crontab'ed script that detects these failures and forces the termination of the YARN applications.

ggalmazor avatar Aug 22 '16 10:08 ggalmazor

+1

clrke avatar Sep 28 '16 07:09 clrke

Ran into the same problem; any solutions yet?

nvdhaider avatar Nov 03 '16 22:11 nvdhaider

@ToniYang Hi, in my case it was simply a lack of available memory!

xuedihualu avatar Nov 04 '16 01:11 xuedihualu

My problem was caused by confusion about how to start up Spark. When I start it in master mode, I should also start at least one slave (by running sbin/start-slave.sh) so that a worker is available to contribute CPU cores and memory; otherwise this error appears.

For each worker I assigned 4 CPU cores (by exporting SPARK_WORKER_CORES in conf/spark-env.sh) and 10g of memory (SPARK_WORKER_MEMORY), and everything's OK.

Just for reference.

alexwwang avatar Nov 04 '16 02:11 alexwwang

Similar issue. I have sufficient resources (cores and memory), but the resource manager (YARN) is not able to execute my job. I suspect it is due to a worker not being registered.

harishmaiya avatar May 01 '17 01:05 harishmaiya

Hi, I am facing much the same issue. I am deploying PredictionIO on a multi-node cluster where training should happen on the worker node. The worker node has been successfully registered with the master.

Following are the logs after starting slaves.sh:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    18/05/22 06:01:44 INFO Worker: Started daemon with process name: 2208@ip-172-31-6-235
    18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for TERM
    18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for HUP
    18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for INT
    18/05/22 06:01:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    18/05/22 06:01:44 INFO SecurityManager: Changing view acls to: ubuntu
    18/05/22 06:01:44 INFO SecurityManager: Changing modify acls to: ubuntu
    18/05/22 06:01:44 INFO SecurityManager: Changing view acls groups to:
    18/05/22 06:01:44 INFO SecurityManager: Changing modify acls groups to:
    18/05/22 06:01:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
    18/05/22 06:01:44 INFO Utils: Successfully started service 'sparkWorker' on port 45057.
    18/05/22 06:01:44 INFO Worker: Starting Spark worker 172.31.6.235:45057 with 8 cores, 24.0 GB RAM
    18/05/22 06:01:44 INFO Worker: Running Spark version 2.1.1
    18/05/22 06:01:44 INFO Worker: Spark home: /home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6
    18/05/22 06:01:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
    18/05/22 06:01:45 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://172.31.6.235:8081
    18/05/22 06:01:45 INFO Worker: Connecting to master ip-172-31-5-119.ap-southeast-1.compute.internal:7077...
    18/05/22 06:01:45 INFO TransportClientFactory: Successfully created connection to ip-172-31-5-119.ap-southeast-1.compute.internal/172.31.5.119:7077 after 19 ms (0 ms spent in bootstraps)
    18/05/22 06:01:45 INFO Worker: Successfully registered with master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077

Now the issues:

  1. If I launch one slave on the master node and one slave on my other node:
     1.1 If the slave on the master node is given fewer resources, I get an "unable to re-shuffle" error.
     1.2 If I give more resources to the worker on the master node, all of the execution happens on the master node; nothing is sent to the slave node.
  2. If I do not start a slave on the master node:
     2.1 I get the following error: [WARN] [TaskSchedulerImpl] Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have assigned 24 GB of RAM and 8 cores to the worker.

However, when I start the process, these are the logs I get on the slave machine:

    18/05/22 06:16:00 INFO Worker: Asked to launch executor app-20180522061600-0001/0 for PredictionIO Training: com.actionml.RecommendationEngine
    18/05/22 06:16:00 INFO SecurityManager: Changing view acls to: ubuntu
    18/05/22 06:16:00 INFO SecurityManager: Changing modify acls to: ubuntu
    18/05/22 06:16:00 INFO SecurityManager: Changing view acls groups to:
    18/05/22 06:16:00 INFO SecurityManager: Changing modify acls groups to:
    18/05/22 06:16:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
    18/05/22 06:16:00 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" "./:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/conf/:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/jars/*" "-Xmx4096M" "-Dspark.driver.port=45049" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:45049" "--executor-id" "0" "--hostname" "172.31.6.235" "--cores" "8" "--app-id" "app-20180522061600-0001" "--worker-url" "spark://[email protected]:45057"
    18/05/22 06:16:50 INFO Worker: Asked to kill executor app-20180522061600-0001/0
    18/05/22 06:16:50 INFO ExecutorRunner: Runner thread for executor app-20180522061600-0001/0 interrupted
    18/05/22 06:16:50 INFO ExecutorRunner: Killing process!
    18/05/22 06:16:51 INFO Worker: Executor app-20180522061600-0001/0 finished with state KILLED exitStatus 143
    18/05/22 06:16:51 INFO Worker: Cleaning up local directories for application app-20180522061600-0001
    18/05/22 06:16:51 INFO ExternalShuffleBlockResolver: Application app-20180522061600-0001 removed, cleanupLocalDirs = true

Can somebody help me debug the issue? Thanks!

umesh1989 avatar May 22 '18 06:05 umesh1989

+1

namangt68 avatar Jan 12 '19 16:01 namangt68

Any update?

I submit 2 Spark jobs on a cluster with two workers, each with 4 CPUs and 14 GB of memory.

My config: driver.memory=1g, executor.memory=8g, executor.cores=2, executor.instances=1.

It's weird that sometimes the two jobs can run concurrently, but sometimes one job fails with "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory".

nianglao avatar Mar 04 '19 03:03 nianglao
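A rough back-of-the-envelope check of the configuration above, assuming YARN with the default executor memory overhead of max(384 MB, 10% of executor memory); the resource manager and overhead values are assumptions, not from the report:

    // Plain Scala, no Spark dependency: just the arithmetic.
    object ContainerSizeCheck {
      def main(args: Array[String]): Unit = {
        val executorMemoryMb = 8 * 1024                             // executor.memory = 8 GB
        val overheadMb       = math.max(384, executorMemoryMb / 10) // about 819 MB
        val containerMb      = executorMemoryMb + overheadMb        // about 9011 MB per executor container
        // A worker with roughly 14 GB available can host only one such container,
        // so the two jobs can run concurrently only when their executors land on
        // different workers; otherwise the second job waits and logs the warning above.
        println(s"Each executor container requests about $containerMb MB")
      }
    }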