Talk:Apache Spark

I found the script below to work better than the one posted on the Spark page. There are only a couple of modifications.

  • Added the -s option to the hostname command. This makes hostname return the short form of the hostname (e.g. cdr355), which seems to work better with the srun -x option (see the short sketch after this list).
  • Omitted the srun stop-slave.sh call. It appears that srun waits for the previous srun command, which started the slaves, to complete before issuing this one. This means that if the job finishes early, it still waits out the total allocated time before being killed by the scheduler, which could greatly increase someone's account billing if they are overly conservative with their run times. Instead I simply let Slurm stop the Spark slaves when the end of the script is reached. I have checked that no processes are left on the nodes after these jobs complete. In addition, running stop-slave.sh in the original version doesn't help anyway, so omitting it does not lose any functionality.
  • Removed the export -n HOSTNAME; it didn't appear to have any effect.
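
To make the first point concrete, here is a small sketch (run on a compute node; the node name and domain are only examples) of the difference between the two hostname forms and how the short one is used to exclude the master's node from the worker step:

# Short form, as used with srun -x below:
hostname -s     # e.g. cdr355
# Fully qualified form, as used in the spark:// master URL:
hostname -f     # e.g. cdr355.<cluster domain>

# Excluding the short name starts the remaining tasks on every
# allocated node except the one running the master:
srun -x $(hostname -s) -n $((SLURM_NTASKS - 1)) hostname -s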


File : pyspark_submit.sh

#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=01:00:00
#SBATCH --nodes=3
#SBATCH --mem=2000M
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=multi-node-spark
#SBATCH --output=%x-%j.out

module load spark

# Identify this Spark cluster by the Slurm job ID and keep worker
# scratch files in the node-local temporary directory.
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_WORKER_DIR=$SLURM_TMPDIR

# Start the Spark master on the node running the batch script.
start-master.sh

(
# Keep start-slave.sh in the foreground so the srun step lasts as long
# as the workers do.
export SPARK_NO_DAEMONIZE=1;

# Start one worker on every allocated node except the master's node.
srun -x $(hostname -s) -n $((SLURM_NTASKS - 1)) --label --output=$SPARK_LOG_DIR/spark-$SPARK_IDENT_STRING-workers.out \
     start-slave.sh -m ${SLURM_MEM_PER_NODE}M -c ${SLURM_CPUS_PER_TASK} spark://$(hostname -f):7077
) &

# Run the example Pi calculation against the cluster.
spark-submit --executor-memory ${SLURM_MEM_PER_NODE}M $SPARK_HOME/examples/src/main/python/pi.py 100000

stop-master.sh
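
For completeness, the job is submitted like any other Slurm batch script; assuming it is saved under the name shown above, that would be:

sbatch pyspark_submit.sh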


When I run jobs with the above script, or the one on the main page, I nearly always get this error:

slurmstepd: error: Exceeded step memory limit at some point.

In fact, I am not aware of any job that did not produce this error. I checked memory usage with:

$ sacct -j 1681245
    JobIDRaw      State              Submit               Start AllocNodes  AllocCPUS     MaxRSS  MaxVMSize  Partition    Account   Priority
------------ ---------- ------------------- ------------------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
1681245       COMPLETED 2017-10-10T10:50:08 2017-10-10T10:50:09          3          3                       cpubase_b+ def-cgero+     408632
1681245.bat+  COMPLETED 2017-10-10T10:50:09 2017-10-10T10:50:09          1          1    451480K  10682800K            def-cgero+
1681245.0     CANCELLED 2017-10-10T10:50:12 2017-10-10T10:50:12          2          2      1944K    371196K            def-cgero+

but it appears MaxRSS is well below the requested 2000M per node, with about 0.43 GB for the master and ~1.9 MB for the slaves, so I am a bit puzzled by this message. It is possible there was a short-lived spike in memory usage that is not captured in MaxRSS, since Slurm only polls memory at some interval (something like 15 seconds). Perhaps such a spike is what the error message refers to, but I know of no way to be sure. According to this Slurm forum post, the error message indicates that some amount of memory is being moved to swap, so it may point to a performance issue, but it seems unlikely to affect the actual results. In my test cases the results appear fine.
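
If a short-lived spike is indeed to blame, one way to test that (a sketch only; I have not checked what minimum sampling interval the cluster configuration allows) would be to ask Slurm to sample task accounting more often than the default and then look just at the memory fields for each step:

# In the job script, alongside the other #SBATCH directives: sample
# task accounting (including memory) every 5 seconds.
#SBATCH --acctg-freq=task=5

# After the job, show only the memory-related fields per step.
sacct -j 1681245 --format=JobID,State,MaxRSS,MaxVMSize,MaxRSSNode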

Chris Geroux (talk) 18:07, 10 October 2017 (UTC)

Note that there is no Spark/2.2.0 module available on Cedar/Graham currently, only 2.1.0 and 2.1.1, so the submit script as-is fails. Additionally, it seems the shutdown doesn't work correctly. When srun stop-slave.sh is called, it sits there for quite some time, about 7 minutes or so, and then I get "srun: Job step creation temporarily disabled, retrying". According to this page: https://bugs.schedmd.com/show_bug.cgi?id=119 it seems possible that this occurs because the previous srun on the slaves hasn't finished. I suppose this might be true if Slurm somehow counts that step as still running because of the processes that start-slave.sh launched, which wouldn't have finished, even though the start-slave.sh script itself should have finished. I have tried the -O option, which is supposed to allow oversubscription, but this may only apply within a single srun call rather than across srun commands. It may require manually sshing to the nodes and running start-slave.sh there; a list of node names is probably in $SLURM_JOB_NODELIST (see the sketch below). I will try this out next week.
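
As a rough sketch of that last idea (untested; it assumes the allocated nodes allow ssh from the node running the batch script and that the module command works in a non-interactive ssh session), the node list could be expanded and the workers launched over ssh instead of with srun:

# Expand the compact node list (e.g. cdr[355-357]) into one name per line.
NODES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
MASTER_URL="spark://$(hostname -f):7077"

for node in $NODES; do
    # Skip the node running the master; start a worker everywhere else.
    if [ "$node" != "$(hostname -s)" ]; then
        # start-slave.sh daemonizes by default, so each ssh returns quickly.
        ssh "$node" "module load spark; start-slave.sh -m ${SLURM_MEM_PER_NODE}M -c ${SLURM_CPUS_PER_TASK} $MASTER_URL"
    fi
done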

Chris Geroux (talk) 18:12, 15 September 2017 (UTC)