Intel AI DevCloud: qsub PBS Project name -P parameter error

This topic contains 6 replies, has 2 voices, and was last updated by  laserbled 1 week, 5 days ago.

  • #8616

    laserbled
    Participant

    qsub: submit error (Bad UID for job execution MSG=User ‘u22724’ is attempting to submit a proxy job for user ‘u22724’ but is not a manager)

    I am trying to run a job with the following conf file :

    command qsub -V -v PATH -S /bin/bash -P u22724 -l mem=4G
    option mem=* -l mem=$0
    option mem=0 # Do not add anything to qsub_opts
    option num_threads=* -l ncpus=$0
    option num_threads=1 # Do not add anything to qsub_opts

    Here I am not sure what the -P parameter should be. Where can I get the project name that, as I understand it, is required to pass to qsub?

    #8617

    Andrey
    Keymaster

    You don’t need the -P argument. Please see https://access.colfaxresearch.com/?p=compute for instructions and examples of job submission.

    #8618

    laserbled
    Participant

    My job uses the -J parameter, which forces me to also pass -P. If I run it without -P, qsub reports:
    [The -J option can only be used in conjunction with -P]

    Following is the error Log without P:

    pbs.pl: error submitting jobs to queue (return status was 512)
    queue log file is exp/make_mfcc/train/q/make_mfcc_train.log,

    command was qsub -v PATH -S /bin/bash -l mem=4G -o exp/make_mfcc/train/q/make_mfcc_train.log -l mem=4G -J 1-10 /home/u22724/experiments/kaldi/egs/timit/s5/exp/make_mfcc/train/q/make_mfcc_train.sh >>exp/make_mfcc/train/q/make_mfcc_train.log 2>&1

    [The -J option can only be used in conjunction with -P]

    usage: qsub [-a date_time] [-A account_string] [-b secs]
    [-c [ none | { enabled | periodic | shutdown |
    depth=<int> | dir=<path> | interval=<minutes>}… ]
    [-C directive_prefix] [-d path] [-D path]
    [-e path] [-h] [-I] [-j oe|eo|n] [-k {oe}] [-l resource_list] [-m n|{abe}]
    [-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user [-J <jobid]]
    [-q queue] [-r y|n] [-S path] [-t number_to_submit] [-T type] [-u user_list]
    [-w] path
    [-W additional_attributes] [-v variable_list] [-V ] [-x] [-X] [-z] [script]

    #8621

    Andrey
    Keymaster

    It seems that our version of “qsub” does not support the -J argument. I am not sure what you are trying to accomplish with this. Perhaps you need to submit a job array? Then the -t argument may work. See this page for the syntax of our version of qsub: http://docs.adaptivecomputing.com/torque/5-1-3/help.htm#topics/torque/commands/qsub.htm
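
    For reference, a minimal sketch of an array submission with Torque's -t instead of PBS Pro's -J (the script name is illustrative). Inside the job, each task can read its index from $PBS_ARRAYID; the fallback to 1 below is only for running the script standalone:

    ```shell
    # Submit 10 array tasks with Torque's -t (replaces PBS Pro's -J 1-10):
    #   qsub -V -S /bin/bash -l mem=4G -t 1-10 make_mfcc_train.sh
    #
    # Inside the job script, pick up the array task index; Torque sets
    # PBS_ARRAYID per task, and the :-1 default lets a quick local run work.
    TASK_ID="${PBS_ARRAYID:-1}"
    echo "processing split $TASK_ID"
    ```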

    #8627

    laserbled
    Participant

    Yes, thank you. -t worked for me. I am now able to run the job, but I found it slower than running it on a single node using qsub in interactive mode.

    I have a query regarding the number of jobs. I understand that nodes=5:ppn=2 is the limit for the AI DevCloud.

    So does that mean I can only run 2 cores per node, totaling 10 overall?

    I believe that if I run the same job inside a single node, I get access to 24 cores.

    Is my understanding wrong?

    #8628

    Andrey
    Keymaster

    Each compute node in the AI DevCloud has 12 physical cores (which is 24 logical CPUs due to 2-way hyper-threading). Each node is logically partitioned in the resource management system into 2 “seats”, so each seat is 6 physical cores. When you access the cloud via the Jupyter Hub, you get one “seat”. When you submit jobs to the queue, you have to grab both seats in each node so that you don’t have “roommates”, so ppn=2 gets you all 12 physical cores in each node.
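
    As a sanity check on the arithmetic above (a sketch; the node and seat counts are the ones quoted in this reply, and the job script name is a placeholder):

    ```shell
    # A whole-node batch request on the AI DevCloud would look like:
    #   qsub -l nodes=5:ppn=2 my_job.sh   # my_job.sh is a placeholder
    #
    # With 2 seats per node and 6 physical cores per seat:
    NODES=5
    SEATS_PER_NODE=2
    CORES_PER_SEAT=6
    TOTAL=$((NODES * SEATS_PER_NODE * CORES_PER_SEAT))
    echo "physical cores claimed: $TOTAL"   # 5 nodes x 12 cores each
    ```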

    As to why your non-interactive jobs run slower than in interactive mode: the number of cores per node is not to blame, because it is the same in both cases. A lot of other things can go wrong, but my first guess would be that by using 5 nodes in non-interactive mode instead of 1 node in interactive mode, your application is slowing down due to too much communication. As of today, in the AI DevCloud, the interconnect between the nodes is only a 1 GbE network. So multi-node runs are mostly useful for code validation, but in many cases you cannot expect a speedup from using more than 1 node. If you are using Intel MPI for communication, you can measure communication relative to computation by using the “-mps” argument to “mpirun”.
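
    For example, such a profiling run with the -mps flag might look like this (a sketch; the rank count and executable name are placeholders, and Intel MPI must be on the PATH):

    ```shell
    # Collect an MPI Performance Snapshot (communication vs. computation)
    # for a 12-rank run; ./my_app is a placeholder for your MPI binary.
    mpirun -mps -n 12 ./my_app
    ```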

    At the same time, if you can take advantage of multiple independent single-node jobs, that is the ideal way to use the AI DevCloud.

    #8630

    laserbled
    Participant

    Great :). That explained a lot of things neatly.

    It’s unfortunate that we can’t make full use of the cluster for multi-node jobs because of the network bottleneck.

    I will stick to single-node executions. I was under the impression that we would be able to run jobs faster in a cluster configuration.

    Is there any chance we can choose a higher network configuration to measure the performance? I read somewhere that a few systems have fiber-optic connectivity. Is there any way to know which of them have it?
