A: Please see the documentation section of our website for SGE basic usage & introduction (OpenOffice presentation)
Q: A node is inaccessible since it is flagged as "in Error" state. See example below. How to fix?
A: First ensure that the underlying problem (e.g. a hardware or network fault) has been solved. The E in the states column below marks the affected queue instance:
[root ~]# qstat -f|more
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@c00                      BIP   0/2       0.00     lx24-amd64
----------------------------------------------------------------------------
all.q@c02                      BIP   0/2       0.00     lx24-amd64    E
----------------------------------------------------------------------------
all.q@c03                      BIP   1/2       1.00     lx24-amd64
Issuing "qmod -c all.q@c02" will clear the error state and make the node available for further job runs.
[root@ ~]# qmod -c all.q@c02
root@clu1 changed state of "all.q@c02" (no error)
[root@ ~]# qstat -f|more
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@c00                      BIP   0/2       0.00     lx24-amd64
----------------------------------------------------------------------------
all.q@c02                      BIP   0/2       0.00     lx24-amd64
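When several queue instances are in the error state, picking them out of the listing by eye gets tedious. The following is a minimal sketch, not an official SGE tool: it assumes the six-column "qstat -f" layout shown above, and the function names list_error_queues and clear_error_queues are made up for this example. It only prints the qmod commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Print the queue instances that "qstat -f" reports with an E (error) state.
# Reads qstat -f output on stdin, so it can be tried on saved output first.
list_error_queues() {
    # Queue lines look like: name qtype used/tot. load_avg arch [states]
    # The states column is absent (5 fields) when the queue instance is healthy.
    awk 'NF >= 6 && $1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Dry run: print one "qmod -c" command per queue instance in error state.
clear_error_queues() {
    list_error_queues | while read -r q; do
        echo "qmod -c $q"
    done
}
```

Usage: run "qstat -f | clear_error_queues", check the printed commands, and pipe them to sh once the underlying problems have been fixed.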
SGE now has registered IANA port numbers, see http://gridengine.info/articles/2006/09/19/sge-gets-registered-iana-port-numbers
Please use the following section in your /etc/services:
sge_qmaster     6444/tcp    # Grid Engine Qmaster Service
sge_qmaster     6444/udp    # Grid Engine Qmaster Service
sge_execd       6445/tcp    # Grid Engine Execution Service
sge_execd       6445/udp    # Grid Engine Execution Service
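To confirm that every node carries these four entries, a check along the following lines may help. This is a sketch only: check_sge_services is a hypothetical helper name, and it looks at a plain services file, so clusters that distribute services through NIS or LDAP would need a different check.

```shell
# Report whether a services file carries the registered SGE ports.
# $1: path to the file (defaults to /etc/services).
# Returns 0 when all four entries are present, 1 otherwise.
check_sge_services() {
    file=${1:-/etc/services}
    rc=0
    for entry in "sge_qmaster 6444" "sge_execd 6445"; do
        name=${entry% *}    # text before the space
        port=${entry#* }    # text after the space
        for proto in tcp udp; do
            # Match on the first two fields only, so trailing comments
            # and commented-out lines are ignored.
            if awk -v n="$name" -v p="$port/$proto" \
                   '$1 == n && $2 == p { found = 1 } END { exit !found }' "$file"; then
                echo "ok: $name $port/$proto"
            else
                echo "missing: $name $port/$proto"
                rc=1
            fi
        done
    done
    return $rc
}
```

Usage: call check_sge_services with no argument on each node, or pass the path of a file that is about to be distributed.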
See http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
Assuming a homogeneous cluster, run qconf -mconf on your master and change the SGE defaults from:
qlogin_command   telnet
qlogin_daemon    /usr/sbin/in.telnetd
rlogin_daemon    /usr/sbin/in.rlogind
Delete these lines and add the following:
rsh_daemon       /usr/sbin/sshd -i
rlogin_daemon    /usr/sbin/sshd -i
qlogin_daemon    /usr/sbin/sshd -i
rsh_command      /usr/bin/ssh
rlogin_command   /usr/bin/ssh
qlogin_command   /var/sge/ql.sh
where ql.sh is the qlogin_wrapper script and looks like this:
#!/bin/sh
HOST=$1
PORT=$2
/usr/bin/ssh -X -p $PORT $HOST
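The wrapper above trusts its two arguments. If you prefer a variant that refuses to run when the host or port is missing or malformed, something like the following could be used instead. It is a sketch under assumptions: the function names ql_args_ok and ql_main are invented for this example, and the ssh call is the same one as in the original script.

```shell
#!/bin/sh
# Hypothetical stricter variant of ql.sh: same ssh call, plus argument checks.

# Succeeds only when given exactly a non-empty host and an all-digit port.
ql_args_ok() {
    [ $# -eq 2 ] || return 1
    [ -n "$1" ] || return 1
    case $2 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    return 0
}

# Validate the arguments, then hand over to ssh with X11 forwarding.
ql_main() {
    if ! ql_args_ok "$@"; then
        echo "usage: ql.sh HOST PORT" >&2
        return 2
    fi
    exec /usr/bin/ssh -X -p "$2" "$1"
}

# In the installed script the last line would be:  ql_main "$@"
```

The quoting around "$1" and "$2" keeps unusual host names from being split by the shell.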
Note that ql.sh must be available at the same pathname on all nodes. The new configuration becomes active as soon as it is saved. Ensure that the users' ssh key pairs and authorized_keys files have been prepared to accept passwordless logins from any node to any other node. Here's a sample session:
-sh-3.00$ source /var/sge/vmx86/common/settings.sh
-sh-3.00$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@cos43x86-c00 BIP 0/1 0.02 lx24-x86
----------------------------------------------------------------------------
all.q@cos43x86-c01 BIP 1/1 0.05 lx24-x86
43 0.55500 QLOGIN demo00 r 06/13/2006 22:41:51 1
----------------------------------------------------------------------------
all.q@cos43x86-c02 BIP 1/1 0.03 lx24-x86
45 0.55500 QLOGIN demo00 r 06/13/2006 22:42:15 1
-sh-3.00$ qlogin
Your job 46 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 46 has been successfully scheduled.
Establishing /var/sge/ql.sh session to host cos43x86-c00 ...
Last login: Mon Jun 12 17:25:08 2006 from cos43x86-c01
-sh-3.00$
For comparison, the following qlogin attempt with the default telnet settings failed because telnet-server was not running on the compute node:
-bash-3.00$ qlogin
Your job 12 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 12 has been successfully scheduled.
Establishing telnet session to host c02 ...
Trying 192.168.230.12...
Connected to c02 (192.168.230.12).
Escape character is '^]'.
Connection closed by foreign host.
telnet exited with exit code 1
-bash-3.00$
[root@accdemo ~]# ssh head
Last login: Thu Oct 12 19:57:46 2006
[root@head ~]# uname -a
Linux head 2.6.9-42.EL #1 Tue Aug 15 09:30:48 BST 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c01                     lx24-amd64      1     -   88.0M       -  256.0M       -
c02                     lx24-amd64      1  0.08  119.4M   19.4M  256.0M     0.0
[root@head ~]# qconf -de c01
Host object "c01" is still referenced in cluster queue "all.q".
[root@head ~]# qconf -mhgrp "@allhosts"
root@head modified "@allhosts" in host group list
[root@head ~]# qconf -shgrp "@allhosts"
group_name @allhosts
hostlist c00
[root@head ~]# qconf -de c01
root@head removed "c01" from execution host list
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c02                     lx24-amd64      1  0.07  119.4M   19.4M  256.0M     0.0
[root@head ~]#
[root@head ~]# qconf -mhgrp "@allhosts"
root@head modified "@allhosts" in host group list
[root@head ~]# qconf -shgrp "@allhosts"
group_name @allhosts
hostlist c00 c01 c02
[root@head ~]# ssh c01 "/etc/init.d/sgeexecd stop ; /etc/init.d/sgeexecd start"
Shutting down Grid Engine execution daemon
starting sge_execd
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c01                     lx24-amd64      1  0.10   88.0M   18.0M  256.0M     0.0
c02                     lx24-amd64      1  0.03  119.4M   19.3M  256.0M     0.0
[root@head ~]#
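The two sessions above edit the host group interactively with qconf -mhgrp. For scripted use, the same steps can probably be expressed with the qconf -dattr and -aattr options, which change one attribute without opening an editor. The sketch below assumes your qconf version supports them (check the qconf man page), and the function names are hypothetical. Both helpers only print the commands so they can be reviewed first.

```shell
# Dry-run helpers: they only print commands, nothing is executed.

# Print the steps that retire an execution host, mirroring the first session.
# $1: host name, $2: host group (defaults to @allhosts)
sge_remove_host_cmds() {
    host=$1
    group=${2:-@allhosts}
    # Drop the host from the group first, otherwise "qconf -de" refuses
    # because the host is still referenced by the cluster queue.
    echo "qconf -dattr hostgroup hostlist $host $group"
    echo "qconf -de $host"
}

# Print the steps that bring the host back, mirroring the second session.
sge_readd_host_cmds() {
    host=$1
    group=${2:-@allhosts}
    echo "qconf -aattr hostgroup hostlist $host $group"
    echo "ssh $host '/etc/init.d/sgeexecd stop ; /etc/init.d/sgeexecd start'"
}
```

Usage: "sge_remove_host_cmds c01" prints the two removal commands for c01. If the host is also named directly in a queue (for example in a per-host slots entry), qconf -de may still refuse until that reference is removed as well.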
[root@shark IBQ]# qconf -shgrpl
@allhosts
@ibhosts
[root@shark IBQ]# qconf -shgrp @ibhosts
group_name @ibhosts
hostlist shark-c00 shark-c01 shark-c02 shark-c03 shark-c04 shark-c05 shark-c06 \
         shark-c07 shark-c08 shark-c09 shark-c10 shark-c11 shark-c12 shark-c13 \
         shark-c14 shark-c16 shark-c17 shark-c18 shark-c19 shark-c20 shark-c21 \
         shark-c22
[root@shark IBQ]# qconf -spl
lam-eth
make
mpich-eth
mpich-ib
[root@shark IBQ]# qconf -sp mpich-ib
pe_name           mpich-ib
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /var/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /var/sge/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
[root@shark IBQ]#
[root@shark IBQ]# cat IBQ
qname                 ib.q
hostlist              @ibhosts
seq_no                0
load_thresholds       np_load_avg=4
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               mpich-ib
rerun                 FALSE
slots                 0,[shark-c00=2],[shark-c01=2], \
                      [shark-c02=2],[shark-c03=2], \
                      [shark-c04=2],[shark-c05=2], \
                      [shark-c06=2],[shark-c07=2], \
                      [shark-c08=2],[shark-c09=2], \
                      [shark-c10=2],[shark-c11=2], \
                      [shark-c12=2],[shark-c13=2], \
                      [shark-c14=2],[shark-c15=2], \
                      [shark-c16=2],[shark-c17=2], \
                      [shark-c18=2],[shark-c19=2], \
                      [shark-c20=2],[shark-c21=2], \
                      [shark-c22=2]
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
[root@shark IBQ]#
qconf -Aq IBQ
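Typing the per-host slots entries by hand is error-prone: in the listing above, shark-c15 appears in the slots list although it is not a member of @ibhosts. A small generator such as the one below can build the value from the host names instead. This is only a sketch; make_slots_value is a made-up helper name, and the result still has to be pasted into the queue file before running qconf -Aq.

```shell
# Build the value of a queue's "slots" attribute from a list of hosts.
# $1: slots per host; remaining arguments: host names.
# The leading 0 is the default for hosts without an explicit entry.
make_slots_value() {
    per_host=$1
    shift
    value=0
    for h in "$@"; do
        value="$value,[$h=$per_host]"
    done
    echo "$value"
}
```

Usage: "make_slots_value 2 shark-c00 shark-c01" prints 0,[shark-c00=2],[shark-c01=2].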
Commands:
Always submit jobs with the following command format, especially if your script has #BSUB directives:
bsub < script.txt
If the script has no #BSUB directives, you can optionally use:
bsub script.txt
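The distinction matters because, as far as we know, bsub only interprets the #BSUB lines when the script arrives on standard input; given as an argument, the script is simply run as a command and the directives are treated as comments. The helper below (a sketch, with the hypothetical name bsub_cmd_for) prints which form suits a given script.

```shell
# Print the bsub invocation suited to a script: stdin redirection when the
# file carries #BSUB directives, plain argument form otherwise.
# $1: path to the job script
bsub_cmd_for() {
    if grep -q '^#BSUB' "$1"; then
        echo "bsub < $1"
    else
        echo "bsub $1"
    fi
}
```

Usage: "bsub_cmd_for script.txt" prints the command to run; it does not submit anything itself.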
List all jobs (including those in EXIT, DONE, RUN, PEND and SUSP status):
bjobs -a -u USERNAME
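With many jobs, a per-status count is often easier to read than the full listing. The sketch below assumes the default bjobs column layout, in which the status (STAT) is the third column, and uses a made-up function name, bjobs_status_counts. It does not handle the continuation lines that bjobs prints for jobs spanning several hosts.

```shell
# Count jobs per status from "bjobs -a" style output on stdin.
# Assumes the default column layout, where STAT is the third field.
bjobs_status_counts() {
    # Only lines starting with a numeric job ID are counted, which skips
    # the header line and messages such as "No job found".
    awk '$1 ~ /^[0-9]+$/ && NF >= 3 { n[$3]++ }
         END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}
```

Usage: "bjobs -a -u USERNAME | bjobs_status_counts" prints one line per status with the number of jobs in it.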
Sample script file:
#!/bin/sh
#BSUB -q QNAME
#BSUB -o %J.OUT
#BSUB -e %J.ERR
#BSUB -J JOBNAME
#BSUB -W hh:mm
myexecutable myargs1 myarg2