SGE Notes

Q: Is there an introduction to SGE usage?

A: See the documentation section of our website for an introduction to basic SGE usage (OpenOffice presentation).

Q: A node is inaccessible because its queue instance is flagged as being in the "E" (error) state, as in the example below. How can this be fixed?

A: First make sure the underlying problem (e.g. a hardware or network fault) has been fixed. The listing below shows node c02 in the error state (E):

[root ~]# qstat -f|more
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@c00                BIP   0/2       0.00     lx24-amd64
----------------------------------------------------------------------------
all.q@c02                BIP   0/2       0.00     lx24-amd64    E
----------------------------------------------------------------------------
all.q@c03                BIP   1/2       1.00     lx24-amd64

Issuing "qmod -c all.q@c02" will clear the error state and make the node available for further job runs.

[root@ ~]# qmod -c all.q@c02
root@clu1 changed state of "all.q@c02" (no error)
[root@ ~]# qstat -f|more
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@c00                BIP   0/2       0.00     lx24-amd64
----------------------------------------------------------------------------
all.q@c02                BIP   0/2       0.00     lx24-amd64
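With many nodes it can help to pick out the queue instances in error state automatically. The following is a minimal sketch, not an SGE tool: the function name list_error_queues is made up for illustration, and it assumes the six-column "qstat -f" layout shown above, where the states column is the sixth field.

```shell
#!/bin/sh
# Sketch: print queue instances whose states column contains "E".
# Reads `qstat -f` output on stdin. Assumes the layout shown above:
# queue instance lines contain "@" in the first field and carry the
# state flags (if any) in the sixth field.
list_error_queues() {
    awk 'NF >= 6 && $1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Hypothetical usage (dry run: prints the qmod commands instead of running them):
#   qstat -f | list_error_queues | while read -r q; do echo qmod -c "$q"; done
```

Remove the "echo" only after the underlying problem on each listed node has been fixed.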

FlexLM Integration

SGE ports have been standardised by IANA.

See http://gridengine.info/articles/2006/09/19/sge-gets-registered-iana-port-numbers

Please use the following section in your /etc/services:

sge_qmaster	6444/tcp   # Grid Engine Qmaster Service
sge_qmaster	6444/udp   # Grid Engine Qmaster Service
sge_execd	6445/tcp   # Grid Engine Execution Service
sge_execd	6445/udp   # Grid Engine Execution Service
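To confirm that a node's services file carries these entries, something like the following can be run on each node. It is only a sketch: the function name check_sge_services and its optional file argument are illustrative and not part of SGE.

```shell
#!/bin/sh
# Sketch: report whether a services file maps the SGE service names
# to the IANA-registered ports (6444 and 6445, tcp and udp).
# Defaults to /etc/services; pass another path as the first argument.
check_sge_services() {
    file=${1:-/etc/services}
    status=0
    for entry in "sge_qmaster 6444" "sge_execd 6445"; do
        name=${entry% *}
        port=${entry#* }
        for proto in tcp udp; do
            if grep -Eq "^${name}[[:space:]]+${port}/${proto}([[:space:]]|\$)" "$file"; then
                echo "ok: $name $port/$proto"
            else
                echo "missing: $name $port/$proto"
                status=1
            fi
        done
    done
    return $status
}

# Hypothetical usage:
#   check_sge_services || echo "please update /etc/services"
```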

How to use SSH for qrsh, qlogin, qsh

See http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html

Assuming a homogeneous cluster, run qconf -mconf on your master and change the SGE defaults from:

qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind

Delete these lines and add the following:

rsh_daemon                   /usr/sbin/sshd -i
rlogin_daemon                /usr/sbin/sshd -i
qlogin_daemon                /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
qlogin_command               /var/sge/ql.sh

where ql.sh is the qlogin_wrapper script and looks like this:

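#!/bin/sh
# SGE calls this wrapper with the target host and port as arguments.
HOST=$1
PORT=$2
# -X enables X11 forwarding; -p selects the port passed in by SGE.
/usr/bin/ssh -X -p "$PORT" "$HOST"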

Note that ql.sh must be available at the same pathname on all nodes. The new configuration becomes active as soon as it is saved. Ensure that each user's ssh key pair and authorized_keys file have been set up to accept passwordless logins from any node to any other node. Here's a sample session:

-sh-3.00$ source /var/sge/vmx86/common/settings.sh
-sh-3.00$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@cos43x86-c00             BIP   0/1       0.02     lx24-x86
----------------------------------------------------------------------------
all.q@cos43x86-c01             BIP   1/1       0.05     lx24-x86
43 0.55500 QLOGIN     demo00       r     06/13/2006 22:41:51     1
----------------------------------------------------------------------------
all.q@cos43x86-c02             BIP   1/1       0.03     lx24-x86
45 0.55500 QLOGIN     demo00       r     06/13/2006 22:42:15     1
-sh-3.00$ qlogin
Your job 46 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 46 has been successfully scheduled.
Establishing /var/sge/ql.sh session to host cos43x86-c00 ...
Last login: Mon Jun 12 17:25:08 2006 from cos43x86-c01
-sh-3.00$
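The passwordless-login requirement mentioned above can be checked before users try qlogin. This is a minimal sketch under a few assumptions: the function name check_passwordless_ssh and the SSH_BIN override variable are made up for illustration, and the check only covers logins from the node it is run on, so it would need to be repeated from each node for full any-to-any coverage.

```shell
#!/bin/sh
# Sketch: check that non-interactive (key-based) ssh works to each host.
# SSH_BIN is an illustrative override hook, not an SGE or OpenSSH setting.
SSH_BIN=${SSH_BIN:-/usr/bin/ssh}

check_passwordless_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
            </dev/null >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}

# Hypothetical usage, run as the user who will call qlogin:
#   check_passwordless_ssh c00 c01 c02
```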

The following attempt, made with the default telnet settings, failed because telnet-server was not running on the compute node:

-bash-3.00$ qlogin
Your job 12 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 12 has been successfully scheduled.
Establishing telnet session to host c02 ...
Trying 192.168.230.12...
Connected to c02 (192.168.230.12).
Escape character is '^]'.
Connection closed by foreign host.
telnet exited with exit code 1
-bash-3.00$   

Deleting exec nodes from SGE

[root@accdemo ~]# ssh head
Last login: Thu Oct 12 19:57:46 2006
[root@head ~]# uname -a
Linux head 2.6.9-42.EL #1 Tue Aug 15 09:30:48 BST 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c01                     lx24-amd64      1     -   88.0M       -  256.0M       -
c02                     lx24-amd64      1  0.08  119.4M   19.4M  256.0M     0.0
[root@head ~]# qconf -de c01
Host object "c01" is still referenced in cluster queue "all.q".
[root@head ~]# qconf -mhgrp "@allhosts"
root@head modified "@allhosts" in host group list
[root@head ~]# qconf -shgrp "@allhosts"
group_name @allhosts
hostlist c00
[root@head ~]# qconf -de c01
root@head removed "c01" from execution host list
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c02                     lx24-amd64      1  0.07  119.4M   19.4M  256.0M     0.0
[root@head ~]#
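The session above edits the host group interactively with qconf -mhgrp. For scripting, the same two steps might be done non-interactively with qconf -dattr, as sketched below. Treat this as an assumption to check against your qconf man page: the function name remove_exec_host and the RUN switch are made up for illustration, and the sketch assumes the host is referenced only through a single host group. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: print (or run) the commands that remove an execution host.
# RUN is an illustrative switch: when unset, the commands are only echoed;
# set RUN="" to really execute them.
remove_exec_host() {
    host=$1
    group=${2:-@allhosts}
    run=${RUN-echo}
    # Drop the host from the host group so no cluster queue references it.
    $run qconf -dattr hostgroup hostlist "$host" "$group" || return 1
    # Remove the host from the execution host list.
    $run qconf -de "$host"
}

# Hypothetical usage (dry run):
#   remove_exec_host c01
```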

Adding back a deleted Node

[root@head ~]# qconf -mhgrp "@allhosts"
root@head modified "@allhosts" in host group list
[root@head ~]# qconf -shgrp "@allhosts"
group_name @allhosts
hostlist c00 c01 c02
[root@head ~]# ssh c01 "/etc/init.d/sgeexecd stop ; /etc/init.d/sgeexecd start"
Shutting down Grid Engine execution daemon
starting sge_execd
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c01                     lx24-amd64      1  0.10   88.0M   18.0M  256.0M     0.0
c02                     lx24-amd64      1  0.03  119.4M   19.3M  256.0M     0.0
[root@head ~]#                                                                           
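After editing the host group it is worth confirming that the node really is back in the hostlist. A small sketch that extracts the host names from "qconf -shgrp" output follows. The function name hostgroup_hosts is illustrative, and the handling of long lists assumes that wrapped lines end in a lone backslash.

```shell
#!/bin/sh
# Sketch: print one host per line from `qconf -shgrp` output on stdin.
# Handles the single-line form shown above; continuation lines are
# assumed to be marked by a trailing backslash on the previous line.
hostgroup_hosts() {
    awk '
        $1 == "hostlist" { collecting = 1; first = 2 }
        collecting {
            more = 0
            for (i = first; i <= NF; i++) {
                if ($i == "\\") more = 1
                else print $i
            }
            first = 1
            collecting = more
        }'
}

# Hypothetical usage:
#   qconf -shgrp @allhosts | hostgroup_hosts | grep -qx c01 && echo "c01 is in @allhosts"
```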

Adding a new SGE Queue for InfiniBand:

Notes on LSF

 

Commands:

Always submit with the following command format. This matters especially if your script contains #BSUB directives, because by default bsub parses them only when the script is read from standard input:

	bsub < script.txt
	

If the script has no #BSUB directives, you can optionally use:

	bsub script.txt
	

List all of a user's jobs (including those in EXIT, DONE, RUN, PEND and SUSP status):

	bjobs -a -u USERNAME
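When the list is long, a per-status count is often easier to read. The sketch below assumes the default bjobs column layout (JOBID USER STAT QUEUE ...), with the status in the third column. The function name count_by_status is made up for illustration.

```shell
#!/bin/sh
# Sketch: count jobs per status from default `bjobs` output on stdin.
# Only lines whose first field is a purely numeric job ID are counted,
# which skips the header and any wrapped host-list lines.
count_by_status() {
    awk '$1 ~ /^[0-9]+$/ { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Hypothetical usage:
#   bjobs -a -u USERNAME | count_by_status
```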
	

Sample script file:

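	#!/bin/sh
	# Queue to submit to
	#BSUB -q QNAME
	# Standard output and error files (%J expands to the job ID)
	#BSUB -o %J.OUT
	#BSUB -e %J.ERR
	# Job name
	#BSUB -J JOBNAME
	# Run time limit
	#BSUB -W hh:mm

	myexecutable myargs1 myarg2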
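Before submitting, it can be useful to see which embedded options a script actually contains. The sketch below rests on an assumption worth verifying for your LSF version: only lines that begin exactly with "#BSUB" are read as options, so a line such as "# BSUB ..." (with a space after the hash) would be treated as a plain comment. Both function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: inspect the embedded options of a job script.
# Assumption: only lines starting exactly with "#BSUB" count as options.

# Print the lines that look like valid embedded options.
list_bsub_directives() {
    grep -E '^#BSUB[[:space:]]' "$1"
}

# Print (with line numbers) lines that look like mistyped directives,
# e.g. "# BSUB -J name", which would probably be ignored.
suspect_bsub_lines() {
    grep -nE '^[[:space:]]*#[[:space:]]+BSUB[[:space:]]' "$1"
}

# Hypothetical usage:
#   list_bsub_directives script.txt
#   suspect_bsub_lines script.txt && echo "check the lines above"
```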