SGE Notes

Q: Is there an introduction to SGE usage?

A: Please see the documentation section of our website for an introduction to basic SGE usage (OpenOffice presentation).

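Q: A node is inaccessible because its queue instance is flagged as being in the Error state (E), as in the example below. How can this be fixed?

A: First ensure that the underlying problem (e.g. a hardware or network fault) has been solved. The affected queue instance shows E in the states column: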

[root ~]# qstat -f|more
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@c00                BIP   0/2       0.00     lx24-amd64
----------------------------------------------------------------------------
all.q@c02                BIP   0/2       0.00     lx24-amd64    E
----------------------------------------------------------------------------
all.q@c03                BIP   1/2       1.00     lx24-amd64

Issuing "qmod -c all.q@c02" will clear the error state and make the node available for further job runs.

[root@ ~]# qmod -c all.q@c02
root@clu1 changed state of "all.q@c02" (no error)
[root@ ~]# qstat -f|more
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@c00                BIP   0/2       0.00     lx24-amd64
----------------------------------------------------------------------------
all.q@c02                BIP   0/2       0.00     lx24-amd64
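When several queue instances are in the error state, their names can be pulled out of the qstat -f listing instead of being read off by eye. The following is only a minimal sketch and not part of the original procedure: the function name error_queues is made up for illustration, and it assumes the six-column queue-line layout shown above, with the state flags in the last column.

```shell
#!/bin/sh
# error_queues: read `qstat -f` output on stdin and print every queue
# instance (queue@host) whose states column contains the E flag.
# Assumption: queue lines have six fields, the sixth being the states.
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Possible use on the master, once the underlying fault is fixed:
#   qstat -f | error_queues | while read -r q; do qmod -c "$q"; done
```

Job lines listed under a queue instance do not contain "@" in their first field, so they are skipped.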

FlexLM Integration

SGE ports have been standardised by IANA; see http://gridengine.info/articles/2006/09/19/sge-gets-registered-iana-port-numbers

Please add the following entries to your /etc/services:

sge_qmaster	6444/tcp   # Grid Engine Qmaster Service
sge_qmaster	6444/udp   # Grid Engine Qmaster Service
sge_execd	6445/tcp   # Grid Engine Execution Service
sge_execd	6445/udp   # Grid Engine Execution Service
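To confirm that a node's services file really carries these four entries, a small check can be run on each host. This is a hedged sketch: the function name missing_sge_services is invented here, and it only looks at the name and port/protocol fields, not at aliases or comments.

```shell
#!/bin/sh
# missing_sge_services [FILE]: print each expected Grid Engine entry
# that is absent from a services file (default: /etc/services).
missing_sge_services() {
    file=${1:-/etc/services}
    for entry in sge_qmaster:6444/tcp sge_qmaster:6444/udp \
                 sge_execd:6445/tcp sge_execd:6445/udp; do
        name=${entry%%:*}   # part before the colon
        port=${entry#*:}    # part after the colon, e.g. 6444/tcp
        # Match "<name><whitespace><port>" at the start of a line.
        grep -Eq "^${name}[[:space:]]+${port}([[:space:]]|\$)" "$file" ||
            echo "missing: $name $port"
    done
}

# Possible use: missing_sge_services /etc/services
```

No output means all four entries were found.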

How to use SSH for qrsh, qlogin, qsh

See http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html for background. Assuming a homogeneous cluster, run qconf -mconf on the master and change the SGE defaults from:

qlogin_command               telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_daemon                /usr/sbin/in.rlogind

Delete these lines and add the following:

rsh_daemon                   /usr/sbin/sshd -i
rlogin_daemon                /usr/sbin/sshd -i
qlogin_daemon                /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
qlogin_command               /var/sge/ql.sh

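where ql.sh is the qlogin wrapper script, which looks like this:

#!/bin/sh
# qlogin wrapper: SGE passes the target host and port as $1 and $2.
HOST=$1
PORT=$2
# Quote the variables and forward X11 over the SSH session.
/usr/bin/ssh -X -p "$PORT" "$HOST"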

Note that ql.sh must be available at the same pathname on all nodes. The new configuration becomes active as soon as it is saved. Ensure that the users' SSH key pairs and authorized_keys files have been set up to accept passwordless logins from any node to any other node. Here is a sample session:

-sh-3.00$ source /var/sge/vmx86/common/settings.sh
-sh-3.00$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@cos43x86-c00             BIP   0/1       0.02     lx24-x86
----------------------------------------------------------------------------
all.q@cos43x86-c01             BIP   1/1       0.05     lx24-x86
43 0.55500 QLOGIN     demo00       r     06/13/2006 22:41:51     1
----------------------------------------------------------------------------
all.q@cos43x86-c02             BIP   1/1       0.03     lx24-x86
45 0.55500 QLOGIN     demo00       r     06/13/2006 22:42:15     1
-sh-3.00$ qlogin
Your job 46 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 46 has been successfully scheduled.
Establishing /var/sge/ql.sh session to host cos43x86-c00 ...
Last login: Mon Jun 12 17:25:08 2006 from cos43x86-c01
-sh-3.00$

By contrast, the following attempt with the default telnet settings failed because telnet-server was not running on the compute node:

-bash-3.00$ qlogin
Your job 12 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 12 has been successfully scheduled.
Establishing telnet session to host c02 ...
Trying 192.168.230.12...
Connected to c02 (192.168.230.12).
Escape character is '^]'.
Connection closed by foreign host.
telnet exited with exit code 1
-bash-3.00$   
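The passwordless-login requirement noted above can be checked before switching qlogin over to SSH. The sketch below is an illustration only: check_passwordless and the SSH override variable are names made up here (the variable exists so that the function can be tried without a real cluster). BatchMode=yes makes ssh fail instead of prompting for a password.

```shell
#!/bin/sh
# SSH can be overridden for a dry run; by default the real ssh is used.
SSH=${SSH:-ssh}

# check_passwordless HOST...: report whether each host accepts a
# non-interactive login; returns non-zero if any host fails.
check_passwordless() {
    failed=0
    for host in "$@"; do
        if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
               </dev/null >/dev/null 2>&1; then
            echo "ok $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Possible use, run as the user in question on each node:
#   check_passwordless c00 c01 c02
```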

Deleting exec nodes from SGE

[root@accdemo ~]# ssh head
Last login: Thu Oct 12 19:57:46 2006
[root@head ~]# uname -a
Linux head 2.6.9-42.EL #1 Tue Aug 15 09:30:48 BST 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c01                     lx24-amd64      1     -   88.0M       -  256.0M       -
c02                     lx24-amd64      1  0.08  119.4M   19.4M  256.0M     0.0
[root@head ~]# qconf -de c01
Host object "c01" is still referenced in cluster queue "all.q".
[root@head ~]# qconf -mhgrp "@allhosts"
root@head modified "@allhosts" in host group list
[root@head ~]# qconf -shgrp "@allhosts"
group_name @allhosts
hostlist c00
[root@head ~]# qconf -de c01
root@head removed "c01" from execution host list
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c02                     lx24-amd64      1  0.07  119.4M   19.4M  256.0M     0.0
[root@head ~]#
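The session above edits the host group interactively with qconf -mhgrp. For scripting, the same two steps can be sketched non-interactively with qconf -dattr, which removes a single value from a list attribute. Treat this as an assumption-laden example rather than the procedure used above: remove_exec_host and the QCONF override variable are names invented for illustration.

```shell
#!/bin/sh
# QCONF can be overridden (e.g. QCONF="echo qconf") to preview the
# commands without touching the cluster; it is not an SGE setting.
QCONF=${QCONF:-qconf}

# remove_exec_host HOST [HOSTGROUP]
remove_exec_host() {
    host=$1
    group=${2:-@allhosts}
    # Drop the host from the host group so that no queue references it...
    $QCONF -dattr hostgroup hostlist "$host" "$group" || return 1
    # ...then delete the execution host object itself.
    $QCONF -de "$host"
}

# Possible use on the master: remove_exec_host c01
```

If the host is referenced directly in a queue's hostlist rather than through a host group, qconf -de will still refuse, as in the first attempt shown above.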

Adding back a deleted Node

[root@head ~]# qconf -mhgrp "@allhosts"
root@head modified "@allhosts" in host group list
[root@head ~]# qconf -shgrp "@allhosts"
group_name @allhosts
hostlist c00 c01 c02
[root@head ~]# ssh c01 "/etc/init.d/sgeexecd stop ; /etc/init.d/sgeexecd start"
Shutting down Grid Engine execution daemon
starting sge_execd
[root@head ~]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
c00                     lx24-amd64      1     -  119.4M       -  256.0M       -
c01                     lx24-amd64      1  0.10   88.0M   18.0M  256.0M     0.0
c02                     lx24-amd64      1  0.03  119.4M   19.3M  256.0M     0.0
[root@head ~]#                                                                           
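In the listing above, c01 shows a numeric LOAD value once its execution daemon has been restarted, whereas a host whose daemon is not reporting shows "-". That check can be sketched as a small filter over qhost output. The function name host_reports_load is hypothetical, and the sketch assumes the column order shown above (LOAD is the fourth column).

```shell
#!/bin/sh
# host_reports_load HOST: read `qhost` output on stdin; succeed if HOST
# is listed with a LOAD value other than "-".
host_reports_load() {
    awk -v h="$1" '$1 == h && $4 != "-" { ok = 1 } END { exit !ok }'
}

# Possible use on the master:
#   qhost | host_reports_load c01 && echo "c01 is reporting load"
```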

Adding a new SGE queue for InfiniBand:

  • Create a list of hosts:
    	[root@shark IBQ]# qconf -shgrpl
    	@allhosts
    	@ibhosts
    	[root@shark IBQ]# qconf -shgrp @ibhosts
    	group_name @ibhosts
    	hostlist shark-c00 shark-c01 shark-c02 shark-c03 shark-c04 shark-c05 shark-c06 \
    	shark-c07 shark-c08 shark-c09 shark-c10 shark-c11 shark-c12 shark-c13 \
    	shark-c14 shark-c16 shark-c17 shark-c18 shark-c19 shark-c20 shark-c21 \
    	shark-c22
    	
  • Define a parallel environment:
    	[root@shark IBQ]# qconf -spl
    	lam-eth
    	make
    	mpich-eth
    	mpich-ib
    	[root@shark IBQ]# qconf -sp mpich-ib
    	pe_name           mpich-ib
    	slots             999
    	user_lists        NONE
    	xuser_lists       NONE
    	start_proc_args   /var/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
    	stop_proc_args    /var/sge/mpi/stopmpi.sh
    	allocation_rule   $fill_up
    	control_slaves    TRUE
    	job_is_first_task FALSE
    	urgency_slots     min
    	[root@shark IBQ]#            
    	
  • Create a queue template file:
    	[root@shark IBQ]# cat IBQ
    	qname                 ib.q
    	hostlist              @ibhosts
    	seq_no                0
    	load_thresholds       np_load_avg=4
    	suspend_thresholds    NONE
    	nsuspend              1
    	suspend_interval      00:05:00
    	priority              0
    	min_cpu_interval      00:05:00
    	processors            UNDEFINED
    	qtype                 BATCH INTERACTIVE
    	ckpt_list             NONE
    	pe_list               mpich-ib
    	rerun                 FALSE
    	slots                 0,[shark-c00=2],[shark-c01=2], \
    	[shark-c02=2],[shark-c03=2], \
    	[shark-c04=2],[shark-c05=2], \
    	[shark-c06=2],[shark-c07=2], \
    	[shark-c08=2],[shark-c09=2], \
    	[shark-c10=2],[shark-c11=2], \
    	[shark-c12=2],[shark-c13=2], \
    	[shark-c14=2],[shark-c15=2], \
    	[shark-c16=2],[shark-c17=2], \
    	[shark-c18=2],[shark-c19=2], \
    	[shark-c20=2],[shark-c21=2], \
    	[shark-c22=2]
    	tmpdir                /tmp
    	shell                 /bin/sh
    	prolog                NONE
    	epilog                NONE
    	shell_start_mode      posix_compliant
    	starter_method        NONE
    	suspend_method        NONE
    	resume_method         NONE
    	terminate_method      NONE
    	notify                00:00:60
    	owner_list            NONE
    	user_lists            NONE
    	xuser_lists           NONE
    	subordinate_list      NONE
    	complex_values        NONE
    	projects              NONE
    	xprojects             NONE
    	calendar              NONE
    	initial_state         default
    	s_rt                  INFINITY
    	h_rt                  INFINITY
    	s_cpu                 INFINITY
    	h_cpu                 INFINITY
    	s_fsize               INFINITY
    	h_fsize               INFINITY
    	s_data                INFINITY
    	h_data                INFINITY
    	s_stack               INFINITY
    	h_stack               INFINITY
    	s_core                INFINITY
    	h_core                INFINITY
    	s_rss                 INFINITY
    	h_rss                 INFINITY
    	s_vmem                INFINITY
    	h_vmem                INFINITY
    	[root@shark IBQ]#
    	
  • Add the queue:
    	qconf -Aq IBQ
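Before adding the queue, it can be worth checking that every host with a per-host slots override is also a member of the host group. In the listings above, for instance, shark-c15 has a slots entry but does not appear in the @ibhosts hostlist. The following is a rough sketch under stated assumptions: the function names are invented, the host group output is assumed to have been saved to a file first, and nested host groups (names starting with @) are ignored.

```shell
#!/bin/sh
# slot_hosts: read a queue template on stdin and print the hosts that
# have a [host=N] override.
slot_hosts() {
    tr ' ,\\\t' '\n\n\n\n' |
        sed -n 's/^\[\([^=]*\)=[0-9][0-9]*\]$/\1/p' | sort -u
}

# hgrp_hosts: read `qconf -shgrp GROUP` output on stdin and print the
# member hosts, one per line.
hgrp_hosts() {
    tr ' \\\t' '\n\n\n' |
        grep -v -e '^$' -e '^group_name$' -e '^hostlist$' -e '^@' | sort -u
}

# unlisted_slot_hosts TEMPLATE HGRP_FILE: print hosts that have a slots
# override in TEMPLATE but are missing from the saved host group output.
unlisted_slot_hosts() {
    slots=$(mktemp) && hosts=$(mktemp) || return 1
    slot_hosts < "$1" > "$slots"
    hgrp_hosts < "$2" > "$hosts"
    comm -23 "$slots" "$hosts"
    rm -f "$slots" "$hosts"
}

# Possible use:
#   qconf -shgrp @ibhosts > ibhosts.txt
#   unlisted_slot_hosts IBQ ibhosts.txt
```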
    	

Notes on LSF

Commands:

By default, always submit jobs with the following command format. It is required if your script contains #BSUB directives, because bsub only parses them when the script is read from standard input:

	bsub < script.txt
	

If the script contains no #BSUB directives, you can optionally use:

	bsub script.txt
	

List all of a user's jobs, including those in the EXIT, DONE, RUN, PEND and SUSP states:

	bjobs -a -u USERNAME
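With many jobs, a per-state count is often easier to read than the full listing. The sketch below is an illustration, not an LSF feature: count_by_state is a made-up name, and it assumes the default bjobs layout, where the first line is a header and STAT is the third column. Wrapped continuation lines for multi-host jobs are not handled.

```shell
#!/bin/sh
# count_by_state: read default `bjobs` output on stdin and print one
# "STATE COUNT" line per job state, sorted by state name.
count_by_state() {
    awk 'NR > 1 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Possible use: bjobs -a -u USERNAME | count_by_state
```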
	

Sample script file:

	#!/bin/sh
	#BSUB -q QNAME
	#BSUB -o %J.OUT
	#BSUB -e %J.ERR
	#BSUB -J JOBNAME
	#BSUB -W hh:mm
	
	myexecutable myarg1 myarg2
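Because a mistyped directive is silently treated as an ordinary comment, it can help to list what a script actually declares before submitting it. This is a hedged sketch: bsub_directives is a hypothetical helper name, and it assumes that bsub only recognises lines that begin with exactly #BSUB.

```shell
#!/bin/sh
# bsub_directives FILE: print the lines expected to be read as embedded
# bsub options, and warn on stderr about near-misses such as "# BSUB".
bsub_directives() {
    grep '^#BSUB' "$1" || true
    grep -n '^#[[:space:]][[:space:]]*BSUB' "$1" |
        sed 's/^/warning: not a directive, line /' >&2
    return 0
}

# Possible use: bsub_directives script.txt && bsub < script.txt
```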