For example, you discover via a CMS SAM page that you are failing some test (it could equally be any other SAM page, such as LHCb). Click on the detailed output and you will see a reference to the job ID.
In this example the job ID, on t2ce05, contains the string: sOFavxScVKU-GbSYaCmx-A
grep sOFavxScVKU-GbSYaCmx-A /opt/edg/var/gatekeeper/grid-jobmap_20100906
This reveals the batch system job ID: lrmsID=2998805.t2torque02.physics.ox.ac.uk
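The lookup above can be wrapped in a couple of small shell functions. This is only a sketch: the function names lrms_id_for and job_number are made up for illustration, and it assumes nothing about the jobmap line beyond the lrmsID=... field shown above, ending at a space or a double quote.

```shell
# Print the lrmsID value recorded for a grid job string.
# Usage: lrms_id_for JOB_STRING JOBMAP_FILE
# Assumption: the matching line contains lrmsID=<value>, where <value>
# ends at the next space or double quote.
lrms_id_for() {
    grep -F -- "$1" "$2" | sed -n 's/.*lrmsID=\([^" ]*\).*/\1/p'
}

# Strip the server name, leaving the numeric batch job number.
# Usage: job_number 2998805.t2torque02.physics.ox.ac.uk
job_number() {
    printf '%s\n' "${1%%.*}"
}
```

With the values from this example, lrms_id_for sOFavxScVKU-GbSYaCmx-A /opt/edg/var/gatekeeper/grid-jobmap_20100906 should print the full lrmsID, and job_number on that value gives the number to search for on the batch server.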
On the batch server (t2torque02 in our case), either search the server log:
grep 2998805 /var/spool/pbs/server_logs/20100909
or use tracejob:
tracejob 2998805
The tracejob option is easier!
This will tell you which worker node ran the job. You can then look at that node to check for full disks, memory faults and so on, or for segfaults in the log files.
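If the output is long, the node name can be picked out with a small filter. This is a sketch under an assumption: that the text contains a field of the form exec_host=node/slot+node/slot, as Torque job records usually do. The function name exec_hosts is made up for illustration.

```shell
# Read log or tracejob text on stdin and print the distinct worker node
# names found in exec_host= fields.
# Assumption: the field looks like exec_host=node/slot[+node/slot...]
# and is followed by a space or the end of the line.
exec_hosts() {
    sed -n 's/.*exec_host=\([^ ]*\).*/\1/p' |
        tr '+' '\n' |      # one node/slot entry per line
        sed 's|/.*||' |    # drop the slot number
        sort -u
}
```

It could be used as tracejob 2998805 | exec_hosts, or fed from the grep on the server log. If nothing is printed, the field is probably named or wrapped differently on your system, so fall back to reading the output by eye.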
Now in reverse
A job is misbehaving on your node and you need to see who is running it.
The special case here is that it is an ATLAS pilot job, which does not have a normal grid job ID.
Get the PID from top, then use
pstree -H pid
to highlight the process's parents.
(Use pstree -A -H pid if you are in a PuTTY window on Windows.)
This reveals which PBS job it is.
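As a cross-check on the pstree output, the job ID can often be read straight from the process environment. This is a sketch, assuming the batch system exported PBS_JOBID into the job's environment and that you can read /proc/PID/environ (normally you need to be root). The function name pbs_jobid_of and its optional second argument are made up for illustration.

```shell
# Print the PBS_JOBID found in the environment of process PID, if any.
# Usage: pbs_jobid_of PID [PROC_ROOT]
# PROC_ROOT defaults to /proc; it is only there so the function can be
# tried against a copy of an environ file.
pbs_jobid_of() {
    tr '\000' '\n' < "${2:-/proc}/$1/environ" | sed -n 's/^PBS_JOBID=//p'
}
```

If this prints nothing, the process may have been started with a cleaned environment, in which case the pstree route is the one to trust.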
The job can then be traced on the PanDA monitor, using the search facility on the left-hand toolbar.
This gives the job details, including the user's name. A GGUS ticket can then be raised against ATLAS asking for the user to be informed.