System Operations

IT-ST-FDO

Index:

- Bash

- Common Operations

- EOS

- Filesystem Operations

- CASTOR

- Interventions

- How to implement the SSO on EOSCOCKPIT Machine

- XrdFed

- Rundeck

- Gitlab

- SAMBA

Please note that these commands and procedures may be out of date. Verify them before use.

Bash: Shell command line --> “command” “options” “arguments”.

- scp [source] [dest] --> scp log. root@lxbst2277:/etc/.conf --> Secure copy to other machines.

- cp [options]... Source Dest --> Copy Source to Dest or Directory.

- lp [options]...[file...] --> Send files to a printer.

- cd [Options] [Directory] --> Change Directory - change the current working directory to a specific folder.

- pwd [-LP] --> Print Working Directory.

- ls [Options]... [File]... --> List information about files.

- ll = ls -l [file] --> List directory contents using long list format.

- cat [Options] [File]... --> Concatenate and print (display) the content of files.

- grep [options] PATTERN [FILE...] --> Search file(s) for specific text.

- sort [options] [file...] --> Sort text files. Sort, merge, or compare all lines from files given.

- cut [OPTION]... [FILE]... --> Divide a file into several parts (columns).

- tr [options]... SET1 [SET2] --> Translate, squeeze, and/or delete characters.

- mv [options]... Source... Directory --> Move or rename files or directories.

- source filename [arguments] --> Read and execute commands from the filename argument in the current shell context.

- mkdir [Options] folder... --> Create new folder(s), if they do not already exist.

- xrdcp [options] source destination --> Copies one or more files from one location to another.

- df [option]... [file]... --> Disk Free - display free disk space.

- alias [-p] [name[=value] ...] --> Create an alias that substitutes a string for a word.

- awk 'Program' Input-File1 Input-File2 --> Find and Replace text, database sort/validate/index.

- fs [Options --> la, sa, lq] [Directory] --> Show or set who can access files or directories on AFS (la = listacl, sa = setacl, lq = listquota).

- rm [options]... file... --> Remove files (delete/unlink).

- man [command name] --> Format and display help pages.

- ssh --> ssh -l root [machine_name] or ssh [user@][hostname] --> OpenSSH SSH client (remote login program).

- kinit [-R] --> Requests renewal of the ticket-granting ticket.

- history --> Command Line history. → to remove a line from the history → history -d [line.number]

- service [hostname or script-name] COMMAND --> Show the status or start, stop, and restart the daemons and other services.

COMMAND = start – stop – status - restart

- vimdiff [options] [file1] [file2] --> Edit two or more versions of a file and show the differences.

- uptime --> Tell how long the system has been running.

- /dev/urandom --> It is a random number generator to create random files.

- less [path][script] --> See all code of the script

- Parsing --> [script_path]... --[parameter1]... --[parameter2]... --> How to pass parameters to a script for parsing.

- CTRL+R --> Reverse search through the command history.

- locate [word] --> Useful to find where the logs are.

- iftop – iotop – iostat --> Network and disk I/O monitoring.

- uname -a --> Check the name and OS version of the machine.

- ifconfig → Check the network information of the machine: IP, MAC address, etc.

- eos-disk-menu.py

- uname -r → check kernel version of a machine (if you want to compare what is in GRUB → cat /boot/grub/grub.conf)

- service iptables stop → Stop the firewall

- wall → to send messages to everybody connected in one machine

- who → to see who is connected in the console

- last --> show last logged in users, crashes and reboots
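As a sketch of the /dev/urandom entry in the list above, the following creates a small file with random content. The temporary file and the 1024-byte size are assumptions chosen for illustration.

```shell
# Create a temporary file and fill it with 1024 random bytes from /dev/urandom.
random_file=$(mktemp)
head -c 1024 /dev/urandom > "$random_file"

# Count the bytes to confirm that the file was written.
random_size=$(wc -c < "$random_file" | tr -d ' ')
echo "$random_file: $random_size bytes"
```

Change the number after -c to get a larger or smaller file.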

Common Operations:

How to create a new directory and an empty file:

[root@lxfsrd06c03 ~]# mkdir /etc/castor/

[root@lxfsrd06c03 ~]# echo -n > /etc/castor/castor.conf

Special Characters:

# --> Comment. * --> String wildcard. \ --> Quote next character.

> --> Output redirect. < --> Input redirect. & --> Background job:

[command..] > [file] & --> Send output in background to a file.

| --> Pipe: redirect the output of a command into the standard input of another command instead of a file.

; --> Shell command separator, permits putting two or more commands on the same line.

` --> Back tick: everything you type between backticks is evaluated (executed) by the shell before the main command.
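The special characters above can be tried together in a scratch directory. This is a minimal sketch; the file names and the sample words are made up for the example.

```shell
# Work in a scratch directory so that nothing else is touched.
demo_dir=$(mktemp -d)

# ';' separates two commands on one line; '>' redirects output to a file
# ('>>' appends to the file instead of overwriting it).
echo "beta" > "$demo_dir/words.txt"; echo "alpha" >> "$demo_dir/words.txt"

# '<' feeds a file to standard input; '|' pipes one command into the next.
first_word=$(sort < "$demo_dir/words.txt" | head -1)

# Backticks run the inner command first and substitute its output.
line_count=`wc -l < "$demo_dir/words.txt" | tr -d ' '`

# '&' sends the job to the background; 'wait' blocks until it has finished.
sort "$demo_dir/words.txt" > "$demo_dir/sorted.txt" &
wait

echo "first=$first_word lines=$line_count"
```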

Automated Operations:

- for i in [host1] [host2] [host3] ...; do echo "[command] $i"; done | sh
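The loop above builds one command per host with echo and only runs them when the output is piped to sh. A cautious way to use it, sketched here with hypothetical host names and a harmless echo as the command, is to read the generated lines first and add "| sh" afterwards.

```shell
# Hypothetical host names, used only for illustration.
hosts="host1 host2 host3"

# Step 1: print the generated commands and read them before running anything.
generate() {
    for i in $hosts; do
        echo "echo checking $i"
    done
}
generate

# Step 2: once the list looks right, pipe the same output to sh to execute it.
result=$(generate | sh)
echo "$result"
```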

Parallel Login:

- wassh -t 99999 -l root [host1],[host2],[host3] "[command]"

File Permission:

- ls -l /afs/.../file.key --> Check the permissions currently set.

- chmod 600 /afs/.../file.key --> Set file permissions.

Me    Group   Other
RWX   RWX     RWX       (R = 4, W = 2, X = 1)
6     0       0    ------> /afs/.../file.key
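To see how the octal digits map to the rwx letters, the table above can be checked on a scratch file. The temporary file stands in for the real key file, whose path is site-specific.

```shell
# Scratch file standing in for the key file.
key_file=$(mktemp)

# 6 = 4 (read) + 2 (write) for the owner, 0 for the group, 0 for others.
chmod 600 "$key_file"

# Show the result in symbolic form, as "ls -l" prints it (first 10 characters).
perms=$(ls -l "$key_file" | cut -c1-10)
echo "$perms"
```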

How to make a script executable:

- chmod +x [script_path]

Python Debug --> pdb or strace -p [PID]

Service Now (SNOW) assignments:

In case of Hardware problems → Repair Services

In case of problems → Sys Admins Team

In case of a network cable problem → Facility operation (search the FE in snow)

How to recover AFS home directory from backup:

This procedure is for the WORK space. Concerning the user space have a look on the link below.

~/tmp$ fs mkm -dir test -vol work.$USER.backup

~/tmp$ ls test (copy stuff)

eventually ~/tmp$ fs rmm -dir test

https://cern.service-now.com/service-portal/article.do?n=KB0000430

To recover single files written after 6:30 pm of the previous day, use → afs restore

Check information about users and groups:

- getent passwd [user]

- getent group [GID]

- ls -l [path]

- id [user]

- groups [user]

- getacl [path]

- chown user:group [path] --> change ownership

- ACLs → http://eos.readthedocs.org/en/latest/configuration/permission.html

How to see if a machine is a headnode:

- cat /etc/sysconfig/eos

- nslookup [ip]

How to transform a vertical list into a horizontal one: for i in `cat /afs/cern.ch/user/a/afiorot/LISTS/abc`; do echo "$i "| tr '\n' ' '; done
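The same conversion can be tried on a small sample file. This is a sketch with placeholder entries; paste -s is shown as an alternative that is not in the original notes.

```shell
# Sample vertical list; the entries are placeholders.
list_file=$(mktemp)
printf 'node1\nnode2\nnode3\n' > "$list_file"

# tr replaces every newline with a space, joining the lines into one
# (a trailing space is left at the end).
horizontal=$(tr '\n' ' ' < "$list_file")
echo "$horizontal"

# paste -s does the same join and leaves no trailing separator.
joined=$(paste -s -d ' ' "$list_file")
echo "$joined"
```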

How to check the status of all the nodes in an instance:

[afiorot@aiadm054 ~]$ for i in `wassh - eoslhcb --list`; do roger show $i 2>/dev/null | tr '\n' ' ' | tr '[{}]:,' ' ';echo; done | awk '{print $1, $2, $3, $4, $9, $10, $8, $13, $14, $15, $16, $17, $18, $19, $20}' | sort

for i in `wassh -cl eos/cms/storage --list`; do roger show $i 2>/dev/null | tr '\n' ' ' | tr '[{}]:,' ' ';echo; done

2>/dev/null --> if the command prints errors, they are discarded

tr '\n' '|' --> if there are several values in a column, this puts them on one line

tr " " "\n" --> if there are several values on the same line, this puts them in a column

[root@p05508916e55509 ~]# cat /afs/cern.ch/user/a/afiorot/LISTS/ldap | tr '\n' '|' | sed 's/||/\n/g' | tr '|' ' ' | grep -E "gidNumber|uidNumber"

Service or process problems: if there is some problem or error messages with services or processes, restart it --> service [srvname] restart

When service goes down:

- Check which kind of error it is from the IT Status Board

How to compute some statistics:

[afiorot@c2atlassrv301 ~]$ zgrep atlspecial /var/log/castor/stagerd.log-201412* | grep atnight | tr ' ' '\n' | grep Filename | sort | uniq -c | sort -nr

[afiorot@c2atlassrv301 ~]$ zgrep atlspecial /var/log/castor/stagerd.log-201412* | grep atnight | tr ' ' '\n' | grep Type | sort | uniq -c | sort -nr
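Both commands above end with the same counting idiom: sort groups equal lines, uniq -c counts each group, and sort -nr puts the most frequent first. A minimal sketch on made-up input:

```shell
# Sample input: one value per line, with repeats (placeholder data).
counts=$(printf 'get\nput\nget\nget\nput\nstat\n' | sort | uniq -c | sort -nr)
echo "$counts"

# The most frequent value is on the first line; awk picks the value column.
top_value=$(echo "$counts" | head -1 | awk '{print $2}')
echo "most frequent: $top_value"
```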

How to see the mapping of a user: [afiorot@c2atlassrv301 ~]$ grep atlas003 /etc/grid-security/grid-mapfile

How to add a machine in Puppet: echo "c2repack-2.cern.ch 128.142.37.94 C8-60-00-1B-D8-D6" | ai-foreman addhost -o "SLC 6.7" --architecture "x86_64" --hostgroup "castor/c2repack/headnode" --environment "production" --ptable "Castor_MDRaid1_SystemOnSmallestDisks" -m "SLC"

How to check the list of files present in a CASTOR pool: /afs/cern.ch/project/castor/www/DiskPoolDump/lhcb.lhcbdisk.last.gz

How to create directory in EOS and list them:

- eos mkdir /eos/castorlhcb-decommission/lhcbt3.remains

- eos ls -l /eos/castorlhcb-decommission/

drwxrwsr-+ 1 afiorot c3 1 Jun 03 17:17 castor
drwxrwsr-+ 1 root root 2 Sep 14 15:31 lhcbdisk.remains
drwxrwsr-+ 1 root root 0 Sep 16 14:08 lhcbt3.remains

How to sum a list of numbers:

…......

17268959

942202576

289531559

[afiorot@wlustagemanu ~]$ awk '{ sum += $1 } END { print sum }' /afs/cern.ch/user/a/afiorot/list.txt

11780693487319
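The awk one-liner can be checked on a short list. This sketch sums only the three values visible in the listing above, so its result differs from the total of the full file.

```shell
# The three values visible in the listing above, one per line.
numbers_file=$(mktemp)
printf '17268959\n942202576\n289531559\n' > "$numbers_file"

# awk adds the first field of every line and prints the total at the end.
total=$(awk '{ sum += $1 } END { print sum }' "$numbers_file")
echo "$total"
```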

LEMON METRICS:

Exceptions:

If there is some unusual exception:

- search for it --> ls /etc/lemon/agent/metrics/ | grep [exception]

- take the ID

- look it up in https://metricmgr.cern.ch

How to read a metric:

In case I want to know how the following Notification works --> exception.packetsDropped

Find the notification in metricmgr.cern.ch –> look for the correlation, that in this case is:

Correlation: ((13367:1 eq 'interface') || (13367:3 > 300 ) || (13367:5 > 300))

The correlation explains how the exception works and which counters are taken into consideration.

13367:1 → the first number is the ID of the metric, while the second one is the index of a field within that metric. For details, have a look at the Metrics page in metricmgr.cern.ch → open the Metric class.

In the correlation are taken into account 3 fields (1-3-5) that correspond to:

1 InterfaceName

3 NumReceivedDroppedLastInterval

5 NumTransmittedDroppedLastInterval

At this point we know what the correlation means: if "NumReceivedDroppedLastInterval" or "NumTransmittedDroppedLastInterval" is more than 300 within a 5 min interval, the notification is raised.

- To check that the interval is 5 min → log in to metricmgr → open the notification → Edit → Period (300 sec in this case)
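The logic of the correlation above can be written as a small shell test. This is only a sketch of the condition for illustration: the function name is made up and it has nothing to do with how Lemon evaluates metrics internally.

```shell
# Returns success (0) when the notification would be raised.
# Arguments: interface name (field 1), packets dropped on receive (field 3),
# packets dropped on transmit (field 5). The threshold 300 is from the correlation.
packets_dropped_alarm() {
    name=$1; rx_dropped=$2; tx_dropped=$3
    [ "$name" = "interface" ] || [ "$rx_dropped" -gt 300 ] || [ "$tx_dropped" -gt 300 ]
}

if packets_dropped_alarm eth0 985 0; then
    echo "eth0: notification raised"
fi
if ! packets_dropped_alarm eth0 12 300; then
    echo "eth0: below threshold"
fi
```

The comparison is strict, so a value of exactly 300 does not raise the notification.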

How to disable a metric temporarily: lemon-host-check -d ID --duration=2678400 --reason="Investigation ongoing"

How to delete a JBOD configuration from a machine:

- df -kh

- hwraidman | grep JBOD --> put the output in an echo ""

- echo "output" | hwraidman destroy

- hwraidman info

In case a machine in CASTOR has no Mountpoints:

- login in

- df

- lsscsi

1) Run xfs_admin once for each partition. It shows the disks that exist: xfs_admin -l /dev/sdb1; xfs_admin -l /dev/sdc1; xfs_admin -l /dev/sdd1; xfs_admin -l /dev/sde1; xfs_admin -l /dev/sdf1; xfs_admin -l /dev/sdg1; xfs_admin -l /dev/sdh1; xfs_admin -l /dev/sdi1; xfs_admin -l /dev/sdj1; xfs_admin -l /dev/sdk1; xfs_admin -l /dev/sdl1; xfs_admin -l /dev/sdm1; xfs_admin -l /dev/sdn1; xfs_admin -l /dev/sdo1; xfs_admin -l /dev/sdp1; xfs_admin -l /dev/sdq1; xfs_admin -l /dev/sdr1

2) If xfs_admin doesn't work --> yum install -y xfsprogs

Send the xfs_admin command again...

3) Create the directories : mkdir /srv/castor;for i in `seq 1 9`;do mkdir /srv/castor/0$i; done;for i in `seq 10 16`;do mkdir /srv/castor/$i; done

4) Mount the filesystem:

[root@lxfsre02b01 ~]# mount /dev/sdb1 /srv/castor/01; mount /dev/sdc1 /srv/castor/02; mount /dev/sdd1 /srv/castor/03; mount /dev/sde1 /srv/castor/04; mount /dev/sdf1 /srv/castor/05; mount /dev/sdg1 /srv/castor/06; mount /dev/sdh1 /srv/castor/07; mount /dev/sdi1 /srv/castor/08; mount /dev/sdj1 /srv/castor/09; mount /dev/sdk1 /srv/castor/10; mount /dev/sdl1 /srv/castor/11; mount /dev/sdm1 /srv/castor/12; mount /dev/sdn1 /srv/castor/13; mount /dev/sdo1 /srv/castor/14; mount /dev/sdp1 /srv/castor/15; mount /dev/sdq1 /srv/castor/16

5) Check if the Mountpoints are present in the castor.conf --> vim /etc/castor/castor.conf

If not, add the Mountpoints.

DiskManager MountPoints /srv/castor/01/ /srv/castor/02/ /srv/castor/03/ /srv/castor/04/ /srv/castor/05/ /srv/castor/06/ /srv/castor/07/ /srv/castor/08/ /srv/castor/09/ /srv/castor/10/ /srv/castor/11/

6) Restart the diskmanagerd service --> service diskmanagerd restart

7) Check if the mountpoints are present in the fstab --> less /etc/fstab

LABEL=castor01 /srv/castor/01 xfs defaults,logbsize=256k,logbufs=8,inode64,noatime,swalloc 1 3

LABEL=castor02 /srv/castor/02 xfs defaults,logbsize=256k,logbufs=8,inode64,noatime,swalloc 1 3

LABEL=castor03 /srv/castor/03 xfs defaults,logbsize=256k,logbufs=8,inode64,noatime,swalloc 1 3

LABEL=castor04 /srv/castor/04 xfs defaults,logbsize=256k,logbufs=8,inode64,noatime,swalloc 1 3

......

LABEL=castor09 /srv/castor/09 xfs defaults,logbsize=256k,logbufs=8,inode64,noatime,swalloc 1 3

LABEL=castor10 /srv/castor/10 xfs defaults,logbsize=256k,logbufs=8,inode64,noatime,swalloc 1 3

LABEL=castor11 /srv/castor/11 xfs defaults,logbsize=256k,logbufs=8,inode64,noatime,swalloc 1 3
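The long mount line in step 4 follows a fixed pattern (sdb1 goes to 01, sdc1 to 02, and so on up to sdq1 and 16), so it can be generated with a loop. This sketch only prints the commands; the device letters and the count of 16 are taken from the example above and may differ on other machines.

```shell
# Print (do not run) one mount command per data disk: sdb -> 01, ... sdq -> 16.
n=1
mount_plan=$(for letter in b c d e f g h i j k l m n o p q; do
    printf 'mount /dev/sd%s1 /srv/castor/%02d\n' "$letter" "$n"
    n=$((n + 1))
done)
echo "$mount_plan"
```

After checking the printed list against df and lsscsi, it can be executed with echo "$mount_plan" | sh.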

How to list all the files in the /srv/casto... with "find": find . /srv/castor/06 | grep srv | awk '{print "/afs/cern.ch/user/a/afiorot/public/migration_lhbct3.sh " $1}'

Installation failed for RAID Problems:

- If the installation failed because the RAID configuration was not recognized, a disk may be missing; check it with df and lsscsi:

- With lsscsi we see that there are 18 disks, but with df and printdiskserver we see only 17 disks.

[root@lxfsrf01c02 ~]# lsscsi

[0:0:20:0] enclosu LSI CORP SAS2X28 0717 -

[0:0:45:0] enclosu LSI CORP SAS2X36 0717 -

[0:2:0:0] disk LSI MR9261-8i 2.12 /dev/sda

[0:2:1:0] disk LSI MR9261-8i 2.12 /dev/sdb

[0:2:2:0] disk LSI MR9261-8i 2.12 /dev/sdc

......

[0:2:16:0] disk LSI MR9261-8i 2.12 /dev/sdq

[0:2:17:0] disk LSI MR9261-8i 2.12 /dev/sdr

[0:2:18:0] disk LSI MR9261-8i 2.12 /dev/sds

- So we proceed to create a new disk:

[root@lxfsrf01c02 ~]# df

Filesystem 1K-blocks Used Available Use% Mounted on

/dev/sda2 15481840 1557264 13138144 11% /

tmpfs 6093444 0 6093444 0% /dev/shm

/dev/sda1 1032088 92324 887336 10% /boot

/dev/sda3 15481840 169612 14525796 2% /tmp

/dev/sda6 2064208 3080 1956272 1% /usr/vice/cache

/dev/sda7 149653804 1014552 141037284 1% /var

/dev/sdb 2739604500 2602231868 137372632 95% /srv/castor/01

/dev/sdc 2928256020 2779146068 149109952 95% /srv/castor/02

......

/dev/sdq 2928256020 2781159984 147096036 95% /srv/castor/16

/dev/sdr 2928256020 2780221168 148034852 95% /srv/castor/17

/dev/sds 2928256020 32928 2928223092 1% /srv/castor/18 --> Disk created with -->> mkfs.xfs -L castor18 /dev/sds

How to check in which group a machine is in EOS:

- eos node ls

- eos fs ls -d p05614923v23967.cern.ch

- eos ns stat | grep -i drain

- eos fsck stat

- eos fsck report

- eos fsck stat

- eos node ls

- eos group ls

- eos group ls -l wigner.26

How to set ACLs:

[root@eospublic-srv-b1 ~]# eos attr -r set user.acl="u:afiorot:rwx" /eos/theory/project/abc

[root@eospublic-srv-b1 ~]# eos attr -r set user.acl="egroup:eos-admins:rwx" /eos/theory/project/abc

[root@eospublic-srv-b1 ~]# eos attr -r set user.acl="egroup:eos-admins:rwx,u:afiorot:r" /eos/theory/project/abc

How to stop the Roger-Listener:

- Check that the Roger listener is active --> ps auxf | grep roger

- Check the rpm:

rpm -qf /usr/sbin/roger-castor-listener

rpm -ql roger-castor-listener-0.0-4.el6.noarch

- Stop the daemon --> /etc/rc.d/init.d/roger-castor-listener stop

- Change the permission in order to block its execution --> chmod -x /etc/rc.d/init.d/roger-castor-listener

- Check that the service is now not allowed to execute:

[root@c2atlas-2 ~]# service roger-castor-listener status

env: /etc/init.d/roger-castor-listener: Permission denied

CASTOR log to EOS:

- Recall all the files from tape to disk:

[root@c2public-2 ~]# stager_get -f /afs/cern.ch/user/a/afiorot/LISTS/eosalice_feb_logs.txt

/castor/cern.ch/c3/eoslog/eosalice/eosalice-srv-b1.cern.ch/xrdlog.mgm-2016-02-04-1454562061.gz SUBREQUEST_READY

/castor/cern.ch/c3/eoslog/eosalice/eosalice-srv-b1.cern.ch/xrdlog.mgm-2016-02-09-1454976062.gz SUBREQUEST_READY

/castor/cern.ch/c3/eoslog/eosalice/eosalice-srv-b1.cern.ch/xrdlog.mgm-2016-02-09-1454990461.gz SUBREQUEST_READY

/castor/cern.ch/c3/eoslog/eosalice/eosalice-srv-b1.cern.ch/xrdlog.mgm-2016-02-09-1455026462.gz SUBREQUEST_READY

......

- Login to eospps (in the slave: p05508916e55509) as root

- Check if there is enough space in /var and create a folder to store the logs

- Copy the logs: xrdcp root://castorpublic.cern.ch//ca.../xrdlog.mgm-2015-12-10-1449784862.gz /var/log/eosalice/

- In order to transfer more logs:

[root@p05508916e55509 ~]# for i in `cat /afs/cern.ch/user/a/afiorot/LISTS/stagerget.txt`; do echo "xrdcp root://castorpublic.cern.ch//castor/cern.ch/c3/eoslog/eosalice/eosalice-srv-b1.cern.ch/"$i " /var/log/eosalice/"$i; done

EOS Permissions:

- The first level of permission are the one

- After the first level is applied, the permissions defined in the ACLs also apply

MCO:

- run --> ssh-agent

- Copy/paste the following lines:

SSH_AUTH_SOCK=/tmp/ssh-kLvWkF0anhRz/agent.11850; export SSH_AUTH_SOCK;

SSH_AGENT_PID=11851; export SSH_AGENT_PID; echo Agent pid 11851;

- check if there is any identity --> ssh-add -L

- if not, add it --> ssh-add

- check again --> ssh-add -L

- use MCO --> mco find -T eos

In case ssh on a virtual machine is not working recreate the key:

[afiorot@aiadm060 ~]$ ssh root@centos-samba2 -v

- cat .k5login

- cern-get-keytab --help

- cern-get-keytab --keytab /etc/krb5.keytab --force --verbose

- service sshd restart

Check how groups are mapped in the EOS installations:

[afiorot@aiadm070 ~]$ cat /afs/cern.ch/project/eos/installation/theory/etc/setup.sh

# source me
alias eos="/afs/cern.ch/project/eos/installation/0.3.84-aquamarine/bin/eos.select"
alias eosumount="/afs/cern.ch/project/eos/installation/0.3.84-aquamarine/bin/eos.select -b fuse umount"
alias eosmount="/afs/cern.ch/project/eos/installation/0.3.84-aquamarine/bin/eos.select -b fuse mount"
alias eosforceumount="killall eosfsd 2>/dev/null; killall -9 eosfsd 2>/dev/null; fusermount -u "

[afiorot@aiadm070 ~]$ cat /afs/cern.ch/project/eos/installation/0.3.84-aquamarine/bin/eos.select

#!/bin/bash
export EOSSYS=/afs/cern.ch/project/eos/installation/0.3.84-aquamarine
export EOS_TMP_MGM_URL=$EOS_MGM_URL
if [ "x$GROUP" = "x" ]; then
  GROUP=`id -gn`
fi

# map to cms or atlas or pps instance if not manually set
if [ "$GROUP" = "zp" ]; then
  export EOS_TMP_MGM_URL=${EOS_MGM_URL-"root://eosatlas.cern.ch"}
fi
if [ "$GROUP" = "zh" ]; then
  export EOS_TMP_MGM_URL=${EOS_MGM_URL-"root://eoscms.cern.ch"}
fi
if [ "$GROUP" = "c3" ]; then
  export EOS_TMP_MGM_URL=${EOS_MGM_URL-"root://eospps.cern.ch"}
fi
if [ "$GROUP" = "z5" ]; then
  export EOS_TMP_MGM_URL=${EOS_MGM_URL-"root://eoslhcb.cern.ch"}
fi
if [ "$GROUP" = "va" ]; then
  export EOS_TMP_MGM_URL=${EOS_MGM_URL-"root://eosams.cern.ch"}
fi
if [ "$GROUP" = "xv" ]; then
  export EOS_TMP_MGM_URL=${EOS_MGM_URL-"root://eosams.cern.ch"}
fi
if [ "$GROUP" = "xu" ]; then
  export EOS_TMP_MGM_URL=${EOS_MGM_URL-"root://eospublic.cern.ch"}
fi
if [ "$GROUP" = "vy" ]; then
  export EOS_TMP_MGM_URL=${EOS_MGM_URL-"root://eoscompass.cern.ch"}
fi
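To look up quickly which instance a group is sent to, the chain of if statements in eos.select can be condensed into a case statement. This is a sketch, not part of the EOS installation: the function name is hypothetical, the groups and URLs are the ones listed in the script above, and the EOS_MGM_URL override is ignored.

```shell
# Map a UNIX group to the default EOS instance URL listed in eos.select.
# Prints an empty line for a group that has no mapping.
instance_for_group() {
    case "$1" in
        zp)    echo "root://eosatlas.cern.ch" ;;
        zh)    echo "root://eoscms.cern.ch" ;;
        c3)    echo "root://eospps.cern.ch" ;;
        z5)    echo "root://eoslhcb.cern.ch" ;;
        va|xv) echo "root://eosams.cern.ch" ;;
        xu)    echo "root://eospublic.cern.ch" ;;
        vy)    echo "root://eoscompass.cern.ch" ;;
        *)     echo "" ;;
    esac
}

instance_for_group zh
```

To check your own group, run instance_for_group `id -gn`.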

Kernel Problem (if a machine does not reboot because of a kernel discrepancy)

If the kernel version in /boot/grub/grub.conf (the file the system boots from) is different from the one in /etc/grub.conf:

- Copy /etc/grub.conf to /boot/grub/grub.conf and reboot the machine

How to escape characters while using ssh: ssh -l root lxfsrd08c02.cern.ch "cat /boot/grub/grub.conf | grep SLC | tr '()' ' ' | awk '"'{print "Boot: " $6}'"' | head -1; cat /etc/grub.conf | grep SLC | tr '()' ' ' | awk '"'{print "Etc: " $6}'"' | head -1; uname -r | awk '"'{print "Current: " $1}'"'; rpm -qa | grep ^kernel-2 | head -1 | cut -c8-35 | awk '"'{print "Last RPM: "$1}'"'"
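The quoting used in the ssh command above can be tested locally without a remote host. In this sketch sh -c stands in for ssh, since both receive the command as one double-quoted string; the sample text "hello world" is made up.

```shell
# Inside the double quotes, the first ' is sent as a literal character.
# The inner " then closes the double-quoted string, '{...}' protects the awk
# program (and its $2) from local expansion, and the next " reopens the string
# so that the closing ' is sent as well.
remote_output=$(sh -c "echo hello world | awk '"'{print "Second: " $2}'"'")
echo "$remote_output"
```

The command that sh receives is: echo hello world | awk '{print "Second: " $2}'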

How to see RAID configuration:

- lsscsi -g

- hwraidman info

- history

- arcconf getconfig 1 al | less

- arcconf getconfig 1 al | grep Sta

- arcconf getconfig 1 al | less

- history

- /afs/cern.ch/project/sysadmin/tools/bin/adaptec_analyze_disks.sh

- arcconf getconfig 1 al

- arcconf getconfig 1

- arcconf getconfig 1 LD

- arcconf getconfig 1 PD

- storcli

How to check the location – rack – zone of a machine: landbGetInfo lxfsrd30c04.cern.ch

DeviceName = LXFSRD30C04

Location = 0513-R-0050

Zone = RD30

SerialNumber = CD1001523-6M004498TS

Manufacturer = SUPERMICRO

Status = ACTIVE

Model = SC847E16-R1400LPB

GenericType =

ResponsiblePerson = EOS-ADMINS E-GROUP IT-ST

LXFSRD30C04_IP0 = 128.142.21.177

LXFSRD30C04_MAC0 = 00-0E-1E-04-F3-40

LXFSRD30C04-IPMI_IP1 = 10.9.56.121

LXFSRD30C04-IPMI_MAC1 = 00-25-90-71-47-FB

eos cp:

Regarding eos cp, it is not working because the path you wrote is not considered a valid path:

- the slash at the beginning of the path has to be present

- since "test" is a directory and not a file, you have to put the slash also at the end

eos cp goofy.txt /eos/user/a/amereghe/test/

You can also use a safer way by specifying the destination server of your copy:

eos cp goofy.txt root://eosuser.cern.ch//eos/user/a/amereghe/test/

eosmount:

eosmount ~/eos/

ls -l ~/eos

eos ls -l /eos/

How to check how many times a machine has restarted/rebooted:

Feb 8 09:56:50 p05614923x01304 kernel: Clocksource tsc unstable (delta = -8589931255 ns). Enable clocksource failover by adding clocksource_failover kernel parameter.

Feb 8 10:06:04 p05614923x01304 kernel: imklog 5.8.10, log source = /proc/kmsg started.

[root@p05614920b83579 ~]# grep "Clocksource tsc unstable" /var/log/messages-20160207

Check also:

- cat /var/crash/

- ls -ltr /var/crash/

- cd /var/crash/127.0.0.1-2016-01-13-17:51:58/

- cat vmcore-dmesg.txt

- grep -B 1000 "/proc/kmsg started" /var/log/messages

- grep "clocksource" /var/log/messages

- grep "sda7" /var/log/messages

error: couldn't get meta data information → the file name was probably extracted (grepped) incorrectly; be CAREFUL with the spaces between words

Disk status → smartctl -a /dev/sda

How to deletediskcopy on several diskservers: for a in `cat /afs/cern.ch/user/a/afiorot/LISTS/repackmov.txt`; do for i in `seq 1 3`; do echo "deletediskcopy $a:/srv/castor/0$i/ &"; done; done

How to check if puppet is disabled:

[root@lxc2dev4d1 ~]# cat /var/lib/puppet/state/agent_disabled.lock

{"disabled_message":"Want to use the XROOT protocol"}

If the file is present puppet is disabled
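The check above can be wrapped in a small function. This is a sketch: the function name is made up, and the optional argument exists only so that it can be tried on another path.

```shell
# Report whether the Puppet agent is disabled by testing for the lock file.
# The optional argument overrides the default path.
puppet_disabled() {
    lock_file=${1:-/var/lib/puppet/state/agent_disabled.lock}
    [ -f "$lock_file" ]
}

if puppet_disabled; then
    echo "puppet is disabled:"
    cat /var/lib/puppet/state/agent_disabled.lock
else
    echo "puppet is enabled"
fi
```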

How to install EOS on a local machine:

1) login with your kerberos credential → kinit [username]

2) export EOS_MGM_URL="root://eos[instance].cern.ch"

3) source /afs/cern.ch/project/eos/installation/lhcb/etc/setup.sh

4) eos ls

5) xrdcp

- In case there are libraries missing, install them:

yum install readline compat-readline5

yum install openssl098e

If eoscockpit-quota is down and there is no monitoring available: check the status of thttpd in the machine, if it is stopped, restart it.

How to check if a machine has a 1Gb/s connection:

[root@lxfsrd55a04 ~]# ethtool eth0

Speed: 1000Mb/s

How to check if the machine is dropping packets:

[root@eoscmsftp01 ~]# ifconfig eth0

RX packets:2734588117064 errors:9 dropped:985822 overruns:0 frame:9

check in network.cern.ch
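To pull only the dropped counter out of the ifconfig output, the line can be split into one counter per line. This sketch works on the sample RX line shown above rather than on a live interface.

```shell
# Sample RX line copied from the ifconfig output above.
rx_line="RX packets:2734588117064 errors:9 dropped:985822 overruns:0 frame:9"

# Put each counter on its own line, keep "dropped", and take the value after ':'.
dropped=$(echo "$rx_line" | tr ' ' '\n' | grep '^dropped:' | cut -d: -f2)
echo "dropped packets: $dropped"
```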

How to check network cards (brands) in a cluster:

[afiorot@aiadm058 ~]$ wassh -z -l root -cl eos/alice/storage "lspci | egrep -i "'\(Ethernet\|InfiniBand\).*\(QLogic\|mellanox\|Chelsio\)'"" | egrep '^[a-z]' | cut -d':' -f1

How to retire a gridftp server:

It is enough to run → touch /etc/iss.nologin → in this way the machine is not visible anymore

To check how long a process has been open:

[root@eoscms-srv-m1 ~]# zcat /var/eos/report/2015/08/20150809.eosreport.gz | grep "rb=0" | tr '&' ' ' | egrep "ots|cts" | awk '{print $10, $12}' | tr '=' ' ' | awk '{print "echo `expr "$4 " - "$2 "` " " cts= " $4}' | sh | sort -n

How to find files in CASTOR nodes:

for i in `seq 1 9`; do echo "find /srv/castor/0$i/ | grep castorns | awk '{print \"/afs/cern.ch/user/a/afiorot/public/migration_lhbcdisk.sh \" \$1}' | sh &"; done

find /srv/castor/01/ | grep castorns | awk '{print "/afs/cern.ch/user/a/afiorot/public/migration_lhbcdisk.sh " $1}' | sh &

find /srv/castor/02/ | grep castorns | awk '{print "/afs/cern.ch/user/a/afiorot/public/migration_lhbcdisk.sh " $1}' | sh &

find /srv/castor/03/ | grep castorns | awk '{print "/afs/cern.ch/user/a/afiorot/public/migration_lhbcdisk.sh " $1}' | sh &

.....

find /srv/castor/08/ | grep castorns | awk '{print "/afs/cern.ch/user/a/afiorot/public/migration_lhbcdisk.sh " $1}' | sh &

find /srv/castor/09/ | grep castorns | awk '{print "/afs/cern.ch/user/a/afiorot/public/migration_lhbcdisk.sh " $1}' | sh &

EOS:

- ssh -l root eospps --> Login as Root, otherwise there are not enough permissions.

- eos node ls --> List of the machines (nodes) in the instance.

- eos node ls -l [hostname] --> List characteristics of a specific machine.

- eos ns --> List namespace details.

- eosadmin node ls -l $HOSTNAME --> to run inside a machine.

- eos --> Commands list.

- eos fs ls -l --> List filesystems.

- eos fs status [FSID] → to check how many files are not accessible or at risk!

- eos io stat -x --> List of draining ongoing/ application in the instance

- eos space ls --> get an overview of all the space allocated in an instance

EOS Quota:

- eos quota ls -p [path]

- eos group ls --> TO BE CHECKED ALSO, can be useful in case of "No space left on device" --> check for groups filled >90%

- eos attr ls [path]

- eos -r [uid] [gid] member [e-group]

- eos -r [uid] [gid] ls -l [path] → it takes the role of the user and tries to ls a file

- eos member [e-group]

- eos whoami

- eos ls -ld [directory_path] → when looking for permissions in a directory

- eos attr get sys.acl [directory_path]

- eos chmod 755 [path]

- eos ls -ld [path]

How to check if the Wigner link and network performance are fine: https://netstat.cern.ch/monitoring/network-statistics/ext/?p=EXT&q=CERN&mn=Wigner&t=Daily

How to check to which experiment a group belongs (from aiadm): ldapsearch -z 0 -E pr=1000/noprompt -LLL -x -h "xldap.cern.ch" -b "OU=Unix,OU=Workgroups,DC=cern,DC=ch" "(&(objectClass=group)(gidNumber=*))" samaccountname description | grep [groupname]

Example:

description: Linux group zh - CMS

sAMAccountName: zh

How to request EOS space, quota, or creation of a directory in:

ATLAS --> Put Tomas Kouba in copy in the ticket, or send an email to: [email protected]

CMS --> Re-assign the ticket to CMS EOS/CASTOR Support

Put Daniel Valbuena Sosa and cms-eos-interlocutors (Giovanni, Nicolo, Gianluca) in copy and reply to the ticket

How to create a home directory for a user:

eos mkdir /eos/theory/user/a/alioli

eos ls -ld /eos/theory/user/a/alioli

eos chown alioli:t3 /eos/theory/user/a/alioli

eos quota set -u 69459 -v 1T -i 50k -p /eos/theory/user/ → the quota has to be set on the QUOTANODE

eos quota ls -u alioli -p /eos/theory/user

In case the UID and the username of the users are not synchronized, run (it resets the caching for e-groups):

/usr/bin/eos space reset default --egroup

How to create multiple home directories: for i in a b c d e f ...; do eos mkdir /eos/.../user/${i}; eos chown root:[group] [path]; eos chmod 750 [path]; done
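Before running the loop above against a real instance, the commands can be printed and reviewed. In this sketch the base path and the group are hypothetical placeholders and only three letters are used.

```shell
# Hypothetical values for illustration; replace them with the real path and group.
base="/eos/example/user"
group="examplegroup"

# Print the three commands for each letter directory instead of running them.
home_plan=$(for i in a b c; do
    echo "eos mkdir $base/$i"
    echo "eos chown root:$group $base/$i"
    echo "eos chmod 750 $base/$i"
done)
echo "$home_plan"
```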

How to check if there are files in a machine without filesystems in EOS:

[root@lxsl4409 ~]# find /data* -type f | grep -v xsmap | grep -v scrub | grep -v eosfs | xargs ls -l | grep -v " daemon daemon 0 "

How to set quota to a group: eos quota set -g [group] -v 80TB -i 5M -p /eos/[group]/

EOS Mount on a LXPLUS machine:

How to check if eos is mounted in all lxplus machines:

~$ wassh -q -z --cl bi/inter/plus/login -- ps axu \| grep -v grep \| grep eos \| grep cylin

[afiorot@lxplus0056 ~]$ source /afs/cern.ch/project/eos/installation/user/etc/setup.sh

[afiorot@lxplus0056 ~]$ mkdir testeosmount

[afiorot@lxplus0056 ~]$ eosmount testeosmount

warning: assuming you gave a relative path with respect to current working directory => mountpoint=testeosmount

OK

===> Mountpoint : /afs/cern.ch/user/a/afiorot/testeosmount

===> Fuse-Options :

......

===> fuse write-cache : 1

===> fuse write-cache-size : 100000000

The destination directory MUST BE EMPTY otherwise the mount will not work!

How to delete an EOS directory:

If a user asks to delete their directory on EOS, check who owns the files inside that directory: eos find --uid [path]

Check errors in the instances:

1. List of errors --> eos fsck stat or report

2. eos fsck report -a -l --error [error_type]

3. eos node ls --sys

4. How to repair errors:

- Check the FSID of the error with -a

- Check if the FS is corrupted or not: if there is no correspondence between FSID and the errors continue

- Drain the FSID, remove it and umount it. After it reinstall the FSID with eos-filesystem...

5. Move filesystem from spare to default when the machines are filled over 80%:

- eos group ls | grep default | grep "9.\..." | awk '{print "eos fs mv spare " $2}' | sh

Various EOS Operations:

- Kill every eoscp process --> ps aux | grep -v grep | grep eoscp | awk '{print "kill -9 " $2}' | sh

- Check if wopen are 0 --> eos node ls --io [host] --> eos node ls --io | awk '{if($4!=0){print $0;}}'

- Update machine --> puppet agent --enable; puppet agent -tv

- Turn on alarms and put machine in production --> roger show --> roger update

- Check the eos version in the machine --> rpm -qa | grep eos

Operations on EOS Machines:

- MoveOutStorage --> Drain of the machine.

- eos node ls --> They must be "Drained" and "Empty".

- eos ns stat | grep -i drain --> Fail rate per minute

- eos ns stat | grep Drain; eos io stat -x

- eos space ls --> details of the instance.

How to check if in a machine there are no files:

- eos node ls --io [Hostname] --> If [used_files] is 0 the machine is empty.

- MoveInStorage --> Install new storage

- roger [show,update] [machine] --> Show or update the machine status.

- service eos status --> Check running processes.

- ping [options] [destination_host] --> Test a network connection.

To be run on the node (ssh -l root [machine]):

- lemon-host-check --> Alarm state management.

- puppet agent -tv --> run puppet.

- df --> check filesystem.

- df | grep -c data --> Show number of filesystem.

- lsscsi | grep sd | wc -l --> Number of filesystems that the machine must have. {Num = N – 1}.

- ps auxf --> report a snapshot of the current processes.

- tail [options]... [file]... --> Output the last part of files, print the last part of each file.

tail -f [file] --> Show current process.

ks-post-anaconda.log

anaconda-post.log

- fedacli --> Hardware diagnosis.

- cat /etc/motd --> Check errors in the machine

How to move filesystem from a group (default.??) to spare: eos node ls -l HOSTNAME | grep empty | grep default | awk '{print "eos fs mv "$3" spare"}' | sh

How to list and count all files recursively under an EOS directory (find is recursive by default):

eos find -f --size --checksum /eos/lhcb/lbdt3/user/ | wc -l

eos find --count /eos/atlas/atlascerngroupdisk/det-alfa/pcalfa02_backup

Filesystem Operations:

If there is a filesystem missing install it:

- eos-filesystem-setup.sh --dev /dev/[fs_name] --wipe --> Inside machine.

How to install all filesystem:

If there are no fs check if the installation is done:

- ll and check the date-time of the files and INSTALL_SUCCESS in anaconda.log

If everything is ok install them:

- /usr/bin/eos-hwconfiguration-setup.sh --force --wipe

If there are filesystem not registered check where is the failure in the script inside the machine:

- ls -lrt --> sort script by time.

- cat ks-script... & cat ks-post-anaconda.log & less puppet...log --> Check where is the problem.

And register them:

- /usr/bin/eos-register-all-filesystem-inside-eos-instance.sh

If there are filesystem in bootfailure or opserror and rw or drain DO NOT remove them because there are files:

- eosadmin fs ls -e | grep [hostname] --> errors --> check if /var is full

- Try to umount, re-mount and boot the filesystem to see if the error is fixed.

- sync

- dmesg

- check if there are FST or xfs error

- use eos-stop-gently.sh - - restart

- Set the node in RO, until there are NO writes/reads, and reboot it.

If there are DB filename, SQLITE DB and resync errors:

- LOGIN in the machine

- check which filesystem is in bootfailure and with SQL or DB problem:

- eosadmin fs ls -e $HOSTNAME

- ls -l /var/eos/md/fmd.[fsid].sql

- if present remove it → rm /var/eos/md/fmd.[fsid].sql

- reboot the FS

For multiple Fss: eosadmin fs ls -e $HOSTNAME | grep data; eosadmin fs ls -e $HOSTNAME | grep data | awk '{print "ls -l /var/eos/md/fmd."$2".sql"}' | sh

for i in `eosadmin fs ls -e $HOSTNAME | grep data | awk '{print $2}'`; do echo "eosadmin fs boot $i --syncmgm"; done

If there is a (fsid+uuid) error:

- ls -l /data[n]/.eos* --> Search which is the file with the error

- rm /data[n]/.eosfsuuid --> Remove the file in error

- eosadmin fs boot [fs_ID] --syncmgm --> Reboot the fs

If the FSID are already assigned, but is not possible to see the disks, check if there are .eosfsuuid in each disk and remove them, but only if the machine is EMPTY!

To remove all fsuuid error in a machine:

- for i in `seq -w 1 [n_data]`; do ls -l /data${i}/.eos*; done --> Find all files in error, then remove them.
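The same loop can be wrapped in a function that stays quiet on disks without marker files. This is a sketch only: the function name list_eos_markers and the optional root-prefix argument are hypothetical, and nothing is removed by it.

```shell
#!/bin/sh
# Hypothetical helper: list the .eos* marker files on every data disk.
# $1 = number of data disks
# $2 = optional root prefix (an assumption, handy for trying the function on
#      a scratch directory; leave it empty to scan /dataNN)
list_eos_markers() {
    n=$1
    root=${2:-}
    for i in $(seq -w 1 "$n"); do
        for f in "$root/data$i"/.eos*; do
            # Skip the unexpanded pattern when a disk has no marker file.
            [ -e "$f" ] && ls -l "$f"
        done
    done
    return 0
}

# Usage: list_eos_markers [n_data]
```

Only after checking the listing, and only on an EMPTY machine, remove the files by hand as described above.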

If it doesn't work, remove the fs and reinstall it:

- eos fs rm [ID_fs] --> From the headnode

- umount /data[num]

- eos -b fs ls | grep MACHINE_NAME | grep empty | awk '{print "eos fs rm " $3}' | sh --> From the headnode, remove multiple ID_FS

If the machine is down try to reboot it and ping it. If this doesn't work open a ticket:

- ai-remote-power-control cycle [hostname] --> From aiadm for Puppet machines.

If a machine does not get installed check that the aims2client options are well set and present:

- nostorage, driverload, etc.

- To set new options execute:

- ai-installhost --aims-kopts "wipe-disk nostorage driverload=isci" [hostname]

- Check if PXE is enabled on the machine --> aims2client showhost [hostname]

- If it is not enabled activate it with --> aims2client pxeon [hostname] [imgname]

- To see the available images --> aims2client showimg all

- Re-try the installation.

If the machine does not boot because it is stuck at PXE boot (pxe bug), set the machine to boot from disk (aiadm):

- ai-ipmi get-creds xxxxxxxxxx.cern.ch (xxxxx is the name of the machine)

- ipmitool chassis status -H xxxxxxxxxxxx-ipmi.cern.ch -U "xxxx" -P "xxxxxxxxxx"

- ipmitool chassis bootdev disk -H xxxxxxxxxxxx-ipmi.cern.ch -U "xxxxxx" -P "xxxx"

- ai-remote-power-control cycle hostname

- Turn on the alarms and put the machine back in production.

How to modify the Kickstart file when the system's disk is installed wrong:

- ai-foreman installhost [hostname]

- aims2client showks [hostname] > [hostname].ks

- vi [hostname].ks

Go to the partition table and add --ondrive=sd[n]; [n] should be the small one in lsscsi

part ...... --ondrive=sdy...

part ...... --ondrive=sdy...

- Save with :wq

- aims2client updateks [hostname] [hostname].ks

- Reboot the machine

How to repair a stalled draining (fs stuck instead of drained/empty):

1. eos fs ls -d | grep -v drained --> Show stalling or draining filesystems.

2. eos fs dumpmd [FS_ID] -path --> Show the paths on the FS.

3. eos [options] file check [path] --> Compare the size of the original and of the copies; a copy that differs from the original is broken.

- If there is “Error meta data” or “Unable to retrieve file”, try to repair the machine; otherwise try to recover the replica from the other machine with --> eos file info [path].

- If eos file check doesn't work, use eos -b file check, and use file info to find the problematic machine.

4. CAREFUL with --> eos file drop [path] [FS_ID] --> Permanently removes the broken copy ([FS_ID] is the broken FS_ID).

5. - eos file adjustreplica [path] --> Repair the broken copy

6. If there are more than 2 copies re-send the last command and re-check with eos file check..

7. If there are no broken copies try to check the checksum:

- eos file verify [path] -checksum --> Update checksum.

Use COMMITCHECKSUM ONLY when the checksum is 0000100000 and the size is different from 0:

- eos file verify [path] -checksum -commitchecksum --> When the checksum in the namespace is corrupted.

- eos file check [path] --> Verify if checksums are the same.

8. If size, statsize and checksum are correct:

- ssh -l root [host] "eos-adler32 [fstpath]" --> Check the conformity between Adler32 and checksum.

- If everything seems ok, try adjustreplica and see whether the error repairs itself.

9. If it is impossible to find a replica or something is wrong, investigate in the log --> eos file info [pathname]

- zgrep [FXID] /var/log/eos/fst/xrdlog.fst-[data].gz --> In the machine

- zgrep [FXID] /var/log/eos/mgm/error.log : msg="checksum mismatch disk/mgm vs memory" fid=09cc539f fsid=16863 checksum=8442ba29 diskchecksum=1128295b mgmchecksum=8442ba29

- In case the mgm and the file on disk have the same checksum, send a -commitchecksum

- in case there is a mismatch error between mgm/disk and memory, log in the machine:

- eos-adler32 [fstpath]

- eos-check-blockxs [fstpath]

- eos-compute-blockxs [fstpath]

- try to find the file in all the machines in the cluster: wassh -l root -cl eos/public/storage "ls -l /data??/00003f4c/09a8acf1*" 2>/dev/null

* /00003f4c/09a8acf1 --> can be found with --> eos file info [filepath] --fullpath

- if it's not possible to find anything and adjustreplica doesn't work, try to restart the fst: service eos restart fst

- if eos file check hangs and does not respond, it could be a monitoring deadlock --> check the number of threads in the machine --> eosadmin fs ls --sys $HOSTNAME --> if >1000 threads, restart the fst

- if there is an error like “Machine not in the network”, check that the machines holding the replicas have been set to readonly or readwrite.

- if it is not possible to read a replica, check for errors on the network interface with --> ifconfig

10. Run the draining again with --> eos fs config [FS_ID] configstatus=drain

- if the file is corrupted, try to copy it to another location --> dd if=fstpath of=/root/fxid bs=1 conv=noerror

- Check if the file exist with eos file check and info

- if eos file check remains hanging, the host could be overloaded:

- check the number of THREADS and ROPEN and WOPEN with:

- eos-node-ls.sh --sys

- eos-node-ls.sh --io

- Try to download it in /dev/null

- If the files are accessible ask when there were the problems and investigate on it.

- If there is a log:

- Find it and take the path where the error occurred

- Check when the error happened and investigate in the log: zless [path] /var/log/eos/(mgm or fst)/xrdlog.(mgm or fst).[date].gz

- Check the right time slot:

TID --> Transfer ID

TIDENT --> Transfer Identity = atnight.76724:800@aibuil018 = [user].[PID]:[port]@[dest. mach.]

- To get a logical reconstruction of the event (try also in the machine where the redirection happened):

- zgrep [LOGID] /var/log...

- grep [date] /var/log/messages

- If in the process found in the log there are: sec=unix uid=99 gid=99 --> the problem could be the Kerberos token (expired)
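When reading many log lines it can help to split the TIDENT into its parts. The sketch below uses only shell parameter expansion; the function name parse_tident is hypothetical, and the field labels simply follow the legend given above ([user] [PID] [port] [dest. mach.]).

```shell
#!/bin/sh
# Hypothetical helper: split a TIDENT string into its four parts.
# Field names follow the legend in this document, not an official definition.
parse_tident() {
    tident=$1
    user=${tident%%.*}      # text before the first "."
    rest=${tident#*.}
    pid=${rest%%:*}         # text between the first "." and the ":"
    rest=${rest#*:}
    port=${rest%%@*}        # text between ":" and "@"
    host=${rest#*@}         # text after "@"
    printf 'user=%s pid=%s port=%s host=%s\n' "$user" "$pid" "$port" "$host"
}

# Usage:
#   parse_tident 'atnight.76724:800@aibuil018'
```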

When a disk is broken and it is not possible to mount, or when you receive an error like:

[root@lxfsrf15c08 ~]# mount /data??

mount: special device LABEL=/data?? does not exist

- Check if the disk is present:

- lsscsi -g

- df

- lspci

- cat /proc/cpuinfo

- scsi-rescan

- history | grep cli

- storcli /c0 show all → with this command we can see that the related disk is in Failed mode:

34 - - - - RAID0 OfLn N 2.727 TB dflt N N none N

34 0 - - - RAID0 Dgrd N 2.727 TB dflt N N none N

34 0 0 45:23 44 DRIVE Failed N 2.727 TB dflt N N none -

…......

34/35 RAID0 OfLn RW No NRWTD - 2.727 TB 45:23

…......

44 Failed 34 2.727 TB SATA HDD N N 512B Hitachi HUA5C3030ALA640 U

- storcli /c0/e45/s23 show → to show the status of the disk (c0=controller 0, e45=EID, s23=Slot)

- storcli /c0/e45/s23 set online → set the disk ONLINE

- storcli /c0 show all → check that the disk is now online

- parted /dev/sd? print → TO CHECK - print info regarding the disk

- mount /data?? → Remount the disk

- Reboot the FS

What to do in case of XrdSec error:

XrdSec: No authentication protocols are available. error: errc=0 msg=""

- Check if puppet is enabled --> puppet agent -tv

- If not, enable it and execute it again --> puppet agent --enable --> puppet agent -tv

- Check if distro sync did some operation:

locate distro_sync

cat /var/log/distro_sync.log

In case a machine is created by mistake, remove the node with:

[root@eoscms-srv-b1 ~]# eos node ls | grep p05799459t97772

nodesview p05799459t97772.cern.ch:1095 0 10 120 2 49

nodesview p05799459t97772:1095 0 10 120 ~ 0

[root@eoscms-srv-b1 ~]#

[root@eoscms-srv-b1 ~]# eos node rm p05799459t97772:1095

success: removed node '/eos/p05799459t97772:1095/fst'

How to handle tickets:

- In case the problem is unknown, try to check other SNOW tickets for a solution

Swap_Full:

Check the machine's stats with:

- Foreman

- iostat

- Lemon

- uptime --> How long the system has been running.

- ps --> Process status.

- ps auxf or ps -ef --> Report a snapshot of the current processes.

- top --> Process viewer, find the CPU intensive programs currently running.

[SHIFT + m] --> Sort by memory used

f --> Add other fields, like swap

- free -m --> Display amount of free and used memory in the system.

- eosadmin node ls --io [hostname] --> Other information

- eosadmin node ls --sys [hostname] --> “ “

How to check the machines version in the headnode:

- ssh -l root [Instance] "eos -b node ls --sys" | awk '{print $8}' | sort | uniq -c

To close useless processes, use --> service lemon-agent restart

If xrootd daemon on diskservers is causing the alarm:

- Check if there are current writing process --> eosadmin node ls --io $HOSTNAME

- Ensures that everything in memory is written to disk --> sync

- Restart gently the eos fst service --> eos-stop-gently.sh --timeout 3600 ; service eos restart fst

- cat /etc/motd --> Check errors in the machine

- If the exception is on the PPS headnode and it is caused by xrootd, restart it: service eos restart mgm [ONLY WITH PPS]

- to get information from the headnode --> eos ns

Unmounted Filesystems:

- lemon-host-check --> check when it has been started --> ps auxf --> daemon

- /var/log/messages -->> grep -C5 Scalla /var/log/eos/fst/xrdlog..

- dmesg

- mount -a → mount everything listed in /etc/fstab; mount without options shows which FS are mounted

- lsscsi -g

- df --> df | grep -c data

- roger show

- puppet agent -tv --> If it needs to be updated --> puppet agent --enable; puppet agent -tv

- rpm -qa | grep eos

Root_FULL: Clean the machine: check the procedure.

tmp_full --> Check which files are occupying space, and in case are backup files, remove the oldest.

Afsd_Wrong:

- Check if AFS is working --> ls /afs/cern.ch/project/

- Check exception with --> lemon-host-check and if it is still there restart it --> service lemon-agent restart

exception.cmsd_wrong:

- Error about the Cluster Management Server --> check if is needed, and disable the exception.

VM_kill:

- Question: how do I find which process got killed by a VM_KILL exception, and do I need to restart it?

- This can be found in the system log:

[root@c2alice-1 ~]# grep 'Out of memory' /var/log/messages

Aug 1 07:31:04 c2alice-1 kernel: Out of memory: Kill process 25899 (transfermanager) score 971 or sacrifice child

- You don't need to restart it: Lemon does it for us.

/var_FULL:

- Check which subdirectory holds the biggest files to zip with: du -ahx /var | grep "[0-9][GT]" | sort -n

- Move all the unzipped logs in ls -l /var/log/eos/fst/ in another disk and zip them --> gzip -9 [path_file]

- Move the file to the original disk --> mv [path_file]

- If the var_FULL is caused by xrdlog.fst stop all services:

- puppet agent --disable

- lemon-host-check --disable-all

Before stopping EOS, check whether there are writes or reads ongoing:

- service eos stop

- Check that the fst is stopped:

- service eos status

- Check if there are other process running --> jobs

- Move and zip the xrdlog.fst, and put it back in /var. Check that the move is progressing with --> ls -l [path]/.fst

EXAMPLE:

[root@lxfsre03a02 ~]# xrdcp /var/log/eos/fst/xrdlog.fst-20150506 root://eospps//eos/afiorot/lxfsre03a02.xrdlog.fst-20150506

[xrootd] Total 4208.61 MB |======| 100.00 % [94.9 MB/s]

[root@lxfsre03a02 ~]# rm -f /var/log/eos/fst/xrdlog.fst-20150506

EXAMPLE2:

[root@lxfsre03a02 ~]# mv /var/log/eos/fst/xrdlog.fst /var/log/eos/fst/xrdlog.fst.verybig

[root@lxfsre03a02 ~]# service eos restart fst

Stopping xrootd: fst [ OK ]

Starting xrootd as fst with -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -b -Rdaemon

[ OK ]

[root@lxfsre03a02 ~]# xrdcp /var/log/eos/fst/xrdlog.fst.verybig root://eospps//eos/afiorot/lxfsre03a02.xrdlog.fst-20150507

[xrootd] Total 74239.11 MB |====>...... | 20.74 % [57.3 MB/s]

[root@lxfsre03a02 ~]# rm -f /var/log/eos/fst/xrdlog.fst.verybig
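To spot which logs are worth zipping or moving before /var fills up, a small find wrapper can list the large files that are not yet compressed. This is a sketch, not part of the procedure above: the function name big_unzipped_logs and the default threshold are assumptions, and it only lists files.

```shell
#!/bin/sh
# Hypothetical helper: list uncompressed files larger than a size threshold.
# $1 = directory to scan
# $2 = minimum size in KiB (default 1024, an arbitrary choice)
# Assumes a find that accepts the "k" size suffix (GNU and BSD find do).
big_unzipped_logs() {
    dir=$1
    min_kb=${2:-1024}
    # -xdev keeps the scan on the same filesystem as $dir.
    find "$dir" -xdev -type f ! -name '*.gz' -size +"${min_kb}"k -exec ls -l {} +
}

# Usage: big_unzipped_logs /var/log/eos/fst
```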

exception.nscd_wrong:

- Nscd (name service cache daemon) is a daemon that provides a cache for the most common name service requests.

- Check if the daemon is running --> ps -ef | grep nscd

- Restart the SSSD (System Security Services Daemon) and then the nscd service:

- service sssd restart; service nscd restart

If nscd does not restart, it could be because of a lock file --> check with service nscd status

- if it says nscd dead but subsys locked, the process no longer exists but the restart is blocked.

To restart it, delete the lock file and the pid file:

- rm /var/lock/subsys/nscd

- rm /var/run/nscd/nscd.pid and then restart the service

Smart_selftest:

- TO BE CHECKED

- Find the disk that has the problem, and reboot the fst --> sync; eos-stop-gently.sh --reboot

- Then try to reinstall the disk, only after has been DRAINED EMPTY --> eos-filesystem-setup.sh --dev /dev/sd[n]

Access problems:

- eos ls -l [path]

- eos attr ls [path]

- log in eospps and download the file –> xrdcp -f root://eos[instance]//eosfilepath test

- check if the download goes fine and start to debug

- check meter.cern.ch at the time when the problem occurred.

- sys.recycle="/eos/atlas/proc/recycle/"

- If a user receives a “permission denied” and cannot delete their files, resynchronize all the mappings:

- eos space reset default --egroup

- id [user]

CASTOR:

- checkreplicas.py [path] --> To check the state of a file and if it exist

- diskserver_qry [host.cern.ch] --> To see how many files there are in the machine

- printdiskserver [host.cern.ch] --> To see the status of a diskserver

- modifydiskserver -R (To mod also all the fs status) -S [status] [host.cern.ch] --> To modify the status of a machine

- draindiskserver -a --file-mask=all [host.cern.ch] --> To set a machine in draining

- draindiskserver -q --> To check the status of the draining

- printrecallstatus --> To check the recall status from tape

- blkid --> to get hardware information

GENERAL COMMANDS:

- draindiskserver -q -f lxfsrc40a06.cern.ch

- draindiskserver -q -f lxfsrc56a02.cern.ch | grep 'No source found' | awk '{print $3}' | grep castorns > lostfiles

- cat lostfiles | head -n 400 | awk '{ print "lxfsrc56a02.cern.ch:" $1}' | xargs deletediskcopy -d

- grep LVL=E /var/log/castor/nsd.log | cut -c1-15 | uniq -c

- listtransfers | awk '{print $3}' | sort | uniq -c | sort -nr

- grep username /var/log/castor/stagerd.log | cut -c1-14 | uniq -c

How to check if a user has privileges to access castor files:

[root@c2public-2 ~]# stager_listprivileges -S cdr -U agridin

ServiceClass User Group RequestType Status

compassuser * vy * Granted

[root@c2public-2 ~]# id agridin

uid=79029(agridin) gid=2766(def-cg) groups=2766(def-cg)

The user needs to be part of the group with privileges.

Check also with → nsgetacl [path]

Change permissions of a directory:

- nschmod [750] [path]

How to check:

- nsgetacl [path]

If there is some problem executing commands:

- login in all the headnodes (c2alicesrv101, c2alicesrv201, c2alicesrv301) and check if the command works

- if not, check the status and the log of the transfermanagerd in every headnode, and in case, restart it

- check also the status and the log of the diskmanagerd in the diskserver

High_load:

- Check lemon, and try to decipher why there is a high load. There is probably a process that has gone bananas.

- Check the load with --> top | grep load

- Check the numbers of cores in the machine --> grep "model name" /proc/cpuinfo | wc -l

- If top shows rfio processes without details, they are probably repack jobs

- repack -s --> check status of repack

- nslisttape --> List tapes

Swap_Full:

- Check which process is using swap and investigate it -->>> top → -F → order processes by SWAP used

- free -tm

- /etc/init.d/ --> list all services

- if it is caused by maemanager.log:

- kill all the maemanager.log processes of the PREVIOUS days --> kill -9 [PID]

- ps auxf | grep /usr/bin/maemanager.py | grep -v "...." | awk '{print "kill -9 " $2}'

- restart the current maemanager.log --> service simplevisor restart root/mae/mae-consumer

- In case the swap_full is caused by other kind of mae:

- service simplevisor check

- service simplevisor restart root/mae

- In case the swap_full is caused by hbase:

- service simplevisor restart root/hbase

Data_Full:

- Check that the disk is set with DISK1BEHAVIOR “false” --> printsvcclass

- If “TRUE” the garbage collector does not work

- Log Garbage collector --> gcd

Late Migration:

- Check if the machine is in Production/Readonly --> printdiskserver -f [hostname.cern.ch]

- Check if there is a Migration Job and its status --> checkreplicas [path] and printmigrationstatus

- Check showqueues and printtapepool (check how many drives a tape pool has, maybe they are not enough)

- Check if there is some 0 size file or some mismatch between:

- Name Server size --> checkreplicas [path]

- DiskServer size (log in the machine) --> ls -l [physical_path]

- If there is a mismatch error in the cockpit it is possible to force the size in the nameserver: nssetfsize -x [size] [path]

- Check if there are files with bad checksum:

- Nameserver-side checksum --> nsls --checksum [File_Path]

- Diskserver-side checksum (log in the host) --> xrdadler32 [Physical_Path]

- Disk meta-data checksum (log in the host) --> getfattr -d [Physical_Path]

- If yes, repair them with Xavier's script (see the procedure)

- If a migration job is lost because of an error, create it again --> migratenewcopy [file_ID]

- Check errors in the log --> grep "error" /var/log/castor/rfiod.log or less /var/log/castor/rfiod.log

- Check also --> grep [FileID] /var/log/castor/rfiod.log

Recall Job & Stager Problems:

- StageIN: The file is being recalled from tape.

- Staged: The file has been successfully staged from tape and it is available on disk.

- StageOUT: The file is being staged from client.

- If there is a StageIN request for a long time due to a not existing Recall Job:

- Check errors in the cockpit

- Do checkreplicas [path] and check the status of the Request

- If there is no RecallJob, create it --> stager_get -M [path_file]

- Check the status --> stager_qry -M [File_path]

- If the job is not created, try to create it again changing the serviceclass: stager_get -M [path_file] -S [Svcclass]

- To get information from the tape --> vmgrlisttape

Access Problems:

- How to try to read a file in Castor --> xrdcp -f root://castorcms.cern.ch//[path]?svcClass=[svcClass]

Check:

- less /var/log/messages

- less /var/log/xrootd/manager/xrootd.log

- To have a list of transfer --> listtransfer -s

- How to check privileges in a pool --> stager_listprivileges -S [serviceclass] -U [user] -G [group]

- How to retrieve information from a machine or file or ID --> stager_qry or printdiskserver

How to list all machine in a cluster:

- How to list all machines in Puppet and Quattor from aiadm --> wassh -c [cluster] --list or wassh -l root -cl eos/user/storage --list

- How to list all machines in Puppet from aiadm --> wassh -c [cluster] --ssm foreman --list

- How to list all machines in Quattor from aiadm --> wassh -c [cluster] --ssm cdbsql --list

- How to do a wassh of a list of machines → wassh -z -q -f /tmp/list-test.txt -l root 'uptime'

Raid_adaptec error:

- Log in the machine and check if the error is still there --> lemon-host-check

- Run hwraidman info to check if there is some degraded disk

- If yes, copy & paste the output and reassign the ticket to Sys Admin 2nd Line

Exception.nonwriteable_filesystems:

- lemon-host-check --> check if the exception is still there

- check the logs --> dmesg

- tail /var/log/messages

- tail -50 /var/log/castor/

/diskmanagerd.log

/stagerd.log

/rfio

/rfcp

- check the disk --> df or cat /etc/fstab or lsscsi or s2cli -f

- check if the disk is accessible and the permissions are ok --> ls -l /srv/castor/[N.fs]

- if it seems there are access problems, umount and mount the disk:

- umount /srv/castor/[N.fs]

- mount /srv/castor/[N.fs]

- restart the diskmanagerd and check if the machine is online --> service diskmanagerd restart

- printdiskserver -f [hostname] on the headnode

- hardware check --> hwraidman info

- If there is some error associated with a disk:

- umount /dev/sd[disk]

- xfs_check /dev/sd[disk] CHECK THE PROCEDURE

- mount /dev/sd[disk] or mount -a (mount all stuff from /etc/fstab)

- Check the software RAID status --> mdadm --detail --scan

- Check the RAID status → cat /proc/mdstat

Corrupted file in CASTOR:

/var/log/messages --> check what happened at the same time of the writing

/var/log /lemon... --> check if there were some lemon error in that time

How to check if the software RAID has the correct block size:

- xfs_info /dev/md116

= sunit=32 swidth=64 blks = sectsz=4096 sunit=1 blks, lazy-count=1

How to restart the NSD daemons in CASTOR transparently on all nameservers (memory leak):

- nslookup castorns → the answer is 4 IP addresses that identify the 4 CASTOR nameservers

- log in to the first IP → touch /etc/iss.nologin → in this way the machine is not visible anymore

- nslookup castorns → check that the first IP has disappeared and that the logs are no longer being updated → if YES, restart the NSD daemon → service nsd stop → service nsd start

- check that the processes are running and check also from aiadm that you receive an answer → CNS_HOST=c2central-1 nsls /

- rm /etc/iss.nologin

- nslookup castorns → check that the first IP is back again

- repeat the same procedure for all the nameservers, one at a time

- check in the logs that everything is fine: nsd, dmesg, messages

How to kill PENDING transfers in CASTOR:

List the transfers and dump them in a file --> listtransfers | grep [host] | grep PEND > lxfsrf01c03.cern.ch.xfers

Split the file into pieces of at most 500 transfers (faster and lighter) --> split -l 500 lxfsrf01c03.cern.ch.xfers

Check if the multiple files are there --> ls -ltr

List them --> for i in `ls x*`; do echo $i; done

Kill all the transfers --> for i in `ls x*`; do cat $i|xargs killtransfers; done
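The split-and-loop steps above can also be sketched as a single function that hands the dump to a command in batches of lines, without creating the intermediate x* files. The function name run_in_batches is hypothetical; with no command given it defaults to a dry run that only echoes what would be executed.

```shell
#!/bin/sh
# Hypothetical helper: feed the lines of a file to a command in batches.
# $1 = file with one transfer per line
# $2 = number of lines per batch
# remaining arguments = command to run (default: dry run with echo)
run_in_batches() {
    file=$1
    size=$2
    shift 2
    [ "$#" -eq 0 ] && set -- echo killtransfers
    # -L groups by input lines, mirroring "split -l" followed by xargs.
    xargs -L "$size" "$@" < "$file"
}

# Dry run:  run_in_batches lxfsrf01c03.cern.ch.xfers 500
# For real: run_in_batches lxfsrf01c03.cern.ch.xfers 500 killtransfers
```

Running the dry run first shows how many invocations will be made and with which arguments.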

How to see in which serviceclasses a file is staged: stager_qry -M all:/castor/cern.ch/compass/data/2014/raw/T06/cdr13057-254708.raw

/castor/cern.ch/compass/data/2014/raw/T06/cdr13057-254708.raw 1389626239@castorns:compassuser STAGEIN

/castor/cern.ch/compass/data/2014/raw/T06/cdr13057-254708.raw 1389626239@castorns:compasscdr STAGED

/castor/cern.ch/compass/data/2014/raw/T06/cdr13057-254708.raw 1389626239@castorns:compassuser STAGEIN

How to create a directory for a user:

- nsmkdir /[path]/castor/user/r/rain

- check fileclass: (put the same fileclass of other user in the same group)

- nsls -d --class [path]

- Set permissions --> nschown user:group [path]

- Check permissions --> nsls -ld [path]

When a draining fails in CASTOR:

- check time slot of all of them

- check if the client received an error

- check errors

- if the filesize is 0 in the namespace and in the diskcopy, do --> deletediskcopy

When the draining gets stuck:

- Lots of D2Dsrc transfers but no D2Dsrc running:

1) service transfermanagerd stop

2) log in to the diskserver --> service diskmanagerd restart

3) service transfermanagerd start

- repeat this procedure for all the headnodes (try to do it as fast as you can)

How to check if a user is set in the gridmap file:

- ssh -l root eos[exp] "grep [username] /etc/grid-security/grid-mapfile"

How to monitor a value in all the instances:

- Create a script with the command to execute

- Execute the script in all instances with wassh: wassh -l root instance1,instance3 "/afs/cern.ch/user/a/afiorot/public/monitor-threads.sh"

/afs/cern.ch/user/a/afiorot/public/monitor-threads.sh --> eos node ls --sys | awk '$7>315'

- Use watch to re-run it at a fixed interval and show the output: watch -n 120 ssh -l root eoscms "/afs/cern.ch/user/a/afiorot/public/monitor-threads.sh"
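A slightly more defensive version of the filter in monitor-threads.sh can be sketched as follows. In plain awk a non-numeric field such as a column header is compared as a string and may slip through the test; adding 0 forces a numeric comparison. The function name rows_over_limit is hypothetical, and the column and threshold are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: print only the rows whose value in a given column
# exceeds a limit. Adding 0 forces a numeric comparison, so header or
# separator rows (non-numeric in that column) are dropped.
# $1 = column number, $2 = limit
rows_over_limit() {
    awk -v col="$1" -v limit="$2" '($col + 0) > (limit + 0)'
}

# Usage, with the column and threshold used above:
#   eos node ls --sys | rows_over_limit 7 315
```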

What to do if a transfer doesn't start:

- try to copy the file locally and check the problem:

- RFIO_TRACE=7 STAGE_HOST=castorpublic STAGE_SVCCLASS=default rfcp /castor/cern.ch/...book .

- xrdcp -d3 root://castorpublic//castor/cern...ook . -OssvcClass=default

- In case there is a TCP problem, could be due to an outdated config file → INC0677267 and INC1009833

- login the diskserver → cat /etc/xrd.cf.server

#------

# Security plugins

#------

xrootd.seclib /usr/lib64/libXrdSec.so

#sec.protocol /usr/lib64 unix #TPC was broken due to this

In this case we received an error: full queue or no response, so we checked the listtransfer of the pool and we saw that everything was blocked. After killing all the jobs and restarting the rfio daemon, everything went fine.

INTERVENTIONS:

Be careful when doing operations on the HEADNODE (do NOT use find, xrdcp, hostname, etc.)

What to do when an instance goes down:

- Search in the mail for the same subject/problem

- Have a look for existing solved ticket in SNOW regarding the same issue

- Search in Jira for the same issue

- Have a look at the previous outages

- Check the status and have a look in the logs:

- /var/log/eos/mgm/xrdlog

- dmesg

- /var/log/messages

- /var/log/eos/mgm/error

- ping

- netstat

- eos stat

- eos ns

- check the last SLS log

- check the status of the machine and activities on Grafana, SLS or Meter

- Log in the headnode:

- eos ns --> check the overall situation

- eos node ls --> if the heartbeatdelta is low it means that headnode and diskserver are communicating

- In case the monitoring is red, but the headnode is up, log in eosmon01 and send:

[root@eosmon01 ~]# SRM2_ENDPOINTS=eos[instance] /usr/sbin/EOS-sls-probe.sh eosmon01 --> history --> execute the test script, if it goes fine it was a glitch.

- if it is a space problem --> check quota

- grep eosmon /var/log/eos/mgm/xrdlog.mgm --> look for the machine where the test has been executed

- check for the func=OPEN and the func=CLOSE. Check how much time the CLOSE function needed to close effectively, could be that the probe timed out.
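The time a close took can be computed from the time= fields of the first and last func=close lines that share a logid, as in the log excerpt that follows. This is a minimal sketch: the function name close_duration is hypothetical, and it assumes the key=value layout seen in the FST log lines of this document.

```shell
#!/bin/sh
# Hypothetical helper: for one logid, print the seconds elapsed between the
# first and the last func=close line, using the time=<epoch> field.
# Reads FST log lines on stdin; $1 = logid.
close_duration() {
    awk -v id="logid=$1" '
        index($0, id) && /func=close/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^time=/) { t = substr($i, 6) + 0 }
            if (!seen) { first = t; seen = 1 }
            last = t
        }
        END { if (seen) printf "%.1f\n", last - first }
    '
}

# Usage: zcat /var/log/eos/fst/xrdlog.fst-[date].gz | close_duration [LOGID]
```

A result of a few minutes, compared with the probe timeout, suggests the alarm was a transient rather than a real outage.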

Transient. At least one of the probes hit a diskserver that took 3 minutes to process the "close" due to the verifychecksum (and the probe timed out):

150722 10:16:30 time=1437552990.466098 func=close level=INFO logid=c2c2c586-3049-11e5-899b-c860001bd93a [email protected]:1095 tid=00007fe6c61fc700 source=XrdFstOfsFile:1715 tident=root.3632:508@eosmon02 sec= uid=18118 gid=2688 name=nobody geo="" calling verifychecksum

150722 10:16:30 time=1437552990.466126 func=verifychecksum level=INFO logid=c2c2c586-3049-11e5-899b-c860001bd93a [email protected]:1095 tid=00007fe6c61fc700 source=XrdFstOfsFile:1480 tident=root.3632:508@eosmon02 sec= uid=18118 gid=2688 name=nobody geo="" (write) checksum type: adler checksum hex: 50920ce2 requested-checksum hex: -none-

150722 10:19:57 time=1437553197.519257 func=close level=INFO logid=c2c2c586-3049-11e5-899b-c860001bd93a [email protected]:1095 tid=00007fe6c61fc700 source=XrdFstOfsFile:2267 tident=root.3632:508@eosmon02 sec= uid=18118 gid=2688 name=nobody geo="" Return code rc=0.

If there is some puppet error for missing dependencies:

df

puppet agent -tv

lemon-host-check

yum update

cat /etc/yum-puppet.repos.d/castor.repo

rpm -qa | grep castor

tail /var/log/messages

yum clean all

yum update -y

rpm -qa | grep castor

castor -v

xrootd -version

cat /etc/yum-puppet.repos.d/xroot-stable.repo

cat /etc/puppet/puppet.conf

puppet agent -tv

xrootd -version

yum clean all

yum update

cat /etc/yum-puppet.repos.d/xroot-stable.repo

rpm -qa | grep xroot

yum install xrootd-client

uname -a

yum search xrootd-client

cat /etc/yum-puppet.repos.d/xroot-stable.repo

yum install xrootd

locate versionlock

cat /etc/yum/pluginconf.d/versionlock.list

rm /etc/yum/pluginconf.d/versionlock.list; touch /etc/yum/pluginconf.d/versionlock.list

yum update -y

yum install xrootd xrootd-client xrootd-server xrootd-debuginfo

puppet agent -tv

INVESTIGATION LOST OR BROKEN FILE:

In this case the file had only one replica, which was not accessible because, as the logs show, it had been deleted a few days before. The problem was that the deletion had not been synchronized with the system.

CREATE shows the moment when the file and the replica were created.

1) Do eos file check and info on the problematic file:

[root@eosalice-srv-b1 ~]# eos file info /eos/alice/grid/15/37779/4b79c22a-c835-11e5-bec9-6b6d927ddb57 --fullpath

File: '/eos/alice/grid/15/37779/4b79c22a-c835-11e5-bec9-6b6d927ddb57' Flags: 0664

Size: 3775514

Modify: Sun Jan 31 18:58:04 2016 Timestamp: 1454263084.687497000

Change: Sun Jan 31 18:57:57 2016 Timestamp: 1454263077.433517157

CUid: 10367 CGid: 1395 Fxid: 238e4f98 Fid: 596529048 Pid: 18653 Pxid: 000048dd

XStype: adler XS: 94 a4 57 7f ETAG: 160129547017125888:94a4577f

replica Stripes: 2 Blocksize: 4k LayoutId: 00600112

#Rep: 1

# fs-id # host # schedgroup # path # boot # configstatus # drain # active # geotag

0 8165 lxfsrk60c01.cern.ch default.17 /data11 booted ro nodrain online 0513

/data11/0000e904/238e4f98

*******

2) Start to investigate in the headnode and in the machine where is supposed to be the replica:

- grep the FXID in the xrdlog of the FST

REPLICA 1:

[root@lxfsrk60c01 ~]#

[root@lxfsrk60c01 ~]# zgrep 238e4f98 /var/log/eos/fst/xrdlog.fst-2016020*

/var/log/eos/fst/xrdlog.fst-20160201.gz:160131 18:57:57 time=1454263077.614931 func=open level=INFO logid=28e888e0-c844-11e5-a744-0cc47a691028 [email protected]:1095 tid=00007f3db52fa700 .....

/var/log/eos/fst/xrdlog.fst-20160201.gz:160131 18:57:57 time=1454263077.615103 func=open level=INFO logid=28e888e0-c844-11e5-a744-0cc47a691028 [email protected]:1095 tid=00007f3db52fa700 source=XrdFstOfsFile:534 tident=6513.13273:[email protected] sec=unix uid=0 gid=0 name=alise geo="" capability=&mgm.access=create&mgm.ruid=10367&mgm.rgid=1395&mgm.uid=10367&mgm.gid=1395&mgm.path=/eos/alice/grid/15/37779/4b79c22a-c835-11e5-bec9-6b6d927ddb57&mgm.manager=eosalice-srv-b1.cern.ch:1094&mgm.fid=238e4f98&mgm.cid=18653&mgm.sec=unix|alise|lxaliproxy1.gsi.de||asteg|||&mgm.lid=6291730&mgm.bookingsize=3775514&mgm.fsid=8165&mgm.url0=root://lxfsrk60c01.cern.ch:1095//&mgm.fsid0=8165&mgm.url1=root://lxfsrk36c01.cern.ch:1095//&mgm.fsid1=8710&cap.valid=1454266677

/var/log/eos/fst/xrdlog.fst-20160201.gz:160131 18:57:57 time=1454263077.615177 func=open level=INFO ......

/var/log/eos/fst/xrdlog.fst-20160208.gz:160208 00:08:45 time=1454886525.195531 func=_rem level=INFO logid=0825f7fe-9ccb-11e5-91ef-002590643ab8 [email protected]:1095 tid=00007f3dd31fd700 source=XrdFstOfs:1132 tident= sec= uid=0 gid=0 name= geo="" fstpath=/data11/0000e904/238e4f98

REPLICA 2:

[root@lxfsrk36c01 ~]# zgrep 238e4f98 /var/log/eos/fst/xrdlog.fst-2016020*

/var/log/eos/fst/xrdlog.fst-20160201.gz:160131 18:57:57 time=1454263077.709528 func=open level=INFO [email protected]:1095 tid=00007f63269fd700 source=XrdFstOfsFile:534 tident=daemon.3998:392@lxfsrk60c01 sec=sss uid=0 gid=0 name=daemon geo="" capability=&mgm.access=create&mgm.ruid=10367&mgm.rgid=1395&mgm.uid=10367&mgm.gid=1395&mgm.path=/eos/alice/grid/15/37779/4b79c22a-c835-11e5-bec9-6b6d927ddb57&mgm.manager=eosalice-srv-b1.cern.ch:1094&mgm.fid=238e4f98&mgm.cid=18653&mgm.sec=unix|alise|lxaliproxy1.gsi.de||asteg|||&mgm.lid=6291730&mgm.bookingsize=3775514&mgm.fsid=8165&mgm.url0=root://lxfsrk60c01.cern.ch:1095//&mgm.fsid0=8165&mgm.url1=root://lxfsrk36c01.cern.ch:1095//&mgm.fsid1=8710&cap.valid=1454266677

/var/log/eos/fst/xrdlog.fst-20160201.gz:160131 18:57:57 time=1454263077.727593 func=open level=INFO ...... 4263144.041599 func=MgmSyncer level=INFO logid=static...... [email protected]:1095 tid=00007f63441f9700 source=MgmSyncer:84 tident= sec=(null) uid=0 gid=0 name=- geo="" fid=238e4f98 mtime=1454263077

/var/log/eos/fst/xrdlog.fst-20160208.gz:160208 00:10:04 time=1454886604.975438 func=_rem level=INFO logid=529841b4-78a1-11e5-b755-0025903c4212 [email protected]:1095 tid=00007f63455fd700 source=XrdFstOfs:1132 tident= sec= uid=0 gid=0 name= geo="" fstpath=/data02/0000e904/238e4f98

HEADNODE:

[root@eosalice-srv-b1 ~]# zgrep 15/37779/4b79c22a-c835-11e5-bec9-6b6d927ddb57 /var/log/eos/mgm/xrdlog.mgm-2016-02-08-1454*

/var/log/eos/mgm/xrdlog.mgm-2016-02-08-1454896861.gz:160208 02:10:03 time=1454893803.394384 func=_rem level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx [email protected]:1094 tid=00007f45e4410700 source=Rm:112 tident= sec=unix uid=10367 gid=1395 name=alienmaster geo="" ....

/var/log/eos/mgm/xrdlog.mgm-2016-02-08-1454896861.gz:160208 02:44:57 time=1454895897.945189 func=Emsg level=ERROR logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx [email protected]:1094 tid=00007f45d7e6e700 source=XrdMgmOfs:543 tident= sec=(null) uid=0 gid=0 name=(null) geo="" Unable to remove /eos/alice/grid/15/37779/4b79c22a-c835-11e5-bec9-6b6d927ddb57; No such file or directory
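When comparing the records from the two replicas and the headnode, it can help to pull single key=value fields (func, time, fstpath) out of the matching lines. The following is only a sketch: the helper name eos_log_field is made up for this note, and it assumes the space-separated key=value layout shown in the log lines above.

```shell
# eos_log_field KEY: read EOS log lines on stdin and print, for each line,
# the value of the first whitespace-separated token of the form KEY=value.
# Lines without such a token print nothing.
eos_log_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            # Only whole tokens count, so "uid" does not match the
            # "&mgm.ruid=" fragment inside the capability string.
            if (index($i, key "=") == 1) {
                print substr($i, length(key) + 2)
                break
            }
        }
    }'
}
```

Possible usage: zgrep 238e4f98 /var/log/eos/fst/xrdlog.fst-2016020* | eos_log_field func, which lists the operations (open, _rem, ...) recorded for that file ID in order.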

CASTOR-SRM:

- service srmfed status

- service srmfed stop; service srmbed stop; service srmbed start; service srmfed start

- grep "LVL=E" /var/log/castor/srmfed.log

To execute the probe test:

Log in --> root@castormon01

SRM2_ENDPOINTS=[instance] srmtimeout=240 conntimeout=120 sndtimeout=120 /usr/sbin/SRM-sls-probe.sh

In case of SRM failure:

- check the logs --> /var/log/castor/srmfed.log (Front End Daemon)

/var/log/castor/srmbed.log (Back End Daemon)

- login --> srm-public

- c2publicsrm1(or 2)

- Check for errors.

In case CASTOR instance is down:

- tail -f /var/log/castor/transfermanagerd.log

- grep LVL=E /var/log/castor/rhd.log (Request Handler Daemon)

- grep LVL=E /var/log/castor/nsd.log (Name Server Daemon)

- grep LVL=E /var/log/castor/stagerd.log

- service c2probe status

- listtransfers -px or listtransfers -sx [pool]

- listtransfers | awk '{print $3}' | sort | uniq -c | sort -nr --> how to find a user that is hammering the system

- Check whether a user of one pool is doing lots of transfers to another pool, for example requests from NTOF to Default; this can overload the system.
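The per-user count above can be wrapped in a small function so that it can be reused or tried on saved output. This is a sketch only: the name top_transfer_users is made up here, and it assumes, as the pipeline above does, that the third column of the listtransfers output holds the user.

```shell
# top_transfer_users: read listtransfers-style output on stdin and print
# "count user" pairs, busiest user first.
# Assumption: column 3 is the user name.
top_transfer_users() {
    awk 'NF >= 3 { print $3 }' |   # keep only the user column
        sort | uniq -c |           # count identical users
        sort -nr |                 # highest count first
        awk '{ print $1, $2 }'     # normalise the spacing of uniq -c
}
```

Possible usage: listtransfers | top_transfer_users | head -5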

EOS SRM problems:

- check logs:

tail -f /var/log/bestman2/bestman2.log

tail -f /var/log/bestman2/event.srm.log

- restart also eosd: service bestman2 status; service bestman2 stop; service eosd restart; service bestman2 restart

Check the status of the daemon:

[root@srm-eoslhcb02 ~]# /etc/init.d/bestman2 status

bestman2 (pid 4352) is running...

1) Check whether zombie processes are piling up (ps -el | grep 'Z'); if they are, restart the daemon

2) If the logs contain a java.io.IOException, restart the daemon

[root@srm-eoslhcb02 ~]# /etc/init.d/bestman2 restart

at org.eclipse..util..QueuedThreadPool$3.run(QueuedThreadPool.java:538) at java.lang.Thread.run(Thread.java:745)

Caused by: java.io.IOException: error=11, Resource temporarily unavailable at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)

... 53 more

[root@srm-eoslhcb02 ~]# /etc/init.d/bestman2 restart

Shutting down bestman2: [ OK ]

Starting bestman2: [ OK ]

[root@srm-eoslhcb02 ~]# tail -f /var/log/bestman2/bestman2.log

BeStMan-Jetty is ready.

### [srmPing] tid =qtp1346242466-38

=> incoming [srmPutDone()]

### [srmPutDone()] tid =qtp1346242466-67
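The two restart criteria given for bestman2 (zombie processes piling up, java.io.IOException in the log) can be checked with two small helpers. This is a sketch under assumptions: the function names are invented for this note, and the threshold at which zombies count as "piling up" is left to the operator.

```shell
# count_zombies: read the output of `ps -eo stat=` on stdin and print the
# number of processes whose state starts with Z (zombie).
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# has_java_io_exception FILE: succeed if FILE mentions java.io.IOException.
has_java_io_exception() {
    grep -q 'java\.io\.IOException' "$1"
}
```

Possible usage: ps -eo stat= | count_zombies, and has_java_io_exception /var/log/bestman2/bestman2.log && echo "restart bestman2".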

In case of space or monitoring problems: INC1063771

[root@srm-eosatlas02 ~]# df

eosmain 32320245237552 24477109779792 7843135457760 0% /eos

- Check the fuse mount logs:

[root@srm-eosatlas02 ~]# tail -f /var/log/eos/fuse/fuse.main.log

- A restart of EOSD in this case solved the problem:

[root@srm-eosatlas02 ~]# df

eosmain 32320245237552 24470740836864 7849504400688 76% /eos
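The symptom in this incident is visible in the df lines themselves: before the restart the Use% column says 0% although used/(used+available) is about 76%. A minimal sketch of a check for that mismatch follows; the function name df_use_consistent and the one-point tolerance are assumptions made for this note.

```shell
# df_use_consistent: read ONE df data line on stdin
# (filesystem, size, used, available, Use%, mount point) and succeed only
# if the reported Use% is within one point of used * 100 / (used + available).
df_use_consistent() {
    awk '{
        reported = $5
        sub(/%/, "", reported)              # "76%" -> "76"
        expected = $3 * 100 / ($3 + $4)     # percentage computed from the columns
        diff = reported - expected
        if (diff < 0) diff = -diff
        exit ((diff <= 1) ? 0 : 1)
    }'
}
```

Possible usage: df /eos | tail -n 1 | df_use_consistent || echo "fuse mount reports inconsistent usage, check eosd".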

How to implement the SSO on EOSCOCKPIT Machine: https://eoscockpit-quota.cern.ch/

Follow the guides:

- https://gitlab.cern.ch/ai/it-puppet-module-shibboleth/blob/qa/code/README.md

- http://linux.web.cern.ch/linux/scientific6/docs/shibboleth.shtml

- https://espace.cern.ch/authentication/CERN%20Authentication/Configure%20a%20Shibboleth%20Application.aspx

- https://espace.cern.ch/authentication/CERN%20Authentication/Home.aspx

Managing SSO: https://sso-management.web.cern.ch/SSO/ListSSOApplications.aspx

SSO still needs to be enabled on the machine

1) Installed packages: yum install shibboleth log4shib xmltooling-schemas opensaml-schemas

2) Changed Selinux permissions from "SELINUX=enforcing" to "SELINUX=permissive"

3) Enable automatic startup of shibboleth daemon --> /sbin/chkconfig --levels 345 shibd on

4) Copy the config files in /etc/shibboleth/

5) Edit /etc/shibboleth/shibboleth2.xml and add -->

6) Replace ALL 5 occurrences of somehost.cern.ch with your system hostname

7) Configure the /etc/httpd/conf.d/shib.conf

8) Install httpd and its SSL module --> yum install httpd mod_ssl

9) Comment out everything in the following file --> /etc/httpd/conf.d/shib.conf

10) Request a new certificate using OpenSSL (for Linux machines): https://ca.cern.ch/ca/host/HostSelection.aspx?template=ee2host&instructions=openssl

11) Create a new certificate and key and configure them in /etc/httpd/conf.d/ssl.conf

12) Check the iptables --> iptables -vL -n

13) Add the following line to /etc/sysconfig/iptables in order to open port 443 (HTTPS) in the firewall:

-A INPUT -p tcp -m multiport --ports 443 -m comment --comment "110 HTTPS" -j ACCEPT

14) service iptables restart

The remaining problem is that many links in the different pages still refer to http addresses instead of https, so those objects are blocked by the browser.

[root@eoscockpit-quota quotas]# grep -ri eoscockpit-quota .

15) Resolve the Security Issues:

Please resolve the following issues:

1) http TRACE XSS attack:

How: With Apache 1.3.34 or later, or 2.0.55 or later, it is enough to add the following line to /etc/httpd/conf/httpd.conf

TraceEnable off

2) Check for SSL Weak Ciphers: /etc/httpd/conf.d/ssl.conf

SSLProtocol all -SSLv2 -SSLv3

SSLCipherSuite ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS

SSLHonorCipherOrder on

3) Deprecated SSLv2 and SSLv3 Protocol Detection

16) Test if the security issues are solved:

- openssl s_client -connect eoscockpit-quota.cern.ch:443 -ssl2

- openssl s_client -connect eoscockpit-quota.cern.ch:443 -ssl3

- openssl s_client -connect eoscockpit-quota.cern.ch:443 -tls1

- curl -X TRACE eoscockpit-quota.cern.ch
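Besides probing the running server, the configuration files can be checked for the three settings discussed above (TraceEnable off, SSLv2 and SSLv3 removed from SSLProtocol). The following is a minimal sketch; the function name check_httpd_hardening and the wording of its messages are made up for this note, and it only looks at uncommented directives, not at Include files or per-VirtualHost overrides.

```shell
# check_httpd_hardening: read Apache configuration text on stdin and print
# one line per missing setting; prints nothing when all three are present.
check_httpd_hardening() {
    awk '
        # Commented lines start with "#..." in $1 and therefore never match.
        tolower($1) == "traceenable" && tolower($2) == "off" { trace = 1 }
        tolower($1) == "sslprotocol" {
            for (i = 2; i <= NF; i++) {
                if ($i == "-SSLv2") v2 = 1
                if ($i == "-SSLv3") v3 = 1
            }
        }
        END {
            if (!trace) print "TraceEnable off is missing"
            if (!v2)    print "SSLv2 is not disabled"
            if (!v3)    print "SSLv3 is not disabled"
        }'
}
```

Possible usage: cat /etc/httpd/conf/httpd.conf /etc/httpd/conf.d/ssl.conf | check_httpd_hardening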

Restart the services:

/sbin/service shibd restart

/sbin/service httpd restart

Guide Openstack: http://clouddocs.web.cern.ch/clouddocs/index.html

Modify all the staticfiles and the index.html files

Check whether the EOSCMS subdirectories are correct / try deleting a subdirectory and see whether it is recreated (after a backup)

[root@eoscockpit-quota ~]# find /var/www/thttpd/quotas/eoscms/

How to find all the remaining "http" links:

find /var/www/thttpd/ -type f | xargs grep eoscockpit-quota | grep -v https
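A slightly stricter variant of the search above matches the literal string http://HOST, so lines that merely lack the word https are not reported. This is a sketch: the function name list_http_links is invented here, and it relies on the recursive -r option of GNU grep.

```shell
# list_http_links DIR HOST: print "file:line:text" for every line under DIR
# that links to HOST over plain http. -F treats the pattern as a fixed
# string, so the dots in the host name are not regex wildcards.
list_http_links() {
    grep -rnF "http://$2" "$1"
}
```

Possible usage: list_http_links /var/www/thttpd eoscockpit-quota.cern.ch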

History:

1st September --> Modified /afs/cern.ch/project/eos/operations/brw_quota/templates.py --> modifying all the http to https

14th September --> Updated /var/www/thttpd/index.html --> modifying all the // to "https"

Updated /var/www/thttpd/staticfiles/sls.html --> modifying all the http to https

Updated /var/www/thttpd/staticfiles/eos_monthly_view.html

Pages not working:

- https://eoscockpit-quota.cern.ch/quotas/eoscms/eos.html

- https://eoscockpit-quota.cern.ch/quotas/eosuser/eos.html

- https://eoscockpit-quota.cern.ch/staticfiles/eos_monthly_view.html

Other pages: vi /etc/httpd/conf/httpd.conf

[root@eoscockpit-quota thttpd]# cat /afs/cern.ch/project/eos/operations/generate_eoscockpit_browsable_quota.sh

#!/bin/sh

env XrdSecPROTOCOL=krb5 EOS_MGM_URL=eosatlas.cern.ch eos -b -r 3 4 quota ls -m > /scratch/eosquota/eosatlas_keyvalue.quota

python2.6 /afs/cern.ch/project/eos/operations/brw_quota/parseeosquota.py /scratch/eosquota/eosatlas_keyvalue.quota /var/www/thttpd/quotas

env XrdSecPROTOCOL=krb5 EOS_MGM_URL=eoscms.cern.ch eos -b -r 3 4 quota ls -m > /scratch/eosquota/eoscms_keyvalue.quota

python2.6 /afs/cern.ch/project/eos/operations/brw_quota/parseeosquota.py /scratch/eosquota/eoscms_keyvalue.quota /var/www/thttpd/quotas

env XrdSecPROTOCOL=krb5 EOS_MGM_URL=eospublic.cern.ch eos -b -r 3 4 quota ls -m > /scratch/eosquota/eospublic_keyvalue.quota

python2.6 /afs/cern.ch/project/eos/operations/brw_quota/parseeosquota.py /scratch/eosquota/eospublic_keyvalue.quota /var/www/thttpd/quotas

#env XrdSecPROTOCOL=krb5 EOS_MGM_URL=eospps.cern.ch eos -b -r 3 4 quota ls -m > /scratch/eosquota/eospps_keyvalue.quota

#python2.6 /afs/cern.ch/project/eos/operations/brw_quota/parseeosquota.py /scratch/eosquota/eospps_keyvalue.quota /var/www/thttpd/quotas

env XrdSecPROTOCOL=krb5 EOS_MGM_URL=eoslhcb.cern.ch eos -b -r 3 4 quota ls -m > /scratch/eosquota/eoslhcb_keyvalue.quota

python2.6 /afs/cern.ch/project/eos/operations/brw_quota/parseeosquota.py /scratch/eosquota/eoslhcb_keyvalue.quota /var/www/thttpd/quotas

env XrdSecPROTOCOL=krb5 EOS_MGM_URL=eosuser.cern.ch eos -b -r 3 4 quota ls -m > /scratch/eosquota/eosuser_keyvalue.quota

python2.6 /afs/cern.ch/project/eos/operations/brw_quota/parseeosquota.py /scratch/eosquota/eosuser_keyvalue.quota /var/www/thttpd/quotas
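The script repeats the same two commands for every instance, changing only the instance name. A loop makes that pattern explicit. The sketch below is a dry run that only prints the commands (it is not the deployed script, and the function name print_quota_commands is invented for this note); the paths and options are copied from the listing above.

```shell
# print_quota_commands INSTANCE...: print, without running them, the two
# commands that the quota script repeats for each EOS instance.
print_quota_commands() {
    for instance in "$@"; do
        out="/scratch/eosquota/${instance}_keyvalue.quota"
        echo "env XrdSecPROTOCOL=krb5 EOS_MGM_URL=${instance}.cern.ch eos -b -r 3 4 quota ls -m > ${out}"
        echo "python2.6 /afs/cern.ch/project/eos/operations/brw_quota/parseeosquota.py ${out} /var/www/thttpd/quotas"
    done
}
```

Possible usage: print_quota_commands eosatlas eoscms eospublic eoslhcb eosuser, then review the output before piping it to sh.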

[root@eoscockpit-quota thttpd]# ls -l /scratch/eosquota/eosuser_keyvalue.quota

[root@eoscockpit-quota ~]# grep http /afs/cern.ch/project/eos/operations/brw_quota/parseeosquota.py

[root@eoscockpit-quota ~]# cat /scratch/eosquota/eosuser_keyvalue.quota

[root@eoscockpit-quota ~]# ls -l /afs/cern.ch/project/eos/operations/brw_quota/templates.py

[root@eoscockpit-quota ~]# ls -l /afs/.cern.ch/project/eos/operations/brw_quota/templates.py_backup2016

[root@eoscockpit-quota ~]# ls -l /afs/.cern.ch/project/eos/operations/brw_quota/

[afiorot@aiadm721 ~]$ afs_admin vos_release p.eos.root var/www/thttpd/quotas/eosatlas/eos/

------

Security Issues: How to

PROBLEMS WITH EOSCOCKPIT QUOTA: apachectl -V

Summary:

Debugging functions are enabled on the remote HTTP server.

The remote webserver supports the TRACE and/or TRACK methods. TRACE and TRACK are HTTP methods which are used to debug web server connections.

It has been shown that servers supporting this method are subject to cross-site-scripting attacks, dubbed XST for Cross-Site-Tracing, when used in conjunction with various weaknesses in browsers.

An attacker may use this flaw to trick your legitimate web users to give him their credentials.

Solution:

Disable these methods.

How:

With Apache 1.3.34 or later, or 2.0.55 or later, it is enough to add the following line to /etc/httpd/conf/httpd.conf

......

Summary:

This routine searches for weak SSL ciphers offered by a service.

Vulnerability Detection Result:

Weak ciphers offered by this service:

SSL3_RSA_RC4_128_MD5

SSL3_RSA_RC4_128_SHA

SSL3_ECDHE_RSA_WITH_RC4_128_SHA

TLS1_RSA_RC4_128_MD5

TLS1_RSA_RC4_128_SHA

TLS1_ECDHE_RSA_WITH_RC4_128_SHA

Solution:

The configuration of this service should be changed so that it no longer supports the listed weak ciphers.

How:

[root@eoscockpit-quota ~]# grep SSLCi /etc/httpd/conf.d/ssl.conf

SSLCipherSuite DEFAULT:!EXP:!SSLv2:!DES:!IDEA:!SEED:+3DES

Open your httpd.conf or ssl.conf file and search for the SSLCipherSuite directive. If you can’t find it anywhere, you can just add it, otherwise, replace it with the following:

SSLProtocol all -SSLv2 -SSLv3

SSLHonorCipherOrder on

SSLCipherSuite "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA EECDH EDH+aRSA !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS"

......

Summary: It was possible to detect the usage of the deprecated SSLv2 and/or SSLv3 protocol on this system.

Vulnerability Detection Result:

In addition to TLSv1+ the service is also providing the deprecated SSLv3 protocol and supports one or more ciphers.

Solution:

It is recommended to disable the deprecated SSLv2 and/or SSLv3 protocols in favor of the TLSv1+ protocols. Please see the references for more information.

XRDFED:

Xroot Regional Redirector for:

- FAX (Atlas)

- AAA (CMS)

Composition:

- Each host has two daemons:

- CMSD (cluster management service)

- Xroot

Usage:

Data access failure recovery actions

| JOBS | → | Local Copy | → | Failure ?? | – (if yes) → | Locate file via the federated Infrastructure |

http://configdocs.web.cern.ch/configdocs/index.html

Monitoring and documentation:

- Kibana → xrdfed

- https://cern.service-now.com/service-portal/sls.do → Load Balancing section

- https://gitlab.cern.ch/ai/it-puppet-hostgroup-xrdfed/activity → GitLab

- Hiera

- Puppet

- XrootD

- https://twiki.cern.ch/twiki/bin/viewauth/DSSGroup/XrdfedService → Twiki

- https://twiki.cern.ch/twiki/bin/view/Main/CmsXrootdArchitecture

Activity:

[root@xrdcmstzero02 ~]# tail -f /var/log/xrootd/cmstzero/xrootd.log --> check activity in the logs

Network activity:

- iptraf-ng → Interactive Colorful IP LAN Monitor

- iftop → display bandwidth usage on an interface by host

- host → DNS lookup utility → host cms-xrd-tzero.cern.ch

- lbclient -d → load balancing utility

Services status:

systemctl status cmsd@[redirector]

systemctl status xrootd@[redirector]

Update of xrootd in xrdfed machines:

How to check the xrootd version currently in use:

[root@xrdatlasde02 ~]# grep "xrd version" /var/log/xrootd/fed/xrootd.log

160331 03:29:02 441 Copr. 2004-2012 Stanford University, xrd version v4.2.3

How to check the last xrootd package installed: rpm -qa | grep xrootd

How to check the last yum updates in the machine:

[root@xrdatlasde02 ~]# tail -50 /var/log/yum.log

......

1) login in the machine

2) puppet run

3) roger status intervention

4) check that the node is going out of the alias: find alias in cat /etc/motd

- lbclient -d (to be run in the node; if it is the only one there will be no alias)

- nslookup ALIAS

- host ALIAS

- timber.cern.ch (put the alias)

5) ip addr show

when host is out of the alias:

7) yum update -y

8) restart cmsd and xrootd daemons

9) check the services:

systemctl status xrootd@[active xroot redirector]

systemctl status cmsd@[active xroot redirector]

10) lemon-host-check

11) puppet agent -tv

12) roger in production

13) check that the host is again in the alias

14) Check daemons and logs to verify that activity is ongoing

In case of problems with alias or something else debug with --> lbclient -d
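Steps 4 and 13 both come down to asking whether the node's IP address is currently among the addresses the alias resolves to. A minimal sketch that reads the output of host ALIAS follows; the function name alias_has_ip is made up for this note, and it assumes the usual "NAME has address IP" / "NAME has IPv6 address IP" output lines of the host utility.

```shell
# alias_has_ip IP: read `host ALIAS` output on stdin and succeed if one of
# the address records ends with exactly IP.
alias_has_ip() {
    awk -v ip="$1" '
        NF >= 2 && $(NF - 1) == "address" && $NF == ip { found = 1 }
        END { exit (found ? 0 : 1) }'
}
```

Possible usage, with ALIAS and the IP taken from /etc/motd and ip addr show: host ALIAS | alias_has_ip 128.142.154.243 && echo "still in the alias".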

RUNDECK:

List of machines:

[afiorot@wlustagemanu ~]$ ls -l /var/rundeck/projects/EOS-PPS/etc/resources.xml

bash-4.1$ cat /tmp/rundeck.xml

How to install plugins:

List of installed plugins → [afiorot@wlustagemanu ~]$ ls -l /var/lib/rundeck/libext/

https://github.com/cernops/rundeck-exec-kerberos

Kerberos commands: kinit klist kdestroy

http://linux.web.cern.ch/linux/docs/kerberos-access.shtml

Create a kerberos keytab for a service account:

ktutil

ktutil: add_entry -password -p [email protected] -k 1 -e arcfour-hmac-md5

ktutil: add_entry -password -p [email protected] -k 1 -e aes256-cts

ktutil: wkt proctc.keytab

ktutil: quit

Test the keytab:

kinit -kt /etc/krb5.keytab.dssview dssview

klist

If it fails, check the permissions and ownership of the keytab. Changing the owner from root to the user will probably help; making the keytab readable by everybody also works but is NOT SECURE.

[afiorot@wlustagemanu ~]$ ls -l /etc/krb5.keytab.dssview

-rw------. 1 root root 124 Jun 1 15:25 /etc/krb5.keytab.dssview –> to be changed to afiorot:root

Set the username and the keytab path in the Rundeck settings, and specify the node name and the user that will access the machines, in this case → samba-client and root (to be set in resources.xml)
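The ownership problem described above can be checked before running kinit. The sketch below succeeds only when the keytab belongs to the given numeric UID and has mode 600, which is the secure alternative to making it world-readable. The function name keytab_owner_only is invented for this note, and stat -c is the GNU coreutils form.

```shell
# keytab_owner_only UID FILE: succeed if FILE is owned by the numeric UID
# and its permission bits are exactly 600 (readable by the owner only).
keytab_owner_only() {
    [ "$(stat -c '%u %a' "$2")" = "$1 600" ]
}
```

Possible usage: keytab_owner_only "$(id -u afiorot)" /etc/krb5.keytab.dssview || echo "fix owner or mode of the keytab".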

GITLAB:

- Code → code

- Data → variables

[afiorot@aiadm060 ~]$ git clone https://:@gitlab.cern.ch:8443/ai/it-puppet-hostgroup-xrdfed.git

[afiorot@aiadm060 ~]$ cd it-puppet-hostgroup-xrdfed

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ git status

# On branch qa

nothing to commit (working directory clean)

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ git checkout master

Branch master set up to track remote branch master from origin by rebasing.

Switched to a new branch 'master'

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ git merge --no-ff qa

Auto-merging code/manifests/init.pp

Auto-merging data/hostgroup/xrdfed/cms.yaml

CONFLICT (content): Merge conflict in data/hostgroup/xrdfed/cms.yaml

Automatic merge failed; fix conflicts and then commit the result.

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ vi data/hostgroup/xrdfed/cms.yaml

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ git status

# On branch master

# Changes to be committed:

#

# modified: code/manifests/init.pp

# modified: data/hostgroup/xrdfed.yaml

# modified: data/hostgroup/xrdfed/atlas.yaml

# modified: data/hostgroup/xrdfed/monitor.yaml

#

# Unmerged paths:

# (use "git add/rm ..." as appropriate to mark resolution)

#

# both modified: data/hostgroup/xrdfed/cms.yaml

#

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ git add data/hostgroup/xrdfed/cms.yaml

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ git commit

[master ad8181b] Merge branch 'qa'

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ git status

# On branch master

# Your branch is ahead of 'origin/master' by 4 commits.

# nothing to commit (working directory clean)

[afiorot@aiadm060 it-puppet-hostgroup-xrdfed]$ git push origin master

Counting objects: 1, done.

Writing objects: 100% (1/1), 254 bytes, done.

Total 1 (delta 0), reused 0 (delta 0)

To https://:@gitlab.cern.ch:8443/ai/it-puppet-hostgroup-xrdfed.git

56e50b1..ad8181b master -> master
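During a conflicting merge like the one in this session, the files that still need manual resolution can be listed from the machine-readable status instead of reading the long form. This is a sketch: the function name unmerged_paths is made up for this note, and it assumes the two-letter status codes of git status --porcelain (version 1 format), where DD, AU, UD, UA, DU, AA and UU mark unmerged entries.

```shell
# unmerged_paths: read `git status --porcelain` output on stdin and print
# the paths whose two-letter status marks them as unmerged.
unmerged_paths() {
    awk '{
        code = substr($0, 1, 2)            # XY status columns
        if (code ~ /^(DD|AU|UD|UA|DU|AA|UU)$/)
            print substr($0, 4)            # path starts after "XY "
    }'
}
```

Possible usage inside the repository: git status --porcelain | unmerged_paths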

How to check ACLs and permissions:

[root@xrdatlasuk02 ~]# cat /etc/xrootd/xrootd-fed.cfg

# Country specific

cms.allow host *.ac.uk

exception.XrootdManagerStuck:

Check which process is stuck: tail -50 /var/log/xrootd/fed/xrootd.log

tail -50 /var/log/xrootd/fed/cmsd.log

In this case we found cmsd being stuck:

service cmsd@fed status

service cmsd@fed restart

Check that the service is running again:

tail -f /var/log/xrootd/fed/cmsd.log

lemon-host-check --sh

2) In another case we had a problem with name resolution:

[root@xrdatlases02 ~]# tail -f /var/log/xrootd/fed/cmsd.log

160420 08:54:18 4557 XrdOpen: Unable to connect socket to xrdatlaseu01.cern.ch; connection timed out

160420 08:56:24 4560 XrdOpen: Unable to connect socket to xrdatlaseu02.cern.ch; no route to host

160420 08:59:30 4557 XrdOpen: Unable to connect socket to xrdatlaseu01.cern.ch; no route to host

- grepping the PID of cmsd and checking with lsof we see that the connection between xrdatlases02 and xrdatlaseu01 (on the upper level of the tree) was not established:

cmsd 4529 xrootd 29u IPv6 16680557 0t0 TCP xrdatlases02.cern.ch:41675->xrdatlaseu01.cern.ch:rmiactivation (SYN_SENT)

cmsd 4529 xrootd 30u IPv6 16663304 0t0 TCP xrdatlases02.cern.ch:rmiactivation->t2fax.ific.uv.es:40113 (ESTABLISHED)

cmsd 4529 xrootd 31u IPv6 16681178 0t0 TCP xrdatlases02.cern.ch:55756->xrdatlaseu02.cern.ch:rmiactivation (SYN_SENT)

- checking from the xrdatlaseu02 host, we see that there is a problem with name resolution:

[root@xrdatlaseu02 ~]# host xrdatlases02.cern.ch

;; connection timed out; trying next origin

;; connection timed out; no servers could be reached

- Checking in the cmsd.log we see that the server does not accept connection from the other host:

[root@xrdatlaseu01 ~]# less /var/log/xrootd/fed/cmsd.log

160420 09:26:51 13044 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:27:00 28194 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:27:09 1933 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:27:18 27163 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:27:27 28200 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:27:36 28183 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:27:45 13040 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:27:54 28188 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:28:03 13044 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:28:12 28194 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:28:21 1933 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:28:30 27163 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:28:39 28200 XrdAccept: Unable to accept TCP connection from [::ffff:128.142.154.243]; permission denied

160420 09:28:39 13040 manager.839:24@xrdatlasfr02:1094 do_Have: /atlas/rucio/mc15_13TeV:AOD.07120977._000088.pool.root.1

160420 09:30:12 25163 Starting on Linux 3.10.0-327.10.1.el7.x86_64

Copr. 2004-2012 Stanford University, xrd version v4.3.0

++++++ cmsd [email protected] initialization started.

Config using configuration file /etc/xrootd/xrootd-fed.cfg

=====> xrd.port 1098 if exec cmsd

=====> all.adminpath /var/spool/xrootd

=====> all.sitename CERN-PROD

=====> xrd.report uct2-int.mwt2.org:9932,localhost:3333 every 60s all -buff -poll sync

Config maximum number of connections restricted to 16384

Config maximum number of threads restricted to 7214

Copr. 2007 Stanford University/SLAC cmsd.

++++++ [email protected] phase 1 initialization started.

=====> all.adminpath /var/spool/xrootd

=====> all.role meta manager

=====> all.export /atlas

160420 09:33:13 25163 Config: Unable to add host atlas-xrd-eu.cern.ch ; Name or service not known

- In this case we opened a ticket to the Cloud Infrastructure → INC1008218
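When cmsd.log fills up with "permission denied" lines as in this incident, a per-address count shows quickly whether one peer or many are being refused. A minimal sketch, assuming the XrdAccept message format shown in the log excerpt above; the function name denied_by_ip is invented for this note.

```shell
# denied_by_ip: read cmsd.log lines on stdin and print "count address" for
# each peer refused with "permission denied", most frequent first.
denied_by_ip() {
    grep 'XrdAccept: Unable to accept TCP connection from' |
        grep 'permission denied' |
        # keep only the text between the square brackets
        sed 's/.*connection from \[\([^]]*\)].*/\1/' |
        sort | uniq -c | sort -nr |
        awk '{ print $1, $2 }'
}
```

Possible usage: denied_by_ip < /var/log/xrootd/fed/cmsd.log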

How to update GNI reference KB in puppet:

Example with CASTOR: https://gitlab.cern.ch/ai/it-puppet-hostgroup-castor/blob/qa/data/hostgroup/castor.yaml

Change all the relevant KB references and save them in QA.

After that, ask the Service Manager to push the change to Master.

How to rebuild a Virtual Machine keeping the same IP address:

- put the machine out of the alias setting it in DISABLED status through ROGER

- Reinstall it with → ai-rebuild-vm cc7 [hostname]

- the command “rebuild” permits to keep the IP address of the machines's

- the option “cc7” permits to install the latest version available of the

Contextualization: the process in which the machine is put in a specific context by setting different parameters. The settings and parameters that will be applied can be checked in the puppetinit file: less /var/lib/cloud/instance/scripts/puppetinit → it is executed at the first run of the machine

Moreover, it is possible to check what is happening on the machine with –> journalctl -f

After the contextualization is completed, we need to change a parameter in puppet.conf.

This is needed because the VM was originally installed with an environment that no longer exists, so we need to update it:

vi /etc/puppet/puppet.conf –> change env to → production and add the following line at the bottom –> stringify_facts = false

The stringify option is needed in order to make puppet understand the different flags present.
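The manual puppet.conf edit can also be scripted. The sketch below makes two assumptions that are not stated in the notes: that the setting referred to as "env" is the "environment = ..." line of puppet.conf, and that GNU sed (-i) is available. The function name fix_puppet_conf is invented for this note. It keeps a .bak copy and does not touch an existing stringify_facts line.

```shell
# fix_puppet_conf FILE: set "environment = production" on an existing
# environment line and append "stringify_facts = false" if it is absent.
fix_puppet_conf() {
    conf=$1
    cp -p "$conf" "$conf.bak"    # keep a copy of the original file
    # Replace the value of the environment line, keeping its indentation.
    sed -i 's/^\([[:space:]]*environment[[:space:]]*=\).*/\1 production/' "$conf"
    # Append the stringify_facts setting only when no such line exists yet.
    grep -q '^[[:space:]]*stringify_facts[[:space:]]*=' "$conf" ||
        echo 'stringify_facts = false' >> "$conf"
}
```

Possible usage: fix_puppet_conf /etc/puppet/puppet.conf, followed by a diff against /etc/puppet/puppet.conf.bak before the next puppet run.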

From aiadm:

[afiorot@aiadm704 ~]$ ai-rebuild-vm c2stress17.cern.ch --cc7

Trying to rebuild 'c2stress17.cern.ch'...

VM tenant: Personal afiorot

Couldn't rebuild VM ('c2stress17' not found)

In case of this error, change and set the right tenant:

[afiorot@aiadm712 ~]$ env | grep OS_

OS_PROJECT_DOMAIN_ID=default

OS_PROJECT_NAME=Personal afiorot

OS_IDENTITY_API_VERSION=3

OS_AUTH_TYPE=v3kerberos

EOS_MGM_URL=root://eospps.cern.ch

OS_AUTH_URL=https://keystone.cern.ch/krb/v3

[afiorot@aiadm712 ~]$ export OS_PROJECT_NAME='IT CASTOR Stress Test'

Try the installation again and then check the status with: openstack server show c2stress17

SAMBA:

NFS:

Excellent for sharing between Linux and other Unix systems. Incompatible with Windows clients, and of little use for Mac file sharing because of missing features and compatibility and performance problems with Mac apps.

WebDAV:

WebDAV (or web Distributed Authoring and Versioning) is a way of accessing files and folders stored on another computer. In terms of functionality, it stands between the well-known and very basic FTP protocol and the more powerful native filesystems (e.g. SMB/CIFS for Windows, NFS/AFS for Linux)

Web-based Distributed Authoring and Versioning (WebDAV) is a set of methods based on the Hypertext Transfer Protocol (HTTP) that facilitates collaboration between users in editing and managing documents and files stored on World Wide Web servers.

DFS:

DFS (Distributed File System) provides the possibility to offer a reliable, redundant and replicated file system that is logically accessible and that is spanning over a large number of independent servers.

SAMBA: great choice due to Windows compatibility and its implementation of the SMB/CIFS protocols

Samba is an important component to seamlessly integrate Linux/Unix Servers and Desktops into Active Directory environments. It can function either as a domain controller or as a regular domain member. Samba is based on the common client/server protocol of Server Message Block (SMB) and Common Internet File System (CIFS). Using client software that also supports SMB/CIFS (for example, most Microsoft Windows products), an end user sends a series of client requests to the Samba server on another computer in order to open that computer's files, access a shared printer, or access other resources. The Samba server on the other computer responds to each client request, either granting or denying access to its shared files and resources.

SAMBA/EOS:

We use SAMBA as a gateway on top of EOSUSER (CERNBOX) through the FUSE mount. It works as a Domain Member of the Active Directory Environments in the CERN Infrastructure.

An open source implementation of the SMB file sharing protocol that provides file and print services to SMB/CIFS clients. Samba allows a non-Windows server to communicate with the same networking protocol as the Windows products.

The Samba SMB/CIFS client is called smbclient.

CERNBox with SAMBA on the TOP

HOW TO CREATE A SAMBA SERVER IN ORDER TO MAKE EOS FUSEMOUNT VISIBLE ALSO FOR WINDOWS MACHINES

How to install SAMBA on a physical machine:

Documentation CENTOS 7:

- https://www.howtoforge.com/samba-server-installation-and-configuration-on-centos-7

- http://www.unixmen.com/install-configure-samba-server-centos-7/

- Samba-HOWTO-Collection.pdf

smbd -b → to check all the builtin options

SNOW Tickets:

INC0987875 → net ads -U

INC0997654 → Kerberos problem using "smbclient -k -L //hostname/"

IMPLEMENT LDAP AND KERBEROS AUTH http://linux.web.cern.ch/linux/docs/kerberos-access.shtml

PERMISSIONS: http://www.cyberciti.biz/tips/how-do-i-set-permissions-to-samba-shares.html

INSTALL SAMBA AND CREATE A STANDALONE SERVER:

1) Install the machine in CC7

2) Install the necessary packages --> yum install samba-client.x86_64 samba-common.x86_64 samba-python.x86_64 samba-winbind.x86_64 samba.x86_64 smbldap-tools.noarch samba-libs.x86_64 samba-winbind-clients

Packages currently installed:

[root@p05614910f69219 ~]# rpm -qa | grep samba

samba-common-tools-4.2.10-7.el7_2.x86_64

samba-winbind-4.2.10-7.el7_2.x86_64

samba-client-libs-4.2.10-7.el7_2.x86_64

samba-python-4.2.10-7.el7_2.x86_64

samba-test-libs-4.2.10-7.el7_2.x86_64

samba-common-4.2.10-7.el7_2.noarch

samba-4.2.10-7.el7_2.x86_64

samba-client-4.2.10-7.el7_2.x86_64

samba-common-libs-4.2.10-7.el7_2.x86_64

samba-winbind-clients-4.2.10-7.el7_2.x86_64

samba-libs-4.2.10-7.el7_2.x86_64

samba-winbind-modules-4.2.10-7.el7_2.x86_64

samba-test-4.2.10-7.el7_2.x86_64

3) Check that the smb.conf is present and create a backup one --> ls -l /etc/samba/smb.conf

4) Set SELinux as "disabled" --> vi /etc/sysconfig/selinux

5) Give permissions to SAMBA in the firewall:

firewall-cmd --permanent --zone=public --add-service=samba

firewall-cmd --reload

In case nothing works try to disable the firewall --> systemctl stop firewalld

6) Disable ip tables: systemctl status ip6tables.service; systemctl status iptables.service

7) Configure the /etc/samba/smb.conf (A*) and create a sharepoint:

[data]

comment = Data

path = /export

read only = Yes

guest ok = Yes

8) Create the directory of the sharepoint --> mkdir /export

9) Add user to the smbpasswd file --> smbpasswd -a afiorot

10) Change the owner of the directory --> chown afiorot.users /export

11) Change permission of the directory --> chmod u+rwx,g+rx,o+rx /export

12) Check if the smb.conf is well configured --> testparm /etc/samba/smb.conf

13) Start the services and check their status --> systemctl start nmb.service smb.service winbind.service; systemctl status nmb.service; systemctl status smb.service; systemctl status winbind.service

14) Check the logs:

tail -f /var/log/samba/log.nmbd

tail -f /var/log/samba/log.smbd

tail -f /var/log/samba/log.winbindd

journalctl -f

15) Copy some files in /export

16) Try to connect from a linux machine --> smbclient -L //hostname

17) Try to connect from Windows and Mac machines --> Start --> Run.. --> \\IP-Address --> if the connection is successful it is possible to mount a drive
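Steps 7 to 11 depend on the share's path in smb.conf and the directory on disk being the same. A small helper can read the path of a share back from the configuration so it can be compared with the directory that was created. This is a sketch: the function name share_path is made up for this note, and it only handles the simple "[share]" / "key = value" layout used in step 7 (no include files, no line continuations).

```shell
# share_path SHARE FILE: print the "path" setting found in the [SHARE]
# section of an smb.conf-style FILE (first match only).
share_path() {
    awk -v share="[$1]" '
        /^[[:space:]]*\[/ {
            # Section header: strip surrounding blanks and compare the name.
            gsub(/^[[:space:]]+|[[:space:]]+$/, "")
            in_share = ($0 == share)
            next
        }
        in_share && /^[[:space:]]*path[[:space:]]*=/ {
            sub(/^[^=]*=[[:space:]]*/, "")   # drop "path =" and leading blanks
            print
            exit
        }' "$2"
}
```

Possible usage: ls -ld "$(share_path data /etc/samba/smb.conf)" shows the owner and mode of the exported directory.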

CONFIGURE SAMBA AS A DOMAIN MEMBER OF THE CERN NETWORK:

18) Set /etc/samba/smb.conf in order to prepare the join in the CERN network (B*)

19) Fetch domain SID (Security Identifier) and store it into local secrets.tdb --> net rpc getsid -S CERNDC36.cern.ch

Storing SID *************************** for Domain CERN in secrets.tdb

20) Join the domain (the same commands apply in case of login problems):

- cern-get-keytab --passwordsmb --force

- systemctl restart sshd.service

- login from aiadm

- check in the logs what has changed:

less /var/log/messages

less /var/log/yum.log

less /var/log/distro_sync.log

ls -lart /var/log/distro_sync.log*

20b) How to check which ports are being used → netstat -tulpn | egrep "smbd|nmbd|winbind"

21) Execute testparm ...smb.conf and restart services:

systemctl restart nmb.service; systemctl restart smb.service; systemctl restart winbind.service; systemctl status nmb.service; systemctl status smb.service; systemctl status winbind.service

22) Joined CERN domain. This sets the server as an ACTIVE DIRECTORY DOMAIN MEMBER of CERN.CH in order to use Kerberos authentication

23) Check that the Kerberos login is working from Linux (It is needed to put “CERN.CH”), Windows and MAC:

Linux --> smbclient -k //p05614910f69219.cern.ch/data -U afiorot

Windows --> Run.. --> \\128.142.209.40

Mac --> Finder --> Go..--> Connect to Server --> smb://128.142.209.40

24) Create a manifest and run puppet agent -tv

25) Install EOS

26) Install FUSE: yum update http://dss-ci-repo.web.cern.ch/dss-ci-repo/eos/aquamarine/commit/el-7/x86_64/eos-fuse-0.3.155-20160329gitaba9434.el7.x86_64.rpm

To know when an RPM has been updated check the following logs:

- /var/log/yum.log

- /var/log/distro_sync.log

26b) Mount the EOS instance

27) Run Puppet

28) Set parameters in smb.conf:

security = ADS

passdb backend = tdbsam

realm = CERN.CH

idmap config * : backend = nss

idmap config * : range = 1000-200000

allow trusted domains = No

winbind enum users = No

winbind enum groups = No

winbind nested groups = Yes

template shell = /bin/bash

winbind use default domain = Yes

#Security : CVE-2015-7560.html

unix extensions = no

#Quota support:

get quota command = /etc/eos_quota_samba.sh

The Name Service Switch (NSS) is a facility in Unix-like operating systems that provides a variety of sources for common configuration databases and name resolution mechanisms. These sources include local operating system files (such as /etc/passwd, /etc/group, and /etc/), the Domain Name System (DNS), the Network Information Service (NIS), and LDAP.

29) Follow the procedure to set Identity Mapping between SAMBA and LDAP: http://linux.web.cern.ch/linux/docs/account-mgmt.shtml

Sections:

- Enable sssd
- Run sssd

30) Get a new keytab and restart the daemons:

- cern-get-keytab --passwordsmb --force
- systemctl restart nmb.service; systemctl restart winbind.service; systemctl restart smb.service; systemctl restart sssd; systemctl restart nscd
- systemctl status nmb.service; systemctl status winbind.service; systemctl status smb.service; systemctl status sssd; systemctl status nscd

31) Check in the logs if the ID Mapping is successfully working:

- FUSE logs (on the Samba node) → tail -f /var/log/eos/fuse/fuse.main.log
- MGM logs (on the EOS hdnode) → tail -f /var/log/eos/mgm/xrdlog.mgm

32) How to clean databases in case of login or mapping problems:
- stop all services
- rm -f /var/lib/samba/*tdb
- rm -f /var/lib/sss/mc/*
- rm -f /var/lib/sss/db/*
- net cache flush
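The cleanup in step 32 can be wrapped in a small function so that it can first be tried on a scratch directory. This is a sketch under stated assumptions: the function name `clean_samba_caches` and its optional root-prefix argument are hypothetical additions for illustration; the paths are the ones listed above.

```shell
# Remove the Samba TDB files and the SSSD caches.
# The optional first argument is a hypothetical root prefix (default: the
# real filesystem) so the function can be rehearsed on a scratch tree.
clean_samba_caches() {
    root=${1:-}
    rm -f "$root"/var/lib/samba/*tdb
    rm -f "$root"/var/lib/sss/mc/*
    rm -f "$root"/var/lib/sss/db/*
}

# Usage on the real node, after all services have been stopped:
#   clean_samba_caches && net cache flush
```

Only files matching the globs are removed; anything else under /var/lib/samba is left alone, exactly as with the manual commands.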

STOP SERVICES: systemctl stop nmb.service; systemctl stop winbind.service; systemctl stop smb.service; systemctl stop sssd; systemctl stop nscd; service eosd stop; systemctl status nmb.service; systemctl status winbind.service; systemctl status smb.service; systemctl status sssd; systemctl status nscd; service eosd status

RESTART SERVICES: systemctl restart nmb.service; systemctl restart winbind.service; systemctl restart smb.service; systemctl restart sssd; systemctl restart nscd; service eosd restart; systemctl status nmb.service; systemctl status winbind.service; systemctl status smb.service; systemctl status sssd; systemctl status nscd; service eosd status; systemctl daemon-reload
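The long stop/restart/status chains above repeat the same unit list. A minimal sketch of a loop over that list follows; the names `SAMBA_UNITS`, `samba_services` and the `DRY_RUN` switch are hypothetical and only serve this example.

```shell
# Units touched by the procedure, in the order used in these notes.
SAMBA_UNITS="nmb.service winbind.service smb.service sssd nscd"

# Run one systemctl action for every unit, then the same action for eosd
# through the SysV wrapper. With DRY_RUN set to a non-empty value the
# commands are only printed (hypothetical switch for rehearsing).
samba_services() {
    action=$1
    for unit in $SAMBA_UNITS; do
        ${DRY_RUN:+echo} systemctl "$action" "$unit"
    done
    ${DRY_RUN:+echo} service eosd "$action"
}

# Usage:
#   samba_services stop;    samba_services status
#   samba_services restart; samba_services status
```

Keeping the unit list in one variable means a service added later only has to be named once.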

In case of problems:
- strace -f -s 500 -p 49276 (49276 is the PID of the process to trace)
- remove all databases and restart services
- check in the logs

33) IDMAP Configuration:

IDMAP USING NSS:
idmap config * : range = 1000-200000
idmap config * : backend = nss

34) How to check the EOS fs config:
cat -n /etc/rc.d/init.d/eosd
cat /etc/sysconfig/eos
cat /etc/fuse.conf

35) How to make ACLs work properly:

Disable puppet

1) Update to the latest EOS client, fuse and debuginfo:
yum install http://dss-ci-repo.web.cern.ch/dss-ci-repo/eos/aquamarine/tag/el-7/x86_64/eos-client-0.3.195-1.el7.x86_64.rpm http://dss-ci-repo.web.cern.ch/dss-ci-repo/eos/aquamarine/tag/el-7/x86_64/eos-debuginfo-0.3.195-1.el7.x86_64.rpm http://dss-ci-repo.web.cern.ch/dss-ci-repo/eos/aquamarine/tag/el-7/x86_64/eos-fuse-0.3.195-1.el7.x86_64.rpm

2) Check the current version:

[root@p05614910f69219 ~]# rpm -qa | grep eos
eos-client-0.3.176-20160511git6ba4ade.el7.x86_64
eos-debuginfo-0.3.176-20160511git6ba4ade.el7.x86_64
eos-fuse-0.3.176-20160511git6ba4ade.el7.x86_64

3) Modify the /etc/sysconfig/eos:

[root@p05614910f69219 ~]# vim /etc/sysconfig/eos

Add the option:

# Create a mask for ACLs
export EOS_FUSE_MODE_OVERLAY=007

4) Restart services

If Puppet is not disabled, the next run will update the eos packages again and the ACLs will stop working.
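Step 3 of the ACL procedure can be made repeatable so that the option is not added twice when the procedure is rerun. The sketch below assumes the option is written exactly as shown in step 3; the function name `set_fuse_mode_overlay` is hypothetical.

```shell
# Append the ACL mask export to an EOS sysconfig file unless an
# EOS_FUSE_MODE_OVERLAY export is already present.
# The file defaults to /etc/sysconfig/eos; another path can be passed
# as the first argument.
set_fuse_mode_overlay() {
    conf=${1:-/etc/sysconfig/eos}
    if ! grep -q '^export EOS_FUSE_MODE_OVERLAY=' "$conf"; then
        printf '%s\n' '# Create a mask for ACLs' \
            'export EOS_FUSE_MODE_OVERLAY=007' >> "$conf"
    fi
}

# Usage: set_fuse_mode_overlay            # edits /etc/sysconfig/eos
#        set_fuse_mode_overlay /etc/sysconfig/eos.user
```

An existing export with a different value is left untouched, so a deliberate local setting is not overwritten.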

36) Samba performance optimization:

Use HELIOS LAN Test to run different kinds of tests on the network.

Handling large directories → https://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/largefile.html

Downloading big files from my own space (afiorot) runs at 6 MB/s, while downloading them from IT-STORAGE runs at 100 MB/s.

- in one case from /afiorot https://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/speed.html

- Reduce the log level to 1
- Clean up TDB files
- Share based in Wigner?
- EOS rpm?
- Setting the following parameters increased performance:

map archive = no

map hidden = no

map read only = no

map system = no

store dos attributes = no

Set → wide links = yes

It is a big mistake to set the wide links Samba parameter to no in the Samba configuration file /etc/smb.conf. This option, if set to no, tells Samba not to follow symbolic links outside of an area designated as being exported as a share point. To determine whether a link points outside the shared area, Samba has to follow the link and then do a directory path lookup to determine where on the file system the link ended up. This adds a total of six extra system calls per filename lookup, and Samba looks up filenames a lot. A published test showed that setting this parameter to no causes a 25 to 30 percent slowdown in Samba performance.

Commands for monitoring:

Process Monitoring:

Top – Linux Process Monitoring

Htop – Linux Process Monitoring → yum install htop.x86_64

smbstatus -v

Memory and System Monitoring:

dstat -cdgilmnprsty – replacement for vmstat, iostat, netstat, ifstat

- VmStat – Virtual Memory Statistics → yum install sysstat.x86_64

- Iostat – Input/Output Statistics → yum install sysstat.x86_64

- Netstat – Network Statistics

- Iotop – Monitor Linux Disk I/O

Glances – Real Time System Monitoring → yum install glances

Monit – Linux Process and Services Monitoring

Network Monitoring:

iftop -i enp4s0f0 -- Network Bandwidth Monitoring → yum install iftop

iptraf-ng -- Real Time IP LAN Monitoring → yum install iptraf

37) SAMBA Security: http://www.informit.com/library/content.aspx?b=red_hat_linux7&seqNum=167

38) SAMBA Monitoring:

- It is possible to monitor the activities on the server with → watch smbstatus -v
- How to debug a process:
ps aux | grep eosd
root 91396 0.0 0.0 112648 972 pts/0 S+ 14:10 0:00 grep --color=auto eosd
gdb eosd 91228

Meter Dashboard: https://meter.cern.ch/public/_plugin/kibana/#dashboard/temp/AVUVfYBSmlMFMGYYip6A

39) Eos Update with Puppet: https://gitlab.cern.ch/ai/it-puppet-hostgroup-box

The clone needs to be done only the first time:

[afiorot@aiadm070 ~]$ git clone https://:@gitlab.cern.ch:8443/ai/it-puppet-hostgroup-box.git

Initialized empty Git repository in /afs/cern.ch/user/a/afiorot/it-puppet-hostgroup-box/.git/
remote: Counting objects: 350, done.
remote: Compressing objects: 100% (319/319), done.
remote: Total 350 (delta 108), reused 0 (delta 0)

Receiving objects: 100% (350/350), 41.03 KiB, done.

Resolving deltas: 100% (108/108), done.

[afiorot@aiadm070 ~]$ cd it-puppet-hostgroup-box/

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git checkout qa

Already on 'qa'

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git pull

Current branch qa is up to date.

[afiorot@aiadm070 it-puppet-hostgroup-box]$ ls -l
total 4
drwxr-xr-x. 4 afiorot c3 2048 Jun 21 09:31 code
drwxr-xr-x. 3 afiorot c3 2048 Jun 21 09:31 data

[afiorot@aiadm070 it-puppet-hostgroup-box]$ ls -l data/hostgroup/box/
client_fusetest.yaml gateway.yaml samba.yaml webserver.yaml

[afiorot@aiadm070 it-puppet-hostgroup-box]$ find | grep -v .git

./code

./code/README.md

./code/files

./code/files/etc.sysconfig.

…......

[afiorot@aiadm070 it-puppet-hostgroup-box]$ vi code/manifests/samba.pp #Change the version of EOS we want

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git add code/manifests/samba.pp

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git commit #Add a commit message describing the change, then save and exit the editor with CTRL+X

[qa 6e80ed8] update eos fuse

1 files changed, 3 insertions(+), 3 deletions(-)

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git push

Counting objects: 9, done.

Delta compression using up to 8 threads.

Compressing objects: 100% (5/5), done.

Writing objects: 100% (5/5), 428 bytes, done.

Total 5 (delta 3), reused 0 (delta 0)

To https://:@gitlab.cern.ch:8443/ai/it-puppet-hostgroup-box.git

02db183..6e80ed8 qa -> qa

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git checkout master

Branch master set up to track remote branch master from origin by rebasing.

Switched to a new branch 'master'

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git pull

Current branch master is up to date.

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git cherry-pick 6e80ed8

Finished one cherry-pick.

[master fc67026] update eos fuse

1 files changed, 3 insertions(+), 3 deletions(-)

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git push

Counting objects: 9, done.

Delta compression using up to 8 threads.

Compressing objects: 100% (5/5), done.

Writing objects: 100% (5/5), 430 bytes, done.

Total 5 (delta 3), reused 0 (delta 0)

To https://:@gitlab.cern.ch:8443/ai/it-puppet-hostgroup-box.git

76adbae..fc67026 master -> master

[afiorot@aiadm070 it-puppet-hostgroup-box]$ git branch

* master
  qa

#Run puppet on the node to update it
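The second half of the transcript above (bringing a commit already pushed to qa over to master) can be condensed as follows. This is only a sketch of the same four git commands; the function name `promote_commit` and the `DRY_RUN` switch are hypothetical.

```shell
# Cherry-pick one commit, already pushed on the qa branch, onto master
# and push it. The chain stops at the first failing step.
# With DRY_RUN set to a non-empty value the commands are only printed.
promote_commit() {
    commit=$1
    ${DRY_RUN:+echo} git checkout master &&
    ${DRY_RUN:+echo} git pull &&
    ${DRY_RUN:+echo} git cherry-pick "$commit" &&
    ${DRY_RUN:+echo} git push
}

# Usage, from inside the it-puppet-hostgroup-box clone:
#   promote_commit 6e80ed8
```

Chaining with && avoids pushing a master branch on which the cherry-pick has failed or left conflicts.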

40) Multiple FUSE Mounts with a SAMBA Server:

To set up multiple FUSE mounts on a Samba machine we need to:

1) Create an EOS conf file for each mount:

-rw-r--r-- 1 root root 8042 Aug 30 16:07 /etc/sysconfig/eos

-rw-r--r-- 1 root root 8048 Aug 30 16:12 /etc/sysconfig/eos.public

-rw-r--r-- 1 root root 8039 Aug 30 16:10 /etc/sysconfig/eos.user

In this case we have the general configuration (/etc/sysconfig/eos) and one for each mount:

/etc/sysconfig/eos.public

/etc/sysconfig/eos.user

2) In each mount config file we have to configure the following parameters:
export EOS_FUSE_MGM_ALIAS=eosuser.cern.ch
export EOS_FUSE_REMOTEDIR=/eos/user/    # remote dir
export EOS_FUSE_MOUNTDIR=/eos/user2/    # local dir

3) Create the local folder to be mounted

4) Create a share in the /etc/samba/smb.conf for each mount:

[eos]
comment = EOSUSER Samba
path = /eos/
guest ok = no
writable = yes
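Steps 2 and 4 above are repeated for every mount, so they lend themselves to two small generators. This is a minimal sketch: the function names `eos_mount_exports` and `samba_share` are hypothetical, and the printed settings are only the ones listed in the steps.

```shell
# Print the three per-mount exports for one FUSE mount.
# Arguments: MGM alias, remote directory, local mount directory.
eos_mount_exports() {
    printf 'export EOS_FUSE_MGM_ALIAS=%s\n' "$1"
    printf 'export EOS_FUSE_REMOTEDIR=%s\n' "$2"
    printf 'export EOS_FUSE_MOUNTDIR=%s\n' "$3"
}

# Print an smb.conf share stanza for one mounted path.
# Arguments: share name, comment, path.
samba_share() {
    printf '[%s]\n' "$1"
    printf '    comment = %s\n    path = %s\n' "$2" "$3"
    printf '    guest ok = no\n    writable = yes\n'
}

# Usage:
#   eos_mount_exports eosuser.cern.ch /eos/user/ /eos/user2/ >> /etc/sysconfig/eos.user
#   samba_share eos 'EOSUSER Samba' /eos/ >> /etc/samba/smb.conf
```

Both functions only print, so the output can be reviewed before it is appended to a configuration file.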

FUSE (Filesystem in Userspace) is a simple interface for userspace programs to export a virtual filesystem to the Linux kernel. FUSE also aims to provide a secure method for non privileged users to create and mount their own filesystem implementations.

IDMAPPING:

If Samba uses a Domain Controller, then the Samba users need to be resolvable by the system, either by using winbind, or nsswitch with the same LDAP back-end, or even a synchronisation mechanism that always keeps local and Samba users matched.

Using LDAP and Kerberos, a domain member running Winbind can enumerate users and groups in exactly the same way as a Windows 200x client would, and in so doing provide a much more efficient and effective Winbind implementation.

Winbind is a component of the Samba suite of programs that solves the unified logon problem. Winbind uses a UNIX implementation of Microsoft RPC calls, Pluggable Authentication Modules (PAMs), and the name service switch (NSS) to allow Windows NT domain users to appear and operate as UNIX users on a UNIX machine.

Winbind is a daemon (service in Windows parlance) that runs on Samba clients and acts as a proxy for communication between PAM and NSS running on the Linux machine and Active Directory running on a DC. In particular, Winbind uses Kerberos to authenticate with Active Directory and LDAP to retrieve user and group information.

Winbind provides three separate functions:

- Authentication of user credentials (via PAM). This makes it possible to log onto a UNIX/Linux system using user and group accounts from a Windows NT4 (including a Samba domain) or an Active Directory domain.
- Identity resolution (via NSS). This is the default when winbind is not used.
- Winbind maintains a database called winbind_idmap.tdb in which it stores mappings between UNIX UIDs, GIDs, and NT SIDs. This mapping is used only for users and groups that do not have a local UID/GID. It stores the UID/GID allocated from the idmap uid/gid range that it has mapped to the NT SID. If idmap backend has been specified as ldap:ldap://hostname[:389], then instead of using a local mapping, Winbind will obtain this information from the LDAP database.

Note

If winbindd is not running, smbd (which calls winbindd) will fall back to using purely local information from /etc/passwd and /etc/group and no dynamic mapping will be used. On an operating system that has been enabled with the NSS, the resolution of user and group information will be accomplished via NSS. winbindd is a daemon that provides a number of services to the Name Service Switch capability found in most modern C libraries, to arbitrary applications via PAM and ntlm_auth and to Samba itself.

- Even if winbind is not used for nsswitch, it still provides a service to smbd, ntlm_auth and the pam_winbind.so PAM module, by managing connections to domain controllers. In this configuration the idmap config * : range parameter is not required. (This is known as `netlogon proxy only mode'.)
- The Name Service Switch allows user and system information to be obtained from different database services such as NIS or DNS. The exact behaviour can be configured through the /etc/nsswitch.conf file. Users and groups are allocated as they are resolved to a range of user and group ids specified by the administrator of the Samba system.
- The service provided by winbindd is called `winbind' and can be used to resolve user and group information from a Windows NT server. The service can also provide authentication services via an associated PAM module.
- The winbind solution is built on the winbind daemon (winbindd), a pluggable authentication module (PAM) called pam_winbind, a Name Service Switch (NSS) module called libnss_winbind, and a database file called winbind_idmap.tdb.

http://windowsitpro.com/windows-server/q-what-samba-winbind-and-how-can-i-use-it-let-users-log-their-unix-or-linux-host-thei

cerndc.cern.ch is a DNS round robin alias pointing to the Domain Controller servers (Active Directory). This DNS round robin is updated live according to the server availability to ensure a maximum uptime, targeting 100%.

The Domain Controller service is an LDAP database and a Kerberos server. LDAP is accessible in authenticated mode only, from inside CERN only.

This LDAP service can be used by any service requiring account information, but also E-Group membership information to manage authorizations.

ACTIVE DIRECTORY, LDAP and KERBEROS:

Using LDAP Authentication: The easiest but least satisfactory way to use Active Directory for authentication is to configure PAM to use LDAP authentication, as shown in Figure 1. Although Active Directory is an LDAPv3 service, Windows clients use Kerberos (with fallback to NTLM), not LDAP, for authentication purposes.

LDAP authentication (called LDAP binding) passes the user name and password in clear text over the network. This is insecure and unacceptable for most purposes.

The only way to mitigate this risk of passing credentials in the clear is to encrypt the client-Active Directory communication channel using something such as SSL.

Using LDAP and Kerberos: Another strategy for leveraging Active Directory for Linux authentication is to configure PAM to use Kerberos authentication and NSS to use LDAP to look up user and group information, as shown in Figure 2. This scheme has the advantage of being relatively more secure.

Using Winbind: The third way to use Active Directory for Linux authentication is to configure PAM and NSS to make calls to the Winbind daemon. Winbind will translate the different PAM and NSS requests into the corresponding Active Directory calls, using either LDAP, Kerberos, or RPC, depending on which is most appropriate.

Check USERS info:

[root@p05614910f69219 ~]# wbinfo -i afiorot
afiorot:*:1000:1004:Alessandro Fiorot:/home/CERN/afiorot:/bin/bash

Page 293 of the Samba-HOWTO should describe the correct configuration to use.

idmap config * : range = 1000-200000
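The wbinfo -i output is a passwd-style line, so the mapped UID can be compared with the configured idmap range. The sketch below assumes the seven colon-separated fields of passwd(5); the function names `passwd_field` and `uid_in_range` are hypothetical.

```shell
# Print one field of a passwd-style line (as printed by `wbinfo -i`).
# Field names follow passwd(5): name passwd uid gid gecos home shell.
passwd_field() {
    case $2 in
        name) n=1 ;; passwd) n=2 ;; uid) n=3 ;; gid) n=4 ;;
        gecos) n=5 ;; home) n=6 ;; shell) n=7 ;;
        *) return 1 ;;
    esac
    printf '%s\n' "$1" | cut -d: -f"$n"
}

# Succeed if the UID of the entry lies inside the range LOW..HIGH.
# Arguments: passwd-style line, LOW, HIGH.
uid_in_range() {
    uid=$(passwd_field "$1" uid) || return 1
    [ "$uid" -ge "$2" ] && [ "$uid" -le "$3" ]
}

# Usage:
#   uid_in_range "$(wbinfo -i afiorot)" 1000 200000 && echo "UID inside idmap range"
```

A UID outside the range is a hint that the entry was resolved by a different backend than the one the range belongs to.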

LDAP SERVER: ldap://xldap.cern.ch:389

Before Samba can access the LDAP server, you need to store the LDAP admin password in the Samba-3 secrets.tdb database:

root# smbpasswd -w secret

Problems with :

- after rebooting the machine everything seems fine

- Samba starts without any problem:
systemctl restart nmb.service; systemctl restart smb.service; systemctl restart winbind.service
systemctl status nmb.service; systemctl status smb.service; systemctl status winbind.service

- when restarting eosd, the system freezes:

Mar 22 15:23:57 p05614910f69219.cern.ch systemd[1]: Starting SYSV: Starts eosd...

Mar 22 15:23:58 p05614910f69219.cern.ch kernel: fuse init (API version 7.22)

Mar 22 15:23:58 p05614910f69219.cern.ch systemd[1]: Mounting FUSE Control File System...

Mar 22 15:23:58 p05614910f69219.cern.ch systemd[1]: Mounted FUSE Control File System.

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: Starting eosd for instance: main[ OK ]

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_DEBUG : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_LOWLEVEL_DEBUG : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_NOACCESS : 1

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_KERNELCACHE : 1

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_DIRECTIO : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_CACHE : 1

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_CACHE_SIZE : 300000000

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_BIGWRITES : 1

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_EXEC : 0

Mar 22 15:23:58 p05614910f69219.cern.ch systemd[1]: Started SYSV: Starts eosd.

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_NO_MT : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_USER_KRB5CC : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_USER_GSIPROXY : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_USER_KRB5FIRST : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_PIDMAP : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_RMLVL_PROTECT : 1

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_RDAHEAD : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_RDAHEAD_WINDOW : 131072

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_LAZYOPENRO : 0

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_LAZYOPENRW : 1

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_ATTR_CACHE_TIME : 10

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_ENTRY_CACHE_TIME : 10

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_NEG_ENTRY_CACHE_TIME : 30

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_CREATOR_CAP_LIFETIME : 30

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_FILE_WB_CACHE_SIZE : 67108864

Mar 22 15:23:58 p05614910f69219.cern.ch eosd[5448]: EOS_FUSE_LOG_PREFIX : main

Mar 22 15:23:58 p05614910f69219.cern.ch polkitd[2147]: Unregistered Authentication Agent for unix-process:5443:134241 (system bus name :1.23, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)

Broadcast message from [email protected] (Tue 2016-03-22 15:23:58 CET): systemd[1]: Caught , dumped core as pid 5482.

Broadcast message from [email protected] (Tue 2016-03-22 15:23:58 CET): systemd[1]: Freezing execution.

Mar 22 15:23:58 p05614910f69219.cern.ch systemd[1]: Caught , dumped core as pid 5482.

Message from syslogd@p05614910f69219 at Mar 22 15:23:58 ... systemd:Caught , dumped core as pid 5482.

Message from syslogd@p05614910f69219 at Mar 22 15:23:58 ... systemd:Caught , dumped core as pid 5482.

Mar 22 15:23:58 p05614910f69219.cern.ch systemd[1]: Freezing execution.

Message from syslogd@p05614910f69219 at Mar 22 15:23:58 ... systemd:Freezing execution.

Message from syslogd@p05614910f69219 at Mar 22 15:23:58 ... systemd:Freezing execution.

......

Troubleshooting:

After joining the AD domain (the server becomes an AD domain member), the root user seems unable to connect to the machine: it is asked for a password.

A solution to resolve this is to execute the following command: cern-get-keytab --keytab /etc/krb5.keytab --force --verbose

Now it will be possible to log in again, but another problem may appear: running this command probably "disjoins" the computer from the AD domain. Possible symptoms are that the computer cannot log in when connected to the network, a message that the computer account has expired, an invalid domain certificate, and so on.

These all stem from the same problem: the secure channel between the computer and the domain is broken.

The classic way to fix this problem is to unjoin and rejoin the domain. This is painful because it requires a couple of reboots and the user profile is not always reconnected. Furthermore, if the computer was in any groups or had specific permissions assigned to it, those are gone, because the computer now has a new SID and AD no longer sees it as the same machine.

When you register a system with a domain controller (net ads join), this creates a valid host principal for the system in /etc/krb5.keytab.

It also creates a computer object in AD. This object tracks, on the AD side, the principal whose data is stored client-side in /etc/krb5.keytab.

Terminology:

Domain : The name used to group computers and accounts.

SID : Each computer that joins the domain as a member must have a unique SID or Security Identifier.

SMB : Server Message Block.

NETBIOS: Network naming protocol used as an alternative to DNS. Mostly legacy, but still used in Windows Networking.

WINS: Windows Internet Name Service. Used for resolving NetBIOS names to Windows hosts.

Winbind: Samba daemon that resolves Windows domain users and groups and handles their authentication.

Keytab: file that contains pairs of Kerberos principals and encrypted keys, which are derived from the Kerberos password.

cern-get-keytab: CERN utility that stores host/service identities acquired from the CERN Active Directory KDC in the local keytab file.

.k5login: allows kerberised root access for the LanDB responsible

DEFINITIONS:

As defined by Microsoft, in Active Directory server roles, computers that function as servers within a domain can have one of two roles: member server or domain controller. A domain controller (DC) is a server on a Microsoft Windows or Windows NT network that is responsible for allowing host access to Windows domain resources. The domain controllers in your network are the centerpiece of your Active Directory. A DC stores user account information, authenticates users and enforces security policy for a Windows domain.

Member servers typically function as the following types of servers: file servers, application servers, database servers, Web servers, certificate servers, firewalls and remote-access servers.

4.5) INFO
net ads info
net ads lookup
net ads status
journalctl -f

To check how many users are connected to the samba server --> smbstatus

LIST OF SERVICES:

[root@centos-samba ~]# ls -l /usr/lib/systemd/system

Check which version of SAMBA --> smbd -V

DAEMONS:

[root@samba-test-server samba-4.3.4]# ls /sbin/

LOGS:

[root@samba-test ~]# tail -F /var/log/samba/log.nmbd --> check these errors

[root@samba-test ~]# tail -F /var/log/samba/log.smbd --> " " "

ERROR SOLVING:

[root@centos-samba2 ~]# net ads -d 5 testjoin

......

PERMISSIONS: Permission precedence

Samba comes with different types of permissions for shares. Keep a few things in mind about UNIX and Samba permissions.

(a) Linux system permissions take precedence over Samba permissions. For example, if a directory does not have Linux write permission, setting writeable = Yes in Samba (see below) will not allow writing to the shared directory.

(b) Samba permissions cannot take priority over filesystem restrictions. For example, if a filesystem is mounted read-only, setting writeable = Yes will not allow writing to any shared directory or share via the Samba server.

In short:

Limits set by kernel-level access control such as file permissions, file system mount options, ACLs, and SELinux policies cannot be overridden by Samba. Both the kernel and Samba must permit the user to perform an action on a file before that action can occur.
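Because the kernel-level check always applies, it is worth testing it before looking at smb.conf when a share refuses writes. A minimal sketch, run as the user the Samba session maps to; the function name `unix_allows_write` is hypothetical.

```shell
# Succeed only if PATH is an existing directory that the current user may
# write to at the filesystem level (permissions, ACLs, read-only mounts).
unix_allows_write() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Usage:
#   unix_allows_write /eos/user2 || echo "blocked below Samba: writeable = Yes cannot help"
```

If this check fails, no Samba setting will make the share writable; if it succeeds, the remaining restriction has to be in the share definition.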

Connecting to Samba with smbclient

smbclient is an ftp-like command-line tool you can use to connect to a Samba server and upload or download files. smbclient supports MS-DFS (Microsoft's Distributed File System). To access the two types of Samba shares described in the previous section of the document, use one of the following commands, with $netid replaced by your Network ID:

smbclient //samba.lafayette.edu/home -U $netid
smbclient //samba.lafayette.edu/shared -U $netid

You will then see a prompt:

Enter your password:

Type your password, then press enter. If the logon was successful, you should see something like:

Domain=[DOMAIN] OS=[Unix] Server=[Samba x.y.z]

And a shell prompt:

smb: \>

Now that you are connected, there are many commands at your disposal, including:

- ls : lists all files/subdirectories of the present directory
- cd [dirname] : moves to the directory specified by [dirname] on the remote server
- lcd [dirname] : moves to the directory specified by [dirname] on the local machine
- mkdir [dirname] : makes a directory with the specified name in the present directory
- rmdir [dirname] : deletes the specified directory
- rm [filename] : deletes the specified file
- get [filename] [localfilename] : downloads the specified file to the local machine. If [localfilename] is specified, renames the file to that on the local machine.
- put [local file name] [remote file name] : copies the file with name [local file name] on the local machine to the server, with name [remote file name], if specified
- exit (or quit) : terminates the connection
- help : displays a list of commands. 'help command' will give you information about 'command'

Note that accessing a different Samba share will require you to log out of smbclient and connect again to the new share. For instance, if you have access to two shares, 'home' and 'shared', if you access the 'home' share and want to transfer a file from there to the 'shared' share, you will have to download the file from 'home' (via the get command), log out, log on to the 'shared' share, and upload the file.
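The download, log out, log on, upload sequence described above can be scripted with the smbclient -c option, which runs a command string and exits. This is a sketch under assumptions: the function name `copy_between_shares` and the `DRY_RUN` switch are hypothetical, the file name is assumed to contain no spaces, and smbclient will prompt for the password once per connection.

```shell
# Copy FILE from one share to another on the same server by way of a local
# temporary copy. Arguments: server, source share, destination share,
# file name (no spaces), user name.
# With DRY_RUN set to a non-empty value the commands are only printed.
copy_between_shares() {
    server=$1 src=$2 dst=$3 file=$4 user=$5
    tmp=${TMPDIR:-/tmp}/$file
    ${DRY_RUN:+echo} smbclient "//$server/$src" -U "$user" -c "get $file $tmp" &&
    ${DRY_RUN:+echo} smbclient "//$server/$dst" -U "$user" -c "put $tmp $file"
}

# Usage:
#   copy_between_shares samba.lafayette.edu home shared report.txt "$netid"
```

The upload only runs if the download succeeded, so a failed get never pushes a stale local copy.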