
A TUTORIAL FOR COLUMBUS USAGE OF SYMMETRY AND PARALLEL CALCULATIONS

Felix Plasser

Institute for – University of Vienna

Vienna, 2011

Table of contents

1. Before starting
1.1 Introduction
1.2 Notation
2. MCSCF single point calculation with symmetry
2.1 Orbital occupation and DRT tables
2.2 Geometry file
2.3 COLINP integral input
2.4 COLINP step: SCF input
2.5 COLINP step: MCSCF input
2.6 Running a MCSCF single point job
2.7 Checking the results
3. Parallel MR-CISD single point calculation
3.1 Basic input
3.2 COLINP step: MRCI input
3.3 Running a parallel MRCI single point job
3.4 Checking the results
3.5 Performance fine tuning
4. Contact
Appendix: Orbital occupation and DRT tables

COLUMBUS tutorial 2

1. Before starting

1.1 Introduction

This tutorial gives an introduction to high performance calculations with COLUMBUS. It shows how explicit symmetry can be used and how parallel MR-CISD calculations can be set up.

The use of symmetry not only improves performance, it also gives more flexibility in the wave function definitions and output that is more immediately meaningful. The disadvantage is that the input is somewhat more complex and that there are more possibilities for errors.

The parallel MR-CISD program pciudg.x is intended for massively parallel calculations on up to hundreds of processors. Efficient load balancing is, however, quite challenging because of the complex structure of MR-CI wave functions. This tutorial presents several tools for setting up such a calculation.

1.2 Notation

The same notation as in the standard tutorial will be used.

This kind of font indicates what is seen on the screen and the command lines that you should write
! Comments come here

Important information related to Columbus but not necessarily connected to the current job comes in boxes like this.

2. MCSCF single point calculation with symmetry

In this section we will prepare a complete input for a single point calculation at the MCSCF level with explicit use of symmetry. The system is p-dimethylaminobenzonitrile (DMABN), which will be calculated using a complete active space composed of 10 electrons in 9 orbitals [CASSCF(10/9)]. Three states will be included in the state averaging procedure (SA-3) and the cc-pVDZ basis set will be used. The calculation will be performed in the C2v point group.

2.1 Orbital occupation and DRT tables

With explicit use of symmetry, the occupation tables are somewhat more complex. The occupations in each irrep have to be obtained from qualitative reasoning, exploratory calculations or trial and error.

1. Fill out the occupation table.

System: DMABN
Point Group: C2v
N. Electrons: 78
Level: MRCI(6,5)/SA-3-CASSCF(10,9)

              IRREP   a1   b1   b2   a2
SCF    DOCC           20    5   12    2
       OPSH
MCSCF  DOCC           20    1   12    1
       RAS
       CAS             0    7    0    2
       AUX
MRCI   FC              8    0    3    0
       FV
       DOCC           12    3    9    1
       ACT             0    3    0    2
       AUX
       INT            12    6    9    3
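The occupation table can be cross-checked with a few lines of arithmetic before it is typed into COLINP. The following is a minimal sketch, not part of COLUMBUS: the variable and function names are illustrative assumptions, and the numbers are copied from the table above. It verifies that the SCF occupation accounts for all 78 electrons, that the MCSCF space is (10,9), that the MRCI reference space is (6,5), and that INT equals DOCC + ACT + AUX in every irrep.

```python
# Occupations per irrep (a1, b1, b2, a2), copied from the occupation table.
N_ELECTRONS = 78
scf_docc = (20, 5, 12, 2)
mcscf_docc = (20, 1, 12, 1)
mcscf_cas = (0, 7, 0, 2)
mrci_fc = (8, 0, 3, 0)
mrci_docc = (12, 3, 9, 1)
mrci_act = (0, 3, 0, 2)
mrci_aux = (0, 0, 0, 0)   # empty AUX row of the table
mrci_int = (12, 6, 9, 3)


def remaining_electrons(n_electrons, *doubly_occupied):
    """Electrons left over after filling the given doubly occupied orbitals."""
    return n_electrons - 2 * sum(sum(occ) for occ in doubly_occupied)


scf_leftover = remaining_electrons(N_ELECTRONS, scf_docc)
cas_electrons = remaining_electrons(N_ELECTRONS, mcscf_docc)
cas_orbitals = sum(mcscf_cas)
mrci_electrons = remaining_electrons(N_ELECTRONS, mrci_fc, mrci_docc)
mrci_orbitals = sum(mrci_act)
int_from_parts = tuple(d + a + x for d, a, x in zip(mrci_docc, mrci_act, mrci_aux))

print(f"SCF electrons not in DOCC: {scf_leftover}")
print(f"CASSCF({cas_electrons},{cas_orbitals})")
print(f"MRCI reference space ({mrci_electrons},{mrci_orbitals})")
print(f"INT consistent with DOCC+ACT+AUX: {int_from_parts == mrci_int}")
```

A mismatch in any of these numbers usually points to a typing error in one row of the table.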

2. Fill out the DRT table.

State   Multiplicity   N. electrons   Symmetry
1       1              78             A1
2       1              78             B2
3       1              78             A1

Number of distinct row tables (DRTs): 2

In this case two DRTs are needed, considering that state averaging is performed over two distinct symmetries.
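The number of DRTs follows from the state table: states that share multiplicity, number of electrons and symmetry belong to the same DRT. A minimal sketch of that bookkeeping (the data layout is an illustrative assumption, the values are those of the table above):

```python
from collections import Counter

# (multiplicity, number of electrons, symmetry) for each state in the DRT table.
states = [
    (1, 78, "A1"),
    (1, 78, "B2"),
    (1, 78, "A1"),
]

# One DRT per distinct combination; the count is the number of states in that DRT.
drt_states = Counter(states)
n_drts = len(drt_states)

for index, (definition, n_states) in enumerate(drt_states.items(), start=1):
    multiplicity, n_electrons, symmetry = definition
    print(f"DRT {index}: multiplicity {multiplicity}, {n_electrons} electrons, "
          f"symmetry {symmetry}, {n_states} state(s)")
print(f"Number of DRTs: {n_drts}")
```

The result, two A1 states in the first DRT and one B2 state in the second, is what has to be entered later in the state-averaging panel of the MCSCF input.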

2.2 Geometry file

3. In the TUTORIAL directory create a subdirectory called DMABN_CAS_SP

4. Move to this directory and create a file called geom.uniq in Columbus format:

 C     6.0    0.00000000    0.00000000    4.94532492   12.00000000
 C     6.0    0.00000000    2.28388802    3.59270636   12.00000000
 C     6.0    0.00000000    2.28719145    0.96449136   12.00000000
 C     6.0    0.00000000    0.00000000   -0.42775069   12.00000000
 C     6.0    0.00000000    0.00000000    7.65220512   12.00000000
 C     6.0    0.00000000    2.36428675   -4.37783677   12.00000000
 N     7.0    0.00000000    0.00000000    9.88116650   14.00307401
 N     7.0    0.00000000    0.00000000   -3.02080924   14.00307401
 H     1.0    0.00000000    4.07497903    4.61485947    1.00782504
 H     1.0    0.00000000    4.09874337   -0.00853452    1.00782504
 H     1.0    0.00000000    1.97923145   -6.40956646    1.00782504
 H     1.0    1.68282187    3.51210363   -3.94559387    1.00782504

Only symmetry unique atoms are given here.
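A quick way to check a symmetry-unique geometry is to generate the full molecule from it and count the atoms. The sketch below is an independent illustration, not a COLUMBUS tool. It assumes that the columns are symbol, nuclear charge, x, y, z and mass, and that the C2 axis lies along z with the mirror planes xz and yz, so that the symmetry images are obtained by flipping the signs of x and y.

```python
from collections import Counter

GEOM_UNIQ = """\
 C     6.0    0.00000000    0.00000000    4.94532492   12.00000000
 C     6.0    0.00000000    2.28388802    3.59270636   12.00000000
 C     6.0    0.00000000    2.28719145    0.96449136   12.00000000
 C     6.0    0.00000000    0.00000000   -0.42775069   12.00000000
 C     6.0    0.00000000    0.00000000    7.65220512   12.00000000
 C     6.0    0.00000000    2.36428675   -4.37783677   12.00000000
 N     7.0    0.00000000    0.00000000    9.88116650   14.00307401
 N     7.0    0.00000000    0.00000000   -3.02080924   14.00307401
 H     1.0    0.00000000    4.07497903    4.61485947    1.00782504
 H     1.0    0.00000000    4.09874337   -0.00853452    1.00782504
 H     1.0    0.00000000    1.97923145   -6.40956646    1.00782504
 H     1.0    1.68282187    3.51210363   -3.94559387    1.00782504
"""


def parse_geom(text):
    """Return (symbol, x, y, z) tuples from six-column geometry lines."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        x, y, z = (float(value) for value in fields[2:5])
        atoms.append((fields[0], x, y, z))
    return atoms


def expand_c2v(atoms):
    """Generate all images under x -> -x and y -> -y (assumed C2v orientation)."""
    full = set()
    for symbol, x, y, z in atoms:
        for sign_x in (1.0, -1.0):
            for sign_y in (1.0, -1.0):
                # Adding 0.0 turns -0.0 into 0.0, so atoms on a mirror plane
                # are stored only once.
                full.add((symbol, sign_x * x + 0.0, sign_y * y + 0.0, z))
    return full


unique_atoms = parse_geom(GEOM_UNIQ)
unique_formula = Counter(symbol for symbol, *_ in unique_atoms)
full_formula = Counter(symbol for symbol, *_ in expand_c2v(unique_atoms))

print("unique atoms:", len(unique_atoms), dict(unique_formula))
print("full molecule:", sum(full_formula.values()), dict(full_formula))
```

Under these assumptions the 12 unique atoms expand to C9 H10 N2, the sum formula of DMABN, while the unique set alone gives the H4 C6 N2 that prepinp reports.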

2.3 COLINP integral input

5. Run:

> $COLUMBUS/colinp

-> 1) Integral program input (for argos/dalton/turbocol/molcas)
   2) SCF input
   3) MCSCF input
   4) CI input
   5) Set up job control
   6) Utilities
   7) Exit the input facility

6. Use the prepinp utility

Run the preparation program (prepinp)? (y|n) y ! Press <ENTER> after the input

7. Enter information about program, symmetry and geometry file.

Input for DALTON (1) or MOLCAS (2): 1

Enter the point group symmetry: c2v ! Only Abelian groups

Name of the file containing the cartesian coordinates of the unique atoms (COLUMBUS format): geom.uniq

Number of atoms = 12 ! verify that the file was read in correctly
Sum formula: H4 C6 N2

8. Enter information about basis sets.

Show only basis sets containing the following string: (e.g. 6-31g, cc-pv - leave empty to show all basis sets) cc-pvd

-- Set basis set --
1: cc-pvdz (4s,1p)->[2s,1p]
3: aug-cc-pvdz
4: aug'-cc-pvdz (without p function in aug-set)
5: d'-aug-cc-pvdz (without p function in d-set)
6: d-aug-cc-pvdz
15: diffuse functions for cc-pvdz(1s,1p)
0: Other library

Select the basis set for atom H: 1 ! cc-pvdz was selected
...
Select the basis set for atom N: 1
...
Select the basis set for atom C: 1
...

Until now you've set the following basis sets:
H :: cc-pvdz (4s,1p)->[2s,1p]
C :: cc-pvdz(9s,4p,1d)->[3s,2p,1d]
N :: cc-pvdz(9s,4p,1d)->[3s,2p,1d]

Reorder geom file for geometry optimization and orbital print out? (y) y ! by default the geometry should be reordered so that the hydrogens come at the end of the file

Normal termination of prepinp. See result in inpcol.

9. Perform an automatic input of iargos.x.

... Would you like to do an interactive input? n ! Select “no” .

If you select “yes”, COLINP will ask you to enter all information about geometry and basis sets again, atom by atom. For ANO basis sets it is currently necessary to go through the whole interactive iargos.x input, but you may use the file inpcol (as created in the previous step) as a template.

2.4 COLINP step: SCF input

10. Select option 2) SCF input.

   1) Integral program input (for argos/dalton/turbocol)
-> 2) SCF input
   3) MCSCF input
   4) CI input
   5) Set up job control
   6) Exit the input facility
...
Do you want a closed shell calculation ? y

Input the no. of doubly occupied orbitals for each irrep, DOCC: 20 5 12 2 ! Insert this according to the orbital occupation table above

The orbital occupation is:

        a1   b1   b2   a2
DOCC    20    5   12    2
OPSH     0    0    0    0

Is this correct?

Would you like to change the default program parameters? Input a title: -->

2.5 COLINP step: MCSCF input

11. Select option 3) MCSCF input

   1) Integral program input (for argos/dalton/turbocol/molcas)
   2) SCF input
-> 3) MCSCF input
   4) CI input
   5) Set up job control
   6) Utilities
   7) Exit the input facility

MCSCF WAVE FUNCTION DEFINITION ======

(for an explanation see the COLUMBUS documentation and tutorial)

prepare input for no(0),CI(1), MCSCF(2), SA-MCSCF(3) analytical gradient 0

Enter number of DRTS [1-8] 2 ! This information is in the DRT Table
number of electrons for DRT #1 (nucl. charge: 78) 78
multiplicity for DRT #1 1
spatial symmetry for DRT #1 1 ! Representation A1 (see Orbital occupation table)
excitation level (cas,ras)->aux 0
excitation level ras->(cas,aux) 0
number of electrons for DRT #2 (nucl. charge: 78) 78
multiplicity for DRT #2 1
spatial symmetry for DRT #2 3 ! Representation B2
excitation level (cas,ras)->aux 0
excitation level ras->(cas,aux) 0

number of doubly occupied orbitals per irrep 20 1 12 1
number of CAS orbitals per irrep 0 7 0 2

Apply add. group restrictions for DRT 1 [y|n] n


Convergence
1. Iterations     #iter [100]   #miter [50 ]   #ciiter [50 ]
2. Thresholds     knorm [1.e-4 ]   wnorm [1.e-4 ]   DE [1.e-8 ]
3. HMC-matrix     build explicitly [n]   diagonalize iteratively [y]
4. Miscallenous   quadratic cnvg. [y]   from #iter [5 ]
   ... only with wnorm < [1.e-3 ]

Resolution (NO)   RAS [NO ]   CAS [NO ]   AUX [NO ]
State-averaging
DRT 1: #states [ 2 ]   weights[ 1,1 ]
DRT 2: #states [1 ]   weights[1 ]
transition moments / non-adiabatic couplings [n]

FINISHED [ y]

In this last panel, the only option that you need to worry about is “#states” (the number of states in the state averaging procedure). In this example we want to perform state averaging over two A1 states (DRT 1) and one B2 state (DRT 2).

2.6 Running a MCSCF single point job

12. Select option 5) Set up job control

   1) Integral program input (for argos/dalton/turbocol/molcas)
   2) SCF input
   3) MCSCF input
   4) CI input
-> 5) Set up job control
   6) Utilities
   7) Exit the input facility

13. Select option 1) Job control for single point or gradient calculation

-> 1) Job control for single point or gradient calculation
   2) Potential energy curve for one int. coordinate
   3) Vibrational frequencies and force constants
   4) Exit

14. Select option 1) Single point calculation

-> 1) single point calculation
   2) geometry optimization with GDIIS
   3) geometry optimization with SLAPAF
   4) saddle point calculation (local search - GDIIS)
   5) stationary point calculation (global search - RGF)
   6) optimization on the crossing seam (GDIIS)
   7) optimization on the crossing seam (POLYHES)
   8) Exit

15. Select the options marked below and finish with “Done with selections”.

-> 1) (Done with selections)
   2) [ X] SCF
   3) [ ] MOLCAS DFT
   4) [ X] MCSCF
   5) [ ] standard MR-CISD with one DRT (ciudg)
   6) [ ] standard MR-CISD with several DRTs (ciudgav)
   7) [ ] parallel MR-CISD (pciudg)
   8) [ ] one-electron properties for all methods
   9) [ ] transition moments for MCSCF
   10) [ ] transition moments for MR-CISD
   11) [ ] single point gradient
   12) [ ] nonadiabatic couplings (and/or gradients)
   13) [ ] value calculation for MR-CISD
   14) [ X] convert MOs into molden format
   15) [ ] get starting MOs from a higher symmetry
   16) [ ] finite field calculation for all methods
   17) [ ] include point charges
   18) [ ] MOLCAS RASSCF/CASSCF
   19) [ ] MOLCAS CASPT2
   20) [ ] MOLCAS CCSD
   21) [ ] MOLCAS CCSD(T)

16. Run Columbus.

> $COLUMBUS/runc -m 1700 > runls

The “-m” option sets the memory allocated to COLUMBUS, in this case 1700 MB.

2.7 Checking the results

17. Check the convergence of the MCSCF calculation in the LISTINGS/mcscfsm.sp file.

...
final mcscf convergence values:
15 -455.5102592748 2.506E-09 1.188E-06 7.493E-07 1.798E-13 T *converged*

18. The MCSCF energies are written to the same file, LISTINGS/mcscfsm.sp.

------ Individual total energies for all states: ------
DRT #1 state # 1 wt 0.333 total energy= -455.640919821, rel. (eV)= 0.000000
DRT #1 state # 2 wt 0.333 total energy= -455.420326215, rel. (eV)= 6.002660
DRT #2 state # 1 wt 0.333 total energy= -455.469531788, rel. (eV)= 4.663708
------

The states are ordered according to DRT number rather than state energies.
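Because the listing is ordered by DRT, it can help to re-sort the states by energy and to confirm that the converged MCSCF energy is the weighted average of the three states. The sketch below uses the energies printed above; the conversion factor of 27.211386 eV per hartree is a standard constant and not taken from the listing, so the relative energies may differ from the printed ones in the last digits.

```python
HARTREE_TO_EV = 27.211386  # standard conversion factor (not from the listing)

# (DRT number, state number within the DRT, total energy in hartree)
states = [
    (1, 1, -455.640919821),
    (1, 2, -455.420326215),
    (2, 1, -455.469531788),
]
weights = [1.0 / len(states)] * len(states)  # equal weights, printed as 0.333

# State-averaged energy: weighted sum of the individual state energies.
average_energy = sum(w * energy for w, (_, _, energy) in zip(weights, states))

# Relative energies with respect to the lowest state, sorted by energy.
lowest = min(energy for _, _, energy in states)
by_energy = sorted(states, key=lambda state: state[2])
relative_ev = [(energy - lowest) * HARTREE_TO_EV for _, _, energy in by_energy]

print(f"state-averaged energy: {average_energy:.10f}")
for (drt, state, energy), rel in zip(by_energy, relative_ev):
    print(f"DRT #{drt} state #{state}: {energy:.9f}  {rel:.4f} eV")
```

Sorted by energy, the B2 state (DRT #2, state 1) is the first excited state, about 4.66 eV above the ground state, and the second A1 state follows at about 6.00 eV.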

3. Parallel MR-CISD single point calculation

In this section we will perform an MR-CISD calculation using the parallel program. We will explore different load balancing schemes that make massively parallel calculations possible. The calculations will be performed for the lowest state of B2 symmetry. This example, containing about 60 million configurations, can easily be computed on a standard node with 8 CPUs. Note that larger expansions may need significantly more memory, in which case the calculation has to be distributed over more nodes.

3.1 Basic input

1. In the TUTORIAL directory create a subdirectory called DMABN_MRCISD.

2. Copy the input files from the MCSCF directory:

> cp ../DMABN_CAS_SP/* .

3. Copy the molecular orbital file generated in the MCSCF calculation.

> cp ../DMABN_CAS_SP/MOCOEFS/mocoef_mc.sp mocoef

3.2 COLINP step: MRCI input

> $COLUMBUS/colinp

   1) Integral program input (for argos/dalton/turbocol)
   2) SCF input
   3) MCSCF input
-> 4) CI input
   5) Set up job control
   6) Utilities
   7) Exit the input facility

CI WAVE FUNCTION DEFINITION ... press return to continue

-> 1) Def. of CI wave function - one-DRT case
   2) Def. of CI wave function - multiple-DRT case
   3) Skip DRT input (old input files in the current directory)

Do you want to compute gradients or non-adiabatic couplings? [y|n] n

4. The orbital occupation and DRT information is defined in the following options:

Multiplicity: #electrons: Molec. symm.

count order (bottom to top): fc-docc-active-aux-extern-fv
irreps         a1  b1  b2  a2
# basis fcts   82  35  62  25

Enter the multiplicity 1 ! we want to calculate a singlet
Enter the number of electrons 78
Enter the molec. spatial symmetry 3 ! B2 symmetry
number of frozen core orbitals per irrep 8 0 3 0
number of frozen virt. orbitals per irrep 0 0 0 0
number of internal(=docc+active+aux) orbitals per irrep 12 6 9 3
ref doubly occ orbitals per irrep 12 3 9 1
auxiliary internal orbitals per irrep 0 0 0 0
Enter the excitation level (0,1,2) 2
Generalized interacting space restrictions [y|n] y ! faster
Enter the allowed reference symmetries 3

5. The job summary should look like this now:

Multiplicity:1 #electrons:78 Molec. symm.:b2
count order (bottom to top): fc-docc-active-aux-extern-fv
irreps         a1  b1  b2  a2
# basis fcts   82  35  62  25
frozen core     8   0   3   0
frozen virt     0   0   0   0
internal       12   6   9   3
ref. docc.     12   3   9   1
ci active       0   3   0   2
ci auxiliary    0   0   0   0
external       62  29  50  22
exc.level:2 gen.space:y allowed ref. syms:3
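The derived rows of the job summary can be reproduced from the values that were typed in. In this minimal sketch (variable names are illustrative assumptions) the active orbitals are the internal orbitals that are neither reference-doubly-occupied nor auxiliary, and the external orbitals are whatever remains of the basis after frozen core, internal and frozen virtual orbitals have been removed.

```python
# Per-irrep values (a1, b1, b2, a2) as entered in the CI input dialogue.
basis_functions = (82, 35, 62, 25)
frozen_core = (8, 0, 3, 0)
frozen_virtual = (0, 0, 0, 0)
internal = (12, 6, 9, 3)
ref_docc = (12, 3, 9, 1)
auxiliary = (0, 0, 0, 0)

# active = internal - docc - aux
ci_active = tuple(i - d - a for i, d, a in zip(internal, ref_docc, auxiliary))

# external = basis functions - frozen core - internal - frozen virtual
external = tuple(
    n - fc - i - fv
    for n, fc, i, fv in zip(basis_functions, frozen_core, internal, frozen_virtual)
)

print("ci active:", ci_active)
print("external: ", external)
```

If the values printed here differ from the "ci active" and "external" rows shown by COLINP, one of the per-irrep inputs was probably mistyped.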

6. Proceed with the input:

Apply additional group restrictions for DRT [y|n] n

Choose CI program: sequential ciudg [1]; parallel ciudg[2] 2

7. The last panel will appear. Change the settings for the parallel calculation.

Type of calculation: CI [Y] AQCC [N] AQCC-LRT [N]
LRT shift: LRTSHIFT [0 ]
State(s) to be optimized NROOT [ 1 ] ROOT TO FOLLOW [0]
Reference space diagonalization INCORE[Y] NITER [ ] RTOL [1e-3,1e-3, ]
Bk-procedure: NITER [1 ] MINSUB [1 ] MAXSUB [6 ] RTOL [1e-3,1e-3, ]
CI/AQCC procedure: NITER [20 ] MINSUB [3 ] MAXSUB [6 ] RTOL [1e-3,1e-3, ]

parallel options
NCPU [ 8 ]
MAX. MEMORY PER PROCESS (MB) [ 1700 ]
EFF. AVERAGE BANDWIDTH (MB/(s*ncpu)) [ 50 ]
INTEGRAL FILES ON DISK: [N]
NUMBER OF PROCESSORS PER NODE [4 ]

FINISHED [ ]

MAXSUB: this determines the maximum number of CI vectors stored in memory and therefore the amount of memory needed
NCPU: number of processors
MEMORY PER PROCESS: this should be somewhat lower than the physical memory available per CPU
EFF. AVERAGE BANDWIDTH: this has some influence on the final segmentation scheme
INTEGRAL FILES ON DISK: it is recommended to choose “N” here, so that the calculation time is not determined by disk I/O
NUMBER OF PROCESSORS PER NODE: this is only of interest if the files are on disk

8. Do not select any transition moments

9. Pre-segmentation

Parallel CI input parameter generation

Do you want to:
[0] Keep default ciudgin file
[1] Perform customization using cidrt.x, makpciudg.x, cimkseg.x
[2] Run programs in batchmode (and return to [3] afterwards)
[3] Call only pscript (use this if [1] or [2] was already called)
Choice: [ 1]

Here some heuristics are called that allow for an optimal segmentation of the CI vector. [0] is only recommended for small calculations. [1] directly calls the preparation programs. For larger calculations, [1] may take too long to run inside colinp; in that case [2] allows the programs to be run in batch mode outside of it.

CIDRT calculation performed successfully
total number of configurations: 60834170
makpciudg.x: DRT Integral filesizes in bytes
diagint 393120
ofdg 43734600
fil3x 131854408
fil3w 133689360
fil4w 182446656
fil4x 177466072
Executing cimkseg.x, this may take several minutes, please wait...
progress can be monitored in file cimkseg.info
...
setting up prelimniary segmentation ....

ncpu   task   coremem   totvol     tcomm      bandwidth
8      38     1249      13.54 GB   277.24 s   50.0 MB/s

pciudg (runc) must be called with at least -m 1249

An important functionality of this script is determining the amount of memory needed. If you get an error message “not enough memory”, rerun it specifying a larger NCPU or more MEMORY PER PROCESS.
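The memory requirement reported by the script can be compared with the value that will later be passed to runc. A small sketch, assuming only the message format shown above (the variable names are hypothetical, and 1700 MB is the value entered in the parallel options panel):

```python
import re

# Message printed at the end of the pre-segmentation step.
message = "pciudg (runc) must be called with at least -m 1249"
requested_mb = 1700  # MAX. MEMORY PER PROCESS entered in the last panel

match = re.search(r"-m\s+(\d+)", message)
if match is None:
    raise ValueError("no memory requirement found in the message")
required_mb = int(match.group(1))
headroom_mb = requested_mb - required_mb

if headroom_mb < 0:
    print(f"not enough memory: need {required_mb} MB, requested {requested_mb} MB")
else:
    print(f"required {required_mb} MB, requested {requested_mb} MB, "
          f"headroom {headroom_mb} MB")
```

With the numbers of this example the requested 1700 MB per process leaves a few hundred MB of headroom over the reported minimum.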

You can check the segmentation information in the ciudgin file

nseg4x=1,1,2,2
nnseg4= 1,2956,30406,34805,38851,43550
nseg3x=1,1,2,3
nnseg3= 1,2956,30406,34205,38851,42250,44950
nseg2x=1,1,1,2
nnseg2= 1,2956,30406,38851,46350
nsegwx=1,1,2,2
nnsegwx= 1,2956,30406,35605,38851,43050
nseg1x=1,1,3,3
nnseg1= 1,2956,30406,32505,35705,38851,42350,45950
nseg0x=2,1,2,2
nnseg0= 1,2700,2956,30406,37705,38851,46650
maxseg=8
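The segmentation entries can be sanity-checked after manual edits. The following sketch relies only on a pattern that can be read off the listing above and is an assumption rather than documented ciudgin semantics: for each integral type, the counts in nsegNx add up to the number of entries in the matching nnsegN list, those entries increase strictly, and maxseg equals the largest total number of segments.

```python
CIUDGIN_SEGMENTS = """\
nseg4x=1,1,2,2
nnseg4= 1,2956,30406,34805,38851,43550
nseg3x=1,1,2,3
nnseg3= 1,2956,30406,34205,38851,42250,44950
nseg2x=1,1,1,2
nnseg2= 1,2956,30406,38851,46350
nsegwx=1,1,2,2
nnsegwx= 1,2956,30406,35605,38851,43050
nseg1x=1,1,3,3
nnseg1= 1,2956,30406,32505,35705,38851,42350,45950
nseg0x=2,1,2,2
nnseg0= 1,2700,2956,30406,37705,38851,46650
maxseg=8
"""

# (segment counts, segment entries) keyword pairs as they appear in the listing.
PAIRS = [
    ("nseg4x", "nnseg4"), ("nseg3x", "nnseg3"), ("nseg2x", "nnseg2"),
    ("nsegwx", "nnsegwx"), ("nseg1x", "nnseg1"), ("nseg0x", "nnseg0"),
]


def parse_lists(text):
    """Parse 'key=v1,v2,...' lines into a dict of integer lists."""
    values = {}
    for line in text.splitlines():
        key, separator, raw = line.partition("=")
        if separator:
            values[key.strip()] = [int(item) for item in raw.split(",") if item.strip()]
    return values


values = parse_lists(CIUDGIN_SEGMENTS)
segment_totals = {}
all_consistent = True
for count_key, entry_key in PAIRS:
    entries = values[entry_key]
    segment_totals[count_key] = sum(values[count_key])
    increasing = all(a < b for a, b in zip(entries, entries[1:]))
    if len(entries) != segment_totals[count_key] or not increasing:
        all_consistent = False

largest_total = max(segment_totals.values())
maxseg = values["maxseg"][0]
print(segment_totals, "maxseg:", maxseg, "consistent:", all_consistent)
```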

3.3 Running a parallel MRCI single point job

10. Enter the job control menu

   1) Integral program input (for argos/dalton/turbocol/molcas)
   2) SCF input
   3) MCSCF input
   4) CI input
-> 5) Set up job control
   6) Utilities
   7) Exit the input facility

11. Select a single point job:

-> 1) Job control for single point or gradient calculation
   2) Potential energy curve for one int. coordinate
   3) Vibrational frequencies and force constants
   4) Exit

12. Select single point calculation

-> 1) single point calculation
   2) geometry optimization
   3) saddle point calculation (local search - GDIIS)
   4) stationary point calculation (global search - RGF)
   5) nonadiabatic coupling (single point)
   6) optimization on the crossing seam (GDIIS)
   7) optimization on the crossing seam (POLYHES)
   8) Exit

13. Choose the following options:

-> 1) (Done with selections)
   2) [ ] SCF
   3) [ ] MOLCAS DFT
   4) [ ] MCSCF
   5) [ ] standard MR-CISD with one DRT (ciudg)
   6) [ ] standard MR-CISD with several DRTs (ciudgav)
   7) [ X] parallel MR-CISD (pciudg)
   8) [ ] one-electron properties for all methods
   9) [ ] transition moments for MCSCF
   10) [ ] transition moments for MR-CISD
   11) [ ] single point gradient
   12) [ ] nonadiabatic couplings (and/or gradients)
   13) [ ] value calculation for MR-CISD
   14) [ ] convert MOs into molden format
   15) [ ] get starting MOs from a higher symmetry
   16) [ ] finite field calculation for all methods
   17) [ ] include point charges
   18) [ ] MOLCAS RASSCF/CASSCF
   19) [ ] MOLCAS CASPT2
   20) [ ] MOLCAS CCSD
   21) [ ] MOLCAS CCSD(T)

14. Run Columbus.

Use an appropriate command to run parallel Columbus. For more information check

http://www.univie.ac.at/columbus/docs_COL70/parallel/parallel_input_exec.html

> $COLUMBUS/runc -m 1700 -msmp 15000 -machinefile $TMPDIR/machines -nproc 8 > runls

Running this should take about 20 minutes.

3.4 Checking the results

15. Check the convergence of the MR-CI calculation in the LISTINGS/ciudgsm.sp file.

...
mr-sdci convergence criteria satisfied after 14 iterations.
...

16. The MRCI energies are written to the same file, LISTINGS/ciudgsm.sp.

final mr-sdci convergence information:
mr-sdci # 14 1 -456.5072587416 6.1801E-07 2.6376E-07 9.1062E-04 1.0000E-03

17. Performance information is written to WORK/ciudg.perf.

------ wall clock times for the individual iterations ------
iter   mult      multnx    loop     sync     total     pdegree
       [s]       [s]       [s]      [s]      [s]
...
11     61.8247   61.3666   3.6719   0.6392   65.6659   0.9898
12     62.8940   62.4382   3.6703   0.4128   65.3042   0.9935
13     61.8384   61.4039   3.6460   0.2113   64.2448   0.9966
14     61.9524   61.5124   3.6642   0.5044   65.6603   0.9919

The presegmentation was good enough to provide a parallelization degree of about 99%. The average total time per iteration is about 65 s.
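The averages quoted above can be recomputed from the timing rows. This is a minimal sketch that assumes the whitespace-separated column layout shown in the listing (iter, mult, multnx, loop, sync, total, pdegree); only the four iterations printed in this tutorial are used.

```python
PERF_ROWS = """\
11 61.8247 61.3666 3.6719 0.6392 65.6659 0.9898
12 62.8940 62.4382 3.6703 0.4128 65.3042 0.9935
13 61.8384 61.4039 3.6460 0.2113 64.2448 0.9966
14 61.9524 61.5124 3.6642 0.5044 65.6603 0.9919
"""
COLUMNS = ("iter", "mult", "multnx", "loop", "sync", "total", "pdegree")

# One dict per iteration, keyed by column name.
rows = [
    dict(zip(COLUMNS, (float(value) for value in line.split())))
    for line in PERF_ROWS.splitlines()
    if line.strip()
]

mean_total = sum(row["total"] for row in rows) / len(rows)
mean_pdegree = sum(row["pdegree"] for row in rows) / len(rows)
min_pdegree = min(row["pdegree"] for row in rows)

print(f"mean total time per iteration: {mean_total:.2f} s")
print(f"mean parallelization degree:   {mean_pdegree:.4f}")
print(f"lowest parallelization degree: {min_pdegree:.4f}")
```

For a full run the same parsing can be applied to all iteration rows of ciudg.perf, which gives a more representative average than the last four iterations alone.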

3.5 Performance fine tuning

In this section the tools for performance fine tuning are introduced. Typically they are run after an MR-CI calculation with only 3-4 iterations, or the results can be reused for different geometries of the same molecule. Performance fine tuning should give a calculation with a high parallelization degree, possibly also on a larger number of nodes.

18. Optionally backup the LISTINGS directory of the first run.

19. Run $COLUMBUS/colinp and select option 6) Utilities.

   1) Integral program input (for argos/dalton/turbocol/molcas)
   2) SCF input
   3) MCSCF input
   4) CI input
   5) Set up job control
-> 6) Utilities
   7) Exit the input facility

20. Select option 2) Parallel CI: load balancing.

submenu 1: Utilities

   1) Basic information
-> 2) Parallel CI: load balancing
   3) MOLCAS support info
   4) Reorder MOs
   5) Change the basis set
   6) Convert MOs to molden format
   7) CHELPG calculation
   8) Linear interpolation of (internal) coordinates (LIC)
   9) Exit

21. Enter the information (you can enter several NCPU values if you want to).

PARTITIONING OF THE TOTAL LOAD

Partitioning based upon the timing info (ciudg.perf) and configuration space structure (SEGANALYSE); computed for the colon separated list of processor counts.

NCPU (list (2,4 ..) or range ([from,to ,inc]) [8,24,64 ]
maximum subspace dimension [6 ]
maximum memory per node in MB [1700 ]
maximum eff. bandwidth per node in MB/s [50 ]
integrals on disk [n]

------Partitioning of the total load ------

p#   ncpu   #task   memory       volume     tcomm     bandwidth   wall clock   maxseg   maxtask
1    8      28      1187.29 MB   8.14 GB    20.85 s   50.0 MB/s   83.2 s       5        72.8 s
2    24     74      598.29 MB    11.81 GB   10.08 s   50.0 MB/s   30.4 s       9        24.4 s
3    72     244     234.43 MB    20.85 GB   5.93 s    50.0 MB/s   12.4 s       21       8.6 s

Select partitioning (enter partition number p#) 2

The two files created by this are ciudgin and TASKLIST. You may now rerun the job using this setup, perform the calculation for a different geometry, etc.

It is advisable to rerun this calculation on at least two nodes to check that internode communication is working properly. This CI expansion is probably too small to get 100% parallelization degree on a larger number of CPUs, but you should observe that the calculation speeds up significantly. With these settings for 24 CPUs we obtained CI-iteration times of 25s and a parallelization degree of about 90%.
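The partitioning table can be used to judge how well each choice of NCPU is expected to scale. The sketch below computes the speedup and parallel efficiency relative to the 8-CPU partition from the wall clock column. It also checks an observation about the table that is an assumption rather than documented behaviour: the tcomm values are close to the communication volume divided by the bandwidth times the number of CPUs (taking 1 GB = 1024 MB).

```python
BANDWIDTH_MB_PER_S = 50.0

# (ncpu, volume in GB, tcomm in s, wall clock in s) from the partitioning table.
PARTITIONS = [
    (8, 8.14, 20.85, 83.2),
    (24, 11.81, 10.08, 30.4),
    (72, 20.85, 5.93, 12.4),
]

ref_ncpu, _, _, ref_wall = PARTITIONS[0]
summary = []
for ncpu, volume_gb, tcomm, wall in PARTITIONS:
    speedup = ref_wall / wall                 # relative to the 8-CPU partition
    efficiency = speedup / (ncpu / ref_ncpu)  # 1.0 would be ideal scaling
    # Assumed relation: volume / (bandwidth per CPU * number of CPUs).
    tcomm_estimate = volume_gb * 1024.0 / (BANDWIDTH_MB_PER_S * ncpu)
    summary.append((ncpu, speedup, efficiency, tcomm, tcomm_estimate))

for ncpu, speedup, efficiency, tcomm, tcomm_estimate in summary:
    print(f"{ncpu:3d} CPUs: speedup {speedup:5.2f}, efficiency {efficiency:5.1%}, "
          f"tcomm {tcomm:6.2f} s (estimate {tcomm_estimate:6.2f} s)")
```

On these numbers the 24-CPU partition keeps roughly 91% efficiency relative to 8 CPUs, while the 72-CPU partition drops to about 75%, which is in line with the remark that this CI expansion is rather small for a large number of CPUs.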

4. Contact

If you have any questions or comments about this tutorial, feel free to write me an email:

[email protected]

Appendix: Orbital occupation and DRT tables

System:
Point Group:
N. Electrons:
Level:

              IRREP
SCF    DOCC
       OPSH
MCSCF  DOCC
       RAS
       CAS
       AUX
MRCI   FC
       FV
       DOCC
       ACT
       AUX
       INT

State   Multiplicity   N. electrons   Symmetry

Number of distinct row tables (DRTs):

        1    2     3     4     5     6     7     8
D2h     ag   b3u   b2u   b1g   b1u   b2g   b3g   au
D2      a    b2    b1    b3
C2h     ag   bu    au    bg
C2v     a1   b1    b2    a2
Ci      ag   au
Cs      a´   a´´
C2      a    b
C1      a
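The irrep numbers requested by COLINP (for example "spatial symmetry 3" for B2 in C2v) follow the ordering of the table above. A small lookup sketch, with the orderings copied from that table; the function names are illustrative and the primes of Cs are written as plain apostrophes:

```python
# Irrep ordering per point group, copied from the table above.
IRREP_ORDER = {
    "d2h": ("ag", "b3u", "b2u", "b1g", "b1u", "b2g", "b3g", "au"),
    "d2": ("a", "b2", "b1", "b3"),
    "c2h": ("ag", "bu", "au", "bg"),
    "c2v": ("a1", "b1", "b2", "a2"),
    "ci": ("ag", "au"),
    "cs": ("a'", "a''"),
    "c2": ("a", "b"),
    "c1": ("a",),
}


def irrep_number(point_group, label):
    """Return the 1-based number of an irrep label within a point group."""
    return IRREP_ORDER[point_group.lower()].index(label.lower()) + 1


def irrep_label(point_group, number):
    """Return the irrep label that belongs to a 1-based irrep number."""
    return IRREP_ORDER[point_group.lower()][number - 1]


print(irrep_number("C2v", "B2"))  # number to enter for a B2 state in C2v
print(irrep_label("C2v", 1))      # label of the first irrep in C2v
```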
