University of Groningen

Novel halohydrin dehalogenases by protein engineering and database mining

Schallmey, Marcus

Document version: Publisher's PDF, also known as Version of record

Publication date: 2015


Citation for published version (APA): Schallmey, M. (2015). Novel halohydrin dehalogenases by protein engineering and database mining. University of Groningen.


                   


1 Junior Professorship for Biocatalysis, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany. 2 Enzymicals AG, Walther-Rathenau-Straße 49a, 17489 Greifswald, Germany.

Schallmey M, Koopmeiners J, Wells E, Wardenga R, Schallmey A: Expanding the halohydrin dehalogenase enzyme family: Identification of novel enzymes by database mining. Applied and Environmental Microbiology 2014, 80:7303–7315. © 2014 American Society for Microbiology


Halohydrin dehalogenases are very rare enzymes that are naturally involved in the mineralization of halogenated xenobiotics. Owing to their catalytic potential and promiscuity, many biocatalytic reactions have been described, which have led to several interesting and industrially important applications. Nevertheless, only a handful of these enzymes have been made available through recombinant techniques, and it is therefore of general interest to expand the repertoire of these enzymes to enable novel biocatalytic applications. After identification of specific sequence motifs, 37 novel enzyme sequences were readily identified in public sequence databases. All enzymes that could be heterologously expressed also catalyzed typical halohydrin dehalogenase reactions. Phylogenetic inference for enzymes of the halohydrin dehalogenase enzyme family confirmed that all enzymes form a distinct monophyletic clade within the short-chain dehydrogenase/reductase superfamily. In addition, the majority of novel enzymes are substantially different from previously known phylogenetic subtypes. Consequently, four additional phylogenetic subtypes were defined, which considerably expand the halohydrin dehalogenase enzyme family. We show that the enormous wealth of environmental and genome sequences present in public sequence databases can be tapped for the in silico identification of very rare but nonetheless biotechnologically important biocatalysts. Our findings help to readily identify halohydrin dehalogenases in ever-growing sequence databases and, in consequence, make even more members of this interesting enzyme family available to the scientific and industrial community.

MS and AS designed the experiments. MS performed the homology searches and extracted the sequence motifs. MS designed the synthetic genes, which were then subcloned and expressed, and the enzymes characterized, together with JK and EW. MS, JK, EW, RW, and AS wrote the manuscript.


Halohydrin dehalogenases (HHDHs; also called haloalcohol dehalogenases, haloalcohol/halohydrin epoxidases, or hydrogen-halide lyases; EC 4.5.1.-) are biotechnologically relevant enzymes that catalyze the reversible dehalogenation of β-haloalcohols with formation of the corresponding epoxides [1, 2, 188]. Besides being useful for the production of enantiopure haloalcohols [24, 29, 63] and epoxides [29, 107, 174, 175], these enzymes can also be applied in the formation of novel carbon-carbon, carbon-nitrogen, or carbon-oxygen bonds. Owing to their promiscuous epoxide ring-opening activity, nucleophiles such as cyanide, azide, or nitrite are accepted in the ring-opening reaction, leading to a diverse range of products [45]. Examples of relevant HHDH applications are the production of optically pure C3 or C4 fine chemical precursors [29, 31, 76, 174], including the multi-ton scale production of enantiopure (R)-4-cyano-3-hydroxybutyrate esters for statin drugs [47], and the production of chiral tertiary alcohols [79, 80, 100, 101], which are rather challenging to obtain by conventional organic synthesis (Figure 1) [104, 106].

Figure 1. Examples of HHDH-catalyzed reactions include A) the preparation of (optically pure) haloalcohols and epoxides [29] as well as B) synthetic routes towards statin side chain precursors [47] and C) tertiary alcohols [80, 100].

Despite their potential as biocatalysts, few HHDHs have thus far been made available to the scientific and industrial community since the initial discovery of bacterial enzymes with HHDH activities more than 45 years ago. Since then, a few bacterial species have been reported to possess HHDH activity, but only very few HHDH enzymes have been purified and characterized biochemically, as recently reviewed in detail elsewhere [2, 188]. Of these, only six HHDH genes have been cloned and expressed recombinantly, namely hheA from Corynebacterium sp. strain N-1074 [32], hheA2 from Arthrobacter sp. strain AD2 [9], hheB from Corynebacterium sp. strain N-1074 [32], hheB2 from Mycobacterium sp. strain GP1 [9], and two identical hheC sequences from Agrobacterium radiobacter AD1 [9] and Rhizobium sp. strain NHG3 [33]. All of the cloned HHDHs belong to the short-chain dehydrogenase/reductase (SDR) superfamily and exhibit several major features of this diverse enzyme class [9, 56, 58]. For example, all known HHDHs make use of a catalytic triad, share the commonly found homomultimeric quaternary assembly, and both crystallized HHDHs, namely HheA2 [39] and HheC [38], possess a tertiary structure similar to that of other Rossmann-fold proteins. Besides these overall similarities to SDR enzymes, HHDHs can be distinguished from SDR enzymes by a combination of mechanistic and sequence/structure characteristics. In classical SDR enzymes, the concerted activity of Ser-Tyr-Lys abstracts a proton from the substrate’s hydroxyl group, and an enzyme-bound NAD(P)+ is responsible for hydride abstraction. In contrast, HHDHs possess a catalytic triad composed of Ser-Tyr-Arg and, instead of a cofactor-binding site, a spacious anion-binding pocket is present in the structures of HheA2 and HheC [38, 39]. In consequence, all known HHDH sequences form only a minute but well-defined fraction of the SDR superfamily, of which more than 163,000 enzymes can be retrieved from UniProt [10]. Based on activity profiles and sequence identities, the available HHDH enzymes have been classified into three different phylogenetic subgroups, namely type A, B, or C [9]. HHDHs within each type share more than 97% sequence identity, while the identity between enzymes of different types is below 33%. Due to these high sequence identities within each of the three subtypes, only three different HHDH enzymes are currently available for biotechnological exploitation.
Although these few known HHDHs have already given rise to many interesting applications, it is of great interest to increase the number of functionally diverse HHDH enzymes [2, 188]. A viable approach to generate novel and functionally diverse enzymes employs rational and random protein engineering strategies which can yield drastically improved and functionally diverse enzyme variants [190]. Such strategies have been successfully applied to HheA2 [41, 176, 203] and HheC [34, 37, 47, 110] addressing specific drawbacks of the respective parental enzymes and yielding HHDH mutants with sometimes substantial improvements towards target reactions. Nevertheless, parental sequences will always govern the overall accessible sequence space in every protein engineering study. For example, HheC has been shown to be extraordinarily tolerant to mutations at 153 of its total 254 residues with a maximum of 42 simultaneous substitutions per variant described [47]. However, these heavily engineered enzyme variants are still rather similar to parental HheC, e.g. HheC-2360 [177] with more than 85% sequence identity. Clearly, novel sequences would be a valuable addition to the functional diversity of the HHDH enzyme toolbox. Further, novel HHDH enzymes might already exhibit activities or characteristics which are difficult to engineer or even unlikely to be accessible by laboratory evolution. Herein, we report the identification of novel HHDHs in publicly available sequence databases by making use of specific sequence motifs which allow for the unambiguous discrimination of true HHDH sequences from the vast number of other SDR sequences.


Sequence characteristics of HHDHs

First, in order to identify novel HHDH enzymes, all known HHDH sequences were inspected for distinctive residues that distinguish HHDHs from other SDR enzymes. As deduced from a previous ClustalW multiple sequence alignment (MSA) and confirmed by mutational studies [9], all known HHDHs possess a catalytic triad of Ser-Tyr-Arg, which aligns with the catalytic residues Ser-Tyr-Lys present in SDR sequences. Thus, the presence of Arg in the HHDH catalytic triad can be used as an initial criterion to separate putative novel HHDHs from other SDR sequences. For the reliable discrimination of putative novel HHDH enzymes, however, an additional criterion was derived from alignment as well as structural data of all known HHDHs (see below). The crystal structures of HheA2 [39] and HheC [38] show that both HHDHs possess a spacious anion-binding pocket, which is formed in part by residues that align with residues of a Gly-rich motif responsible for nucleotide cofactor binding in Rossmann-fold enzymes such as SDR enzymes [56, 58]. Specifically, the large residue F12 in both HheA2 and HheC is essential for the formation of the HHDH anion-binding pocket and replaces a central small Gly or Ala in the T-G-x(3)-[GA]-x-G nucleotide-binding motif of classical SDR enzymes [38, 39, 56]. In the first ClustalW alignments of all known HHDHs [9], residue R7 of both known B-type enzymes aligned with residue F12 of the other HHDHs. The overall alignment quality, however, was low in this region. Later, an improved alignment of all known HHDHs was published, which also incorporated structural information and showed that Y27 of both B-type enzymes aligns with F12 of the other HHDHs [38]. Thus, in all known HHDH enzymes, which share sequence identities as low as 33%, a large aromatic amino acid (Phe or Tyr) disrupts the commonly observed Gly-rich cofactor-binding motif of SDR enzymes and might therefore indicate sequences with HHDH activity.
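The motif notation used above can be illustrated with a minimal Python sketch that converts the PROSITE-style pattern T-G-x(3)-[GA]-x-G into a regular expression and scans ungapped sequences for it. The helper names `prosite_to_regex` and `has_motif` and the SDR-like toy fragment are assumptions made for this illustration; the HheC fragment is the N-terminal stretch from the alignment in this chapter with extraction spaces removed. A plain motif scan of this kind is only a rough proxy for the alignment-based inspection described in the text.

```python
import re

SDR_MOTIF = "T-G-x(3)-[GA]-x-G"  # nucleotide-binding motif quoted in the text


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex.

    Supports single residues (T), wildcards (x), residue sets ([GA]) and
    fixed repeats such as x(3); other PROSITE syntax is not handled.
    """
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, repeat = match.groups()
        token = "." if residue == "x" else residue
        parts.append(token + (f"{{{repeat}}}" if repeat else ""))
    return "".join(parts)


def has_motif(sequence: str, pattern: str = SDR_MOTIF) -> bool:
    """Return True if the ungapped sequence contains the motif."""
    return re.search(prosite_to_regex(pattern), sequence) is not None


# Hypothetical SDR-like fragment, invented for illustration only.
toy_sdr = "MKVALVTGASSGIGKAIA"
# N-terminal fragment of HheC as given in the alignment; F12 occupies the
# position of the central Gly/Ala of the classical motif.
hhec_nterm = "MSTAIVTNVKHFGGMGSALRLSEAGHTVACH"

print(prosite_to_regex(SDR_MOTIF))
print(has_motif(toy_sdr), has_motif(hhec_nterm))
```

In this sketch the toy SDR-like fragment contains the Gly-rich motif, whereas the HheC fragment does not, in line with the replacement of the central Gly/Ala by F12.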
Instead of a structural alignment algorithm, we used MAFFT [204] because of its high computational efficiency and accuracy in the alignment of thousands of individual sequences [205, 206]. Furthermore, MAFFT also correctly aligns F12 with Y27 as well as the catalytic triad residues of the respective known HHDH enzymes. In conclusion, we propose that HHDH enzymes can be distinguished from other SDR enzymes by the presence of the HHDH catalytic triad and a conserved aromatic Phe or Tyr that replaces the central small Gly or Ala in the T-G-x(3)-[GA]-x-G motif of classical SDR enzymes.

Database mining

To identify novel HHDH sequences, blastp searches were performed to collect homologous sequences from GenBank, which could then be assessed for the presence of both conserved HHDH sequence features. Initially, homologous sequences were collected by blastp from the GenBank nr protein sequence database with the sequences of HheA, HheB, and HheC as queries. Afterwards, a MAFFT (FFT-NS-2) alignment was used to identify and exclude from the sequence pool the large majority of putative SDR sequences with catalytic Ser-Tyr-Lys residues. Then, after realignment with MAFFT (FFT-NS-2), sequences that lacked the catalytic Ser-Tyr-Arg triad of known HHDH enzymes were removed. This much smaller sequence set was then aligned by MAFFT (L-INS-i) to identify putative novel HHDH sequences that also possess the conserved aromatic Phe or Tyr present in known HHDH enzymes. This entire process was iterated until no further putative novel HHDH sequence could be identified (Figure 2).
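The three filtering criteria can be sketched in Python as a classification over alignment columns. This is a minimal illustration under stated assumptions, not the procedure actually used: the function names `column_of` and `classify`, the toy alignment, and the reference residue numbers are all hypothetical, and a real run would take the columns from a MAFFT alignment that includes a known HHDH as reference.

```python
from typing import Dict

GAP = "-"


def column_of(aligned_ref: str, residue_number: int) -> int:
    """Map a 1-based residue number of the ungapped reference to its alignment column."""
    seen = 0
    for column, char in enumerate(aligned_ref):
        if char != GAP:
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError("residue number exceeds reference length")


def classify(aligned: Dict[str, str], ref_name: str, ref_sites: Dict[str, int]) -> Dict[str, str]:
    """Label each aligned sequence following the order of the three filters."""
    ref = aligned[ref_name]
    cols = {site: column_of(ref, number) for site, number in ref_sites.items()}
    labels = {}
    for name, seq in aligned.items():
        triad = seq[cols["ser"]] + seq[cols["tyr"]] + seq[cols["third"]]
        if triad == "SYK":
            labels[name] = "removed: S-Y-K"      # typical SDR triad
        elif triad != "SYR":
            labels[name] = "removed: no S-Y-R"   # lacks the HHDH triad
        elif seq[cols["aromatic"]] not in "FY":
            labels[name] = "removed: no F/Y"     # lacks the aromatic residue
        else:
            labels[name] = "putative HHDH"
    return labels


# Toy alignment invented for illustration (all rows have equal length).
toy_alignment = {
    "toy_hhdh": "TNF-GSAYAAR",   # reference with F, S, Y, R at the key sites
    "toy_sdr": "TGGAGSAYAAK",
    "toy_novel": "TQYLGSGYPAR",
    "toy_no_aromatic": "TGAAGSAYAAR",
    "toy_other": "TGG-GTAFAAQ",
}
# Hypothetical residue numbers in the ungapped toy reference.
toy_sites = {"aromatic": 3, "ser": 5, "tyr": 7, "third": 10}

labels = classify(toy_alignment, "toy_hhdh", toy_sites)
for name, label in labels.items():
    print(name, "->", label)
```

Anchoring the columns to a known reference mirrors the role of the known HHDHs in the alignments: if the reference residues are misaligned, the column lookup is no longer meaningful.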

Figure 2. Flow scheme for the in silico identification of novel HHDHs. Using blastp, homologous protein sequences were collected and aligned by MAFFT to identify novel HHDH sequences by sequentially removing sequences which possessed the typical SDR catalytic triad of Ser-Tyr-Lys (“S-Y-K”), which lacked the conserved HHDH catalytic triad of Ser-Tyr-Arg (no “S-Y-R”), and which did not possess the specific aromatic Phe or Tyr (no “F/Y”). Through iteration, all of the novel HHDH sequences (Table 1) were identified.

From the GenBank collection of non-redundant (nr) sequences, 35,448 unique sequences were obtained using HheA, HheB and HheC as blastp queries. Of these, 23 sequences contained the catalytic triad of known HHDHs, but only nine sequences also possessed the aromatic Phe or Tyr specific for known HHDH enzymes. Sequences not considered to be putative novel HHDHs were, for example, too short to contain the conserved aromatic Phe or Tyr of known HHDHs or possessed a variation of the Gly-rich T-G-x(3)-[GA]-x-G motif required for nucleotide binding in SDR enzymes. The nine putative novel HHDH sequences originated from Parvibaculum lavamentivorans DS-1 (HheA3), Arthrobacter sp. JBH1 (HheA4), Tistrella mobilis KA081020-065 (HheA5), Dechloromonas aromatica RCB (HheD), the marine gamma proteobacterium HTCC2207 (HheD2), Methylibium petroleiphilum PM1 (HheD3), Thauera sp. MZ1T (HheD5), the gamma proteobacterium IMCC3088 (HheE5), as well as from an uncultured bacterium (HheF) (Table 1). Using these putative novel HHDH sequences as queries for subsequent blastp searches, another 37,469 unique sequences were collected from the nr database. These contained one additional putative novel HHDH from Ilumatobacter coccineus YM16-304 (HheG) (Table 1) with both the conserved HHDH catalytic triad and aromatic Phe or Tyr. Using the latter sequence as query for a following blastp search, another 2,035 unique sequences were retrieved, but no additional sequence could be identified that contained the correct Ser-Tyr-Arg catalytic triad of known HHDHs in combination with the HHDH-specific aromatic Phe or Tyr.
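The iteration described above, in which newly found candidates become the queries of the next search round, can be sketched as a simple loop. This is an illustrative sketch only: `iterative_mining`, the `search` callback (a stand-in for a blastp run) and the `is_hhdh` callback (a stand-in for the alignment-based filters) are hypothetical names, and the toy hit lists are invented.

```python
from typing import Callable, Iterable, List, Set, Tuple


def iterative_mining(
    seed_queries: Iterable[str],
    search: Callable[[str], Iterable[str]],
    is_hhdh: Callable[[str], bool],
) -> Tuple[Set[str], int]:
    """Repeat search rounds until no new candidate passes the filter.

    Returns the set of accepted sequences (seeds included) and the
    number of search rounds that were run.
    """
    known = set(seed_queries)
    queries: List[str] = list(known)
    rounds = 0
    while queries:
        rounds += 1
        hits: Set[str] = set()
        for query in queries:
            hits.update(search(query))
        new = sorted(hit for hit in hits if hit not in known and is_hhdh(hit))
        known.update(new)
        queries = new  # only new candidates are used as queries next round
    return known, rounds


# Invented toy data: which identifiers each query retrieves.
toy_hits = {
    "seed1": ["cand1", "sdr1"],
    "seed2": ["sdr2"],
    "seed3": ["cand1"],
    "cand1": ["cand2", "sdr3"],
    "cand2": ["sdr4"],
}

known, rounds = iterative_mining(
    ["seed1", "seed2", "seed3"],
    search=lambda query: toy_hits.get(query, []),
    is_hhdh=lambda hit: not hit.startswith("sdr"),
)
print(sorted(known), rounds)
```

In the toy data the loop stops after a third round that yields no new candidate, which mirrors the three successive nr searches reported in the text.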

Table 1. Sources and accession numbers of previously known (*) and novel HHDHs

HHDH     Organism or source                        Accession
HheA*    Corynebacterium sp. N-1074                BAA14361
HheA2*   Arthrobacter sp. AD2                      AAK92100
HheA3    Parvibaculum lavamentivorans DS-1         ABS64560
HheA4    Arthrobacter sp. JBH1                     AFI98638
HheA5    Tistrella mobilis KA081020-065            AFK51877
HheB*    Corynebacterium sp. N-1074                BAA14362
HheB2*   Mycobacterium sp. GP1                     AAK73175
HheB3    marine metagenome (Ralstonia) a)          EBL02020
HheB4    marine metagenome (Shewanella) a)         EBP61646
HheB5    marine metagenome (Burkholderia) a)       ECR06649
HheB6    marine metagenome (Sorangium) a)          EDB56284
HheB7    marine metagenome (Bradyrhizobium) a)     EDD65701
HheC*    Agrobacterium radiobacter AD1             AAK92099
HheD     Dechloromonas aromatica RCB               AAZ44846
HheD2    gamma proteobacterium HTCC2207            EAS46473
HheD3    Methylibium petroleiphilum PM1            ABM93639
HheD4    marine metagenome (Haliangium) a)         ECY18578
HheD5    Thauera sp. MZ1T                          YP_002355872
HheE     marine metagenome (Acaryochloris) a)      EBP63112
HheE2    marine metagenome (Sorangium) a)          ECW41905
HheE3    marine metagenome (Burkholderia) a)       EDF62577
HheE4    marine metagenome (Catenulispora) a)      EDH34310
HheE5    gamma proteobacterium IMCC3088            EGG28524
HheF     uncultured bacterium                      BAH89601
HheG     Ilumatobacter coccineus YM16-304          BAN03849

a) Taxonomic classification according to the NBC webserver.

For the putative novel HHDH sequences obtained, no information on their activity can be retrieved from the associated GenBank records. Except for HheA4, sequence identities to known HHDHs range between 32% and 48%. Although sequence HheA4 from Arthrobacter sp. JBH1 is annotated as “3-oxoacyl-acyl-carrier-protein”, it very likely represents a true HHDH enzyme since it is identical to HheA in 242 of 244 residues. For two further sequences, additional information that can be retrieved from the NCBI website indicates that these enzymes might be HHDHs. Sequences HheA3 from P. lavamentivorans and HheA5 from T. mobilis exhibit only low sequence identities of 33% and 38% to HheA and HheA2, respectively. Nevertheless, HheA3 belongs to the “haloalcohol dehalogenase, classical (c) SDRs” cluster cd05361 of the conserved domain database [207]. This cluster is part of the Rossmann-fold NAD(P)(+)-binding protein superfamily, and HheA3 is the threshold-setting representative for this cluster, which also comprises the sequences of HheA, HheA2, and HheC as well as associated crystal structures. Owing to its high sequence identity to HheA, HheA4 also belongs to the specific hit list of an RPS-BLAST search [208, 209] for the conserved domain of cluster cd05361. In contrast, HheA5 from T. mobilis is not found among the specific hits of an RPS-BLAST search but has been annotated as “halohydrin epoxidase A”, most likely arising from the submitting authors’ gene annotation algorithms [210]. The remaining sequences are mostly annotated as “SDR enzyme” or “” but never as HHDH enzyme.

As no further HHDH sequences could be retrieved from the nr database, the GenBank collection of non-redundant sequences from environmental sources (env_nr) was also surveyed for the presence of putative novel HHDH sequences. All known and putative novel HHDH sequences were used as blastp queries to collect 27,907 unique environmental sequences, but the alignment of these sequences was more challenging than before. The majority of env_nr sequences were derived from shotgun sequencing projects of environmental samples such as the Global Ocean Sampling (GOS) studies [211, 212]. Usually, these short sequencing reads were assembled into larger contigs, which could cause some of the annotated open reading frames to extend beyond the sequenced and assembled boundaries. In consequence, the env_nr database contains a higher proportion of N- or C-terminally truncated proteins than the nr database, which is also indicated by a 60% shorter average protein length of GOS protein sequences [211]. Apparently, during MSA construction, this higher proportion of truncated proteins caused misalignment of the catalytic triad residues of the known HHDH sequences, which were always included as an indicator of alignment accuracy and reliability. To circumvent this, sequences shorter than 180 residues were removed from the original environmental sequence set; this value corresponds to 80% of the sequence length of the shortest putative novel HHDH, HheE5.
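The length filter can be sketched in a few lines of Python. The function names `length_cutoff` and `drop_truncated` and the toy sequences are hypothetical. The reference lengths 254 (HheC) and 244 (HheA) are taken from the text, whereas the value 225 for the shortest reference is an assumption chosen only because 80% of it gives the stated 180-residue cutoff.

```python
from typing import Dict, Iterable


def length_cutoff(reference_lengths: Iterable[int], percent: int = 80) -> int:
    """Cutoff as a percentage of the shortest reference length (integer arithmetic)."""
    return (percent * min(reference_lengths)) // 100


def drop_truncated(sequences: Dict[str, str], cutoff: int) -> Dict[str, str]:
    """Keep only sequences that are at least `cutoff` residues long."""
    return {name: seq for name, seq in sequences.items() if len(seq) >= cutoff}


# 254 and 244 appear in the text; 225 is an assumed value (see lead-in).
cutoff = length_cutoff([254, 244, 225])

# Toy sequences invented for illustration; only their lengths matter here.
toy_sequences = {
    "full_length": "A" * 230,
    "borderline": "A" * 180,
    "truncated": "A" * 120,
}
kept = drop_truncated(toy_sequences, cutoff)
print(cutoff, sorted(kept))
```

Sequences exactly at the cutoff are kept here, since the text removes only sequences shorter than 180 residues.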
With this measure, the catalytic triad residues of the known HHDHs could be correctly aligned, and 40 sequences with an HHDH Ser-Tyr-Arg catalytic triad were identified among the remaining 19,068 sequences. Of these, 10 complete sequences (HheB3 through HheB7, HheD4, HheE through HheE4) (Table 1) also possessed the HHDH-specific aromatic Phe or Tyr. According to the taxonomic classification of the NBC webserver, each of the metagenomic nucleotide contigs that harbored a novel HHDH originated from bacterial strains, of which eight were proteobacteria. Sequence identities with the known HHDH enzymes varied between 37% and 60% and, again, no GenBank record indicated HHDH activity, as all these sequences were annotated as “hypothetical protein”. In summary, a total of 20 novel HHDH sequences were initially identified from more than 100,000 protein sequences present in the GenBank nr and env_nr databases based on conserved HHDH-specific features (Figure 3).
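The totals in this summary can be cross-checked against the numbers reported for the individual searches. The short Python tally below is only a consistency check; it assumes that the per-search counts can simply be added, although the sets retrieved in different rounds may overlap, and the variable names are chosen for this illustration.

```python
# Unique sequences collected per blastp round, as reported in the text.
nr_rounds = [35_448, 37_469, 2_035]
env_nr_collected = 27_907
env_nr_after_length_filter = 19_068

# Novel HHDH sequences reported per search stage.
novel_per_stage = {"nr, first round": 9, "nr, second round": 1, "env_nr": 10}

total_screened = sum(nr_rounds) + env_nr_collected
total_novel = sum(novel_per_stage.values())
removed_by_length = env_nr_collected - env_nr_after_length_filter

print(total_screened)     # sequences screened across all searches
print(total_novel)        # novel HHDH sequences in total
print(removed_by_length)  # env_nr sequences shorter than the cutoff
```

Under this assumption the sums agree with the stated figures of more than 100,000 screened sequences and 20 novel HHDH sequences.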

,& *    

10 20 30 40 50 60 70 80 90 100 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| HheA* 1 ------MKIALV THARH FA GPAAV EAL TRDGYT VV CH------DATFA DAA ER-QRFESENPGT VA 53 HheA2* 1 ------MVIALV THARH FA GPAAV EAL TQDGYT VV CH------DASFA DAA ER-QRFESENPGT IA 53 HheA3 1 M ------ARSILI TDIA HFV GG PSARALLA EGARIYG V------DASFA DAAA R-AAF ETKIP GVKA 54 HheA4 1 ------MKIALV THARH FA GPAAV EAL TRDGYT VV CH------DASFA DAA ER-QRFESENPGT VA 53 HheA5 1 ------MPV T------DTAP RVALV TNATKYAGAP TVAALA SQGWQIIA H------DASFTDVAA R-AAW EADNQ GMTA 60 HheA6 1 ------M------TDKR IAIL TDATHFV GDAIA TRLST DGY QVFAV ------DPAF SDAGKR -TAF ESLGDGVVA 56 HheA7 1 ------M------LNN KIILV TDATHFL GKPGITALV RDGATVFA Q------DARFTDKQAR-DAF EE LIP GVTA 56 HheA8 1 M ------LEE AM S------ELA DTIVLI TDLEHFV GRPSAKALL EAGATVYGT ------DPAFA DANMR-SAA EAALP GLKT 63 HheA9 1 M ------PRTVLI TDVTRFI GIP GT KALL EE GY KVF GT ------DPDFTDD SKR -SAYEKACPGATA 54 HheB* 1 M ------ANGRKR EMA NG------RLA GKR VLL TNADAYMGEATVQVF EEE GAEVIA D------HTDLTKVGA------55 HheB2* 1 M ------ANGRKR EMA NG------RLA GKR VLL TNADAYMGEATVQVF EEE GAEVFA D------HTDLTKVGA------55 HheB3 1 ------MSK------RLEGKR VLV TQSNN YMGPA TVELF EKEGAIV TAD------SS DLTETEK------46 HheB4 1 ------MSG ------RIEGKR VLV TQAA DYMGPA TAELF TAEGAQVTT D------TS DLTQPGR------46 HheB5 1 ------MG------RLDGKR VLV TQADD YMGPV TLEVFA EE GAEVIA D------NSDLTDPSR------45 HheB6 1 ------MDK------RLA GKR VLI TQAED YMGPAI TELFA EHGAEIIA D------TRDLTED GA------46 HheB7 1 ------MGD------RLA GKR VLV TQADNYMGPA TIELF TEE GAEVLA D------HSDLTVA GR------46 HheC* 1 ------MST AIV TNVKH FGG MGS AL RLSEAGHTVA CH------DE SFKQKDE L-EAFA ETY PQLKP 53 HheD 1 ------MSNQ ------SLV GKR VLI TQADMFM GPVL CEVFA RH GATVIA N------TDALLAP DA------47 HheD2 1 ------MNN S------QLA GKR ILV TQADTFM GPTLCEVFA EMGAEVIA D------NN LL TDPAL ------47 HheD3 1 ------MTG NP---- LSLSG RR ALI TQADAFM GPAL CEVFAA HGADVIA D------TS PLA DADA------49 HheD4 1 ------MSS 
N------DLKGKR ILV TQADTFM GPTLCEVF TEKGAEVI RD------NQ LL TEPTL------47 HheD5 1 ------MHAN------SLSG RR VLV TQADAFM GPAL CEAF RAA GAEVVP D------QSALL ERGA------47 HheD6 1 M ------TATHPSQSS RA------LL QDKR ILV TQAED FM GPAL CRTLA SHGAEVV ED ------TLPLMP TG A------55 HheD7 1 ------MEN------ALA GKR VLI TQADAFM GPVL CEVFA EQGAEVVA S------ADD LAVV DA------46 HheD8 1 ------MHAI ------SLSG RR VLV TQADAFM GPAL CDAF RAA GAEVVP D------RSALL ERGA------47 HheD9 1 ------MNGI------SLV GRR VLV TQAED FM GPAL CAAFAAA GAELIA D------RSAP RQPGS ------47 HheD10 1 ------MI ------DLRGQRILV TQAQDFM GPAL CQELRACG AEVIA D------DRVL TAP QD------45 HheD11 1 ------M---- INLV GKR ILV TQANAFM GPAL CEVL TEYG ADVIA S------ED SLI DATA------45 HheD12 1 ------MTPA ------DLTG KR ILL TQADAFM GPAL HNML SRCG AQVIA D------TGT LDTREA------47 HheD13 1 ------MQ------HLTDKR ILV TQADAFM GPAL CATLA DHGAHVI SD------TRVL TLP HD------45 HheD14 1 ------MNSA------QLA GKR ILV TQADTFM GPTLCEVF TEMGAEVIA D------NN LLI DPSL------47 HheD15 1 ------MP DT------LL SG KR VLI THADLFM GPVL CEVFA KH GATVIA S------NDPL TG ED T------47 HheD16 1 ------MTG NP---- LTLSG RR ALI TQADAFM GPAL CEVFAA HGADVIA D------TS PLA DADA------49 HheD17 1 ------MSS Q------RLA GLRVLI TQANEFM GPTLCEVFA EQGAVVLA D------DGPL TDPQA------47 HheD18 1 ------MSS Q------CLA GLRVLI TQANEFM GPTLCEVFA EQGAVVLA D------DGPL TDPQA------47 HheE 1 ------MKQRTVLV TC VDKYMGRAIV DRLTELDFRVL TD------TQALV EQSQ------42 HheE2 1 ------MGD------RLTG KR VLV THADRYMGAPVA ERFRAEGAEVIA D------TS VP RSAA E------46 HheE3 1 ------M------RLENKR VMV TQSDD YMGPAI TS LF ST EGAQVTT R------EKPVP TG KA------44 HheE4 1 ------MT------RLEHKK VLI TQSDD YMGPAIA DLFAA EAA RVTAR------PGLVPF GT Q------45 HheE5 1 ------MS------QFSG KSVWV TS ADRYMGPSIA DE FERLGAIV TRD------MHVL YDD HY------45 HheF 1 M ------TEQPQKNGY ------GLSG KR VVI TQAA GFM GPSLV EAF SREGAEVIP D------HR DLTHDKA------53 HheG 1 ------MSN------AENRPVALI TMA TGY VGPALA RTMA DRGFDLVL HGT AGDGT MV GVEE SFDSQIA DLA KR -GADVL TISDVDL 
74 DHRS4 1 M HK AGLL GLCARAW NSVRMA SSG MTRR DPLA NKVALV TAST DGIGFAIA RR LA QDGAHVVV S------SRK QQN VDQAVA TLQGEGLSVTG 85

110 120 130 140 150 160 170 180 190 200 ....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....|....| HheA* 54 L ----- AEQKPERLV DATLQHGEAI D-- TIV SNDYIP RPM NRLPI EGTS EADIRQVF EAL SIFPILLL QSAIAPL RAA GG ASVIFI TSS VGKK PLA YNPL 146 HheA2* 54 L ----- AEQKPERLV DATLQHGEAI D-- TIV SNDYIP RPM NRLPL EGTS EADIRQMF EAL SIFPILLL QSAIAPL RAA GG ASVIFI TSS VGKK PLA YNPL 146 HheA3 55 L ----- SAQDPREAVAAVL EAEGRLD-- VLI NN DAWPAM RG-- PV DE ATDKDLHETFEALVF KSFAM TRAAVP QMKK QRAGKILFL SS AAPL NGIP NYS I 145 HheA4 54 L ----- AEQKPERLV DATLQHGEAI D-- TIV SNDYIP RPM NRLPI EGTS EADIRQMF EAL SIFPILLL QSAIAPL RAA GG ASVIFI TSS VGKK PLA YNPL 146 HheA5 61 A ----- EAQDPA GLIA EVRDRMGG LH-- GIV SNDAYPAI RR -- RIEE TEAEAL REML EAL TVFPFALA SAV TPHLKAQGAGAIVMV TS ASPRR PYPGFAM 151 HheA6 57 L ----- THTS AGDVI EHVM SEAGHID-- LLA SNDAYPAI RA-- PL TDITS DAL RDTLEALVV KPF DFA SHVA GHMKARK QGKIVFL TS AAPL NGLP NYAM 147 HheA7 57 L ----- AEQVP SEVI DVVL SKAGQLD-- VLV NN DAFPAI KA-- AI DE AELSDFTDTLNALLV RGFDYAKH ASKH MKGRGCG KILFV SS AVP KH GLP NYS M 147 HheA8 64 V ----- SGS DPVVAA GRVL EE SG RID-- VLI NN DAYPAL RA-- PL DTAKDED LEATY EALVF KPF RVTRAIVP SMKSAGGG KVLFL TS AAPL NGLA NYS M 154 HheA9 55 L ----- LP DSS DNLV KATVEAGGT LD-- VLI NN DAYPA HR A-- SVEE ASDE LL RK TFDMLFF KAYAMA RAAVP QMKK QGSG KIIF NTS AAPL NGLRNYS V 145 HheB* 55 ------AEE VV ERAGHID-- VLVA NFAV DAHFG-VTVL ETDEE LW QTAYETIV HPL HR ICRAVLP QFYERNKGKIVV YGS AAAM RYQEGALA 138 HheB2* 55 ------AEE VV ERAGHID-- VLVA NFAV DAHFG-VTVL ETDEE LW QTAYETIV HPL HR ICRAVLP QFYERNKGKIVV YGS ATAM RCHEGALA 138 HheB3 46 ------CQDIV ESAGII D-- ILIA NLA SENFSG -IP TS DLTDED WNTT FDVMV HPL HK LSRAVIP QMV ERQAGKIIV YGS ASAL KGMRTLAA 129 HheB4 46 ------CEALI ESCG EID-- ILVA NLA SPNFSG -IA TAELSDDD WQTAF DMMV HPL HR LCRAVLP QMI ERKK GKIVVF GS AAAL KGMKTLA T 129 HheB5 45 ------AAALI EE TG HID-- VLVA NLAAPA NLG-VAAA DMP DD IW QTMF DVMV HPL HR LTKAVLP QMV ERQKGKILV YGS ATG VKGMA GITA 128 HheB6 46 ------VESLI 
KSAGEID-- VLIA NLAAPA HLG-LSVTDTDD TT WETAF DVMV HPL HR IF RSVLPAM YERKR GKIVVI GS ATG LKAL EGVVA 129 HheB7 46 ------CS ALI KEHGRID-- VLVA NLA SPNFTG -IPA TELA DED WHCT FNMMV HPL HQLCQAVVP QMI ERQAGKIVVF GS ATAL KGMP TVTA 129 HheC* 54 M ----- SEQEPA ELI EAV TS AYG QVD-- VLV SNDIFAP EFQ-- PI DKYAV ED YRGAV EAL QIRPFALV NAVA SQMKKRK SG HIIFI TS ATPF GPW KELST 144 HheD 47 ------PA TVVA QAGQID-- ALVA NLAMPAP TT -- AA TEVSDEE WRDTFAILV DPLA RLL RAALPAMI ERR SG KILVM GS ASAL RGMKR AST 129 HheD2 47 ------PA KII QQ AGHID-- VLVI NLAIPAPF T-- KGELV DD SEWSATFSAVV DPMP RLCT AVLP QMI ERQGG KILVM GS ASAL RGMKR AST 129 HheD3 49 ------PA RVVAAA GVV D-- LLVL NLAIPAP RT-- SAVA SS DAEWAAVF GALV DPLP RLL RAVLP QMIA RR AGRVVLM GS AAAL RGMKNSST 131 HheD4 47 ------PA QLVAAA GS ID-- VLVI NLAIPAPF T-- KLEKVDD NEWESVF SAVV NPMP RLV SAVLP QMI ERQSG KILVM GS ASAL RGMKR AST 129 HheD5 47 ------GRAVI EAA GRID-- VLVL NLAIPAP ST -- PV HQVSDGEWETT FAALV HPM REMVAAVLP QMI ERK AGKILLM GS AAAL RGMAL RSS 129 HheD6 55 ------AI EVI RNAGDVD-- VLVA NLAI EAP ST -- QAGEVTEDE WRSVF SALV DPLP RLV KAVLP GMI ERGKGKILVM GS ASAL RGMKR AST 137 HheD7 46 ------AERVV RAA GHID-- VVVA NLAI KAP ST -- AAV EVTDAEWRDVFAALV DPLP RLV RAAAPAMV ERR AGKILLM GS ASAL RGMKR AST 128 HheD8 47 ------GRAVI EAA GRID-- VLVL NLAIPAP ST -- PV HQVSGG EWETT FAALV HPM REMVAAVLP QMI ERK AGKILLM GS AAAL RGMAL RSS 129 HheD9 47 ------GRAVI ESAGHID-- ALVL NLAIPAP ST -- PA HQVSDDE WEAAFAALV HPM REMVAAVLP QMI ERK AGKIVLM GS ASAL RGMA RR SS 129 HheD10 45 ------AQAMI KEAGPLA -- ALVI NLALPAP ST -- PV TQIDE AEWHQVF EVMV DPLP RLV RAVVP GMKARGGG KIVVM GS ASAL RGMKR AA S 127 HheD11 45 ------PA QLIA GVGRID-- VLVA NLALAAP ST -- VA EQVSDEE WDTVFA SLV NPLP RLCRAILPAM QARR SG KILVM GS ASAL RGMKR AST 127 HheD12 47 ------VDQLV SQSG HID-- VLVA NLGVPAP ST -- AAV QVSDDE WRLMF THMV DPL QQ LTRAILPAMI ERQCG KILLM GS ASAL RGIKR ASS 129 HheD13 45 ------SQALI DSAGPL D-- VLVI NLAMAAP ST -- PVA DIQDDE WRQVF EVMV DPLP RLL RAVVP GMKARGGG KVIVM GS AAAL RGMKR TAS 127 HheD14 47 
------PA KII QRAGRID-- VLVI NLAIPAPF T-- KGELV DD TEWSATFRAVV DPMP RLCT AVLP QMI ERQGG KILVM GS ASAL RGMKR AST 129 HheD15 47 ------PAALVA ESG PL D-- ALVA NLAIAAP TT -- LTT EVTEQEWQETFGALV HPLA RLCRAVLP GMI EKR AGKIVVM GS ASAL RGIKR TST 129 HheD16 49 ------PA RVVAAA GVV D-- LLVL NLAIPAP RT-- SAVA SS DAEWAAVF GALV DPLP RLL RAVLP QMIA RR AGRVVLM GS AAAL RGMKNSST 131 HheD17 47 ------PA RLVA GHGPI D-- VLVA NLAVPAP ST -- PA HQVSEQEWRDTFAALV DPLP RLCQAVLP DMMA RR SG RILVM GS AAAL RGMKR TST 129 HheD18 47 ------PA RLVA GHGPI D-- VLVA NLAVPAP ST -- PA HQVSEQEWRDTFAALV DPLP RLCQAVLPAMMA RR SG RILVM GS AAAL RGMKR SST 129 HheE 42 ------CEE LV RSVGEVD-- ILIA NLA EPP RSS -- PV QAI QN ED WTT LF ST LV DPLMFLV RAI TPQML DRQSG KIIAV TS AAPL KGLA NN AS 124 HheE2 46 ------GAAIA EAA GAI D-- VLFA NLAWPP TPA -- LV TDTS DED WHALF DVLV HPLM GLV RAAA KTMKGAGGG RII GMTS AAPL RGIP RNSA 128 HheE3 44 ------FST WV REMPV YD-- VVVA NLA HDPCSS -- AV DNIDNED WQALF ETLV HPLM YLV RH FAP KMA ERGYG KIIAI TS AAPL RGIP GST A 126 HheE4 45 ------FA RYVQGLP DFD-- VVIA NLA HDPCNG-- PI ETLA DE SWEKLF DTMV HPLMALV RH FAP RMA DQGHGKIIAI TS AAPL KGLP GS AA 127 HheE5 45 ------LRETLA EIP -- D-- IVIA NLA EPP RK D-- AL EAI QDDD WNLLF DHLV HPLM RIV RH VSG PM KARGHGKIVAI TS AAPL RGIPFA SG 125 HheF 53 ------ADNLV SEFKEID-- ILLI NLA SQRQRI-- EATEISDQQ FL QPF EE MV YPLF RLGRSVLP QMIA RRR GKIIVI GS AAPL RPFA NATG 135 HheG 75 T----- TRTG NQ SMI ERVL ERFGRLDSACLV TG LIV T--- G-- KFL DMTDD QWA KVKATNLDMVF HGLQAVLPPMVAA GAGQCVVF TS ATGG RPDPMV SI 164 DHRS4 86 TVCHVGKAED RERLVA TAV KLHGG ID-- ILV SNAAV NPFF G-- SIM DVTEE VW DKTLDINVKAPALM TKAVVP EMEKR GGGS VVIV SS IAAF SPSPGFSP 181

,. (  

[Figure 3: sequence alignment panel not reproduced.]

Figure 3. MAFFT MSA of known (*) and novel HHDHs together with the experimentally verified homologous SDR sequence DHRS4. Residues corresponding to the conserved T-x(4)-[FY]-x-G and S-x(12)-Y-x(3)-R motifs are highlighted (gray).


Experimental verification of HHDH activity

In order to investigate whether the identified sequences indeed represent enzymes with true HHDH activity, codon-optimized synthetic genes were ordered for heterologous expression of the respective proteins in E. coli. Prior to gene synthesis, all coding sequences were inspected for alternative translation start signals since nine of the 20 sequences were annotated with start codons different from the commonly observed ATG (Figure 4). First, the sequences were inspected for the presence of an alternative standard ATG start codon downstream of the annotated translation start. Then, and only if the shorter alternative gene product also contained the conserved aromatic Phe or Tyr of known HHDHs, this curated sequence was used for further analysis. In addition, sequences immediately upstream of each curated ATG start codon were inspected for overall similarity to the Shine-Dalgarno (SD) sequence TAAGGAGGTGA of Escherichia coli, which serves as the ribosomal binding site. In all nine coding sequences with a non-standard start codon, an intact shortened gene product starting from a standard ATG start codon could be identified (Figure 4). In each case, the curated ATG start codon was preceded by a sequence with at least weak similarity to the E. coli SD sequence.

Except for HheA4, which likely possesses HHDH activity due to its exceptionally high sequence identity with HheA, synthetic genes coding for the remaining 19 putative HHDHs were ordered and cloned into the well-established pET-28a expression vector. Heterologous expression of soluble enzyme was optimized for each HHDH by varying parameters such as the expression host [E. coli BL21(DE3) or C43(DE3)] and the expression temperature (20, 30, or 37°C). For most HHDHs, the optimized expression conditions (Table 2) yielded bands of soluble enzyme that were visible after Coomassie staining of polyacrylamide gels and absent from empty vector controls (Figure 5). Since cloning into pET-28a resulted in the N-terminal addition of a His-tag to the respective HHDH, the same bands were also detected in Western blots using a His-tag specific Ni-NTA horseradish peroxidase conjugate (Figure 5).
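The start codon curation described above can be illustrated with a minimal sketch. It is not the procedure used in this study, which relied on manual inspection. The function names, the 20 bp upstream window, the ungapped identity score, and the toy sequence are all assumptions made for illustration. The check for the conserved Phe or Tyr is reduced to requiring that a candidate start lies upstream of a user-supplied codon position.

```python
SD_CONSENSUS = "TAAGGAGGTGA"  # E. coli Shine-Dalgarno sequence quoted in the text
STOP_CODONS = {"TAA", "TAG", "TGA"}


def sd_identity(upstream, consensus=SD_CONSENSUS):
    """Return the highest number of identical positions between the consensus
    and any ungapped placement of it along `upstream` (hypothetical score)."""
    upstream = upstream.upper()
    best = 0
    for offset in range(-(len(consensus) - 1), len(upstream)):
        matches = sum(
            1
            for i, base in enumerate(consensus)
            if 0 <= offset + i < len(upstream) and upstream[offset + i] == base
        )
        best = max(best, matches)
    return best


def curated_start_candidates(dna, annotated_start, conserved_codon_pos, window=20):
    """List (position, SD score) for in-frame ATG codons downstream of the
    annotated start that still precede the codon of the conserved Phe/Tyr.

    `conserved_codon_pos` is the 0-based nucleotide index of that codon and
    must be supplied by the user (assumption: it is known from the alignment).
    """
    dna = dna.upper()
    candidates = []
    for pos in range(annotated_start + 3, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        # Stop scanning at a stop codon or once the conserved residue is passed.
        if codon in STOP_CODONS or pos > conserved_codon_pos:
            break
        if codon == "ATG":
            upstream = dna[max(0, pos - window):pos]
            candidates.append((pos, sd_identity(upstream)))
    return candidates


# Toy example (invented sequence): GTG start, SD-like stretch, in-frame ATG, Phe codon.
toy = "GTGGCTAAGGAGGTGATTATGACCTTTGGCTAA"
print(curated_start_candidates(toy, annotated_start=0, conserved_codon_pos=24))
```

Each candidate is reported with its score rather than filtered by a fixed cut-off, because the text only requires "at least weak" similarity to the SD sequence and gives no threshold.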

Table 2. Conditions for the heterologous expression of HHDHs.

HHDH    Expression strain     Conditions
HheA2   E. coli Top10         7 h at 37°C
HheA3   E. coli BL21(DE3)     7 h at 37°C
HheA5   E. coli BL21(DE3)     7 h at 37°C
HheB2   E. coli Top10         7 h at 37°C
HheB3   E. coli C43(DE3)      24 h at 30°C
HheB4   E. coli BL21(DE3)     7 h at 30°C
HheB5   E. coli BL21(DE3)     7 h at 37°C
HheB6   E. coli BL21(DE3)     24 h at 20°C
HheB7   E. coli BL21(DE3)     7 h at 37°C
HheC    E. coli Top10         7 h at 37°C
HheD    E. coli BL21(DE3)     7 h at 37°C
HheD2   E. coli BL21(DE3)     7 h at 37°C
HheD3   E. coli BL21(DE3)     7 h at 37°C
HheD4   E. coli BL21(DE3)     24 h at 20°C
HheD5   E. coli BL21(DE3)     7 h at 37°C
HheE    E. coli BL21(DE3)     7 h at 30°C
HheE2   E. coli C43(DE3)      24 h at 30°C
HheE3   E. coli BL21(DE3)     7 h at 30°C
HheE4   E. coli BL21(DE3)     7 h at 37°C
HheE5   E. coli BL21(DE3)     7 h at 37°C
HheF    E. coli BL21(DE3)     7 h at 37°C
HheG    E. coli BL21(DE3)     24 h at 20°C


[Figure 4: nucleotide sequence panel not reproduced.]

Figure 4. Annotated as well as curated translation starts of the novel HHDHs' coding sequences. For each HHDH, 30 bp upstream of the annotated translation start (+1) as well as the first 120 bp of the annotated coding sequence (gray) are shown. The manually curated translation start (capital nucleotide letters) is also indicated by the protein sequence above (boxed). Immediately upstream of the curated translation start, nucleotides identical with the Escherichia coli Shine-Dalgarno sequence TAAGGAGGTGA are highlighted (*).


Figure 5. SDS-PAGE of cell-free extracts A) after Coomassie Brilliant Blue staining as well as B) Western blots. In B), recombinantly expressed HHDHs were detected using the His-tag specific Ni-NTA horseradish peroxidase conjugate (Qiagen) together with the Pierce enhanced chemiluminescent Western blotting substrate (Thermo Fisher Scientific).

To assess whether these recombinant enzymes possess true HHDH activity, a colorimetric activity assay for halide release from the chloroalcohols 1,3-dichloro-2-propanol (1) and 2-chlorophenylethanol (2) as well as the bromoalcohol 1,3-dibromo-2-propanol (3) was performed with cell-free extract (CFE) of cells containing the recombinantly expressed enzymes [177]. Afterwards, gas chromatography analysis was used to confirm the formation of the corresponding epoxides. Using this approach, halide release as well as epoxide formation could be detected with at least one of the three substrates for all but one of the novel HHDHs. Only HheB3 did not exhibit any activity in the performed assays. However, this enzyme was also barely expressed in soluble form in E. coli as judged by SDS-PAGE and Western blot analysis (Figure 5).

Figure 6 summarizes the specific activities of the different HHDH-containing CFEs as calculated from the amount of formed epoxide. CFEs containing the known HHDHs HheA2, HheB2, and HheC were included in the measurements as positive controls. For conversions with bromoalcohol 3, rather high background activities were observed in empty vector controls. Such higher rates have been described before for the uncatalyzed epoxide formation from bromoalcohols [35, 107]. In contrast, control reactions with chloroalcohols 1 and 2 showed only marginal epoxide formation (<5%). Therefore, only results for the conversion of chloroalcohols 1 and 2 are given in Figure 6. Nonetheless, again with the exception of HheB3, all HHDHs showed significantly higher epoxide formation in the conversion of 3 than the control reactions (data not shown).
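As an illustration of how a background-corrected specific activity can be derived from the amount of formed epoxide, consider the following minimal sketch. The function name, the argument names, and the unit (µmol of epoxide per minute per mg of total protein) are assumptions, since the text does not state the units or the exact calculation that was used.

```python
def specific_activity(epoxide_umol, control_epoxide_umol, minutes, total_protein_mg):
    """Background-corrected specific activity in µmol min^-1 mg^-1 (assumed unit).

    epoxide_umol          epoxide formed with the HHDH-containing CFE
    control_epoxide_umol  epoxide formed with the empty vector control CFE
    minutes               reaction time
    total_protein_mg      total protein of the CFE used in the reaction
    """
    if minutes <= 0 or total_protein_mg <= 0:
        raise ValueError("reaction time and protein amount must be positive")
    # Deduct the background formation and never report a negative activity.
    net_epoxide = max(epoxide_umol - control_epoxide_umol, 0.0)
    return net_epoxide / (minutes * total_protein_mg)


# Hypothetical numbers, for illustration only.
print(specific_activity(3.0, 1.0, minutes=10.0, total_protein_mg=0.5))
```

Because the denominator is total protein rather than the amount of HHDH, two CFEs with identical enzymes but different expression levels give different values, which is why such numbers cannot be compared between enzymes.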


Figure 6. Specific activities of CFE from recombinant HHDH expression. Activities are shown for the formation of epoxides 4 and 5 from chloroalcohols 1 (black) and 2 (white), respectively, after deduction of background activities from empty vector controls. For reference, activity data are also shown for CFEs from the expression of previously known HHDHs (*).

It should be noted that the reported specific activities are based on the total protein of the CFEs used and are thus not comparable between individual enzymes, as these values are largely affected by differences in protein expression. Similarly, conversion results from independent expression cultures per HHDH varied in absolute values due to differences in total protein content. Nevertheless, repetition of activity measurements for at least two different CFEs per HHDH confirmed the representativeness of the obtained activity data. Furthermore, the applied reaction conditions were likely not optimal for each enzyme as the pH and temperature optima of the individual HHDHs are not yet known. Instead, reaction conditions (pH 7.5 and 30°C) were chosen that have been used consistently in the past to report HHDH activities. In consequence, higher activities might be observed after optimizing the reaction conditions for each individual enzyme. Nevertheless, true HHDH activity could be confirmed for 18 out of the 19 tested HHDHs.

Fingerprinting for exclusive recovery of HHDH sequences

After we confirmed that our approach specifically identifies sequences which exhibit true HHDH activity, we further tried to optimize our search routine to identify even more distantly related sequences and, at the same time, simplify the overall procedure. In our initial blastp searches for novel HHDH sequences, the results were dominated by SDR sequences due to the large number of available SDR enzyme sequences [10] and the overall high similarity of HHDH with SDR enzymes [9, 38]. In consequence, distantly related sequences might not have been included in the 20,000 sequences retrieved per blastp query. To circumvent this, PHI-BLAST can be used to detect more distantly related sequences which match a user-defined pattern [213]. To identify such a pattern, the MAFFT alignment of all known and novel HHDH sequences was inspected for the presence of conserved motifs (Figure 3).

As expected from the outlined identification protocol, all identified HHDHs necessarily possessed the Ser-Tyr-Arg catalytic triad with Tyr consistently separated by three amino acids from Arg (Figure 7B). This is in agreement with enzymes of the SDR superfamily, as the corresponding catalytic Tyr and Lys are also separated by three amino acids in more than 86% of the classified SDR enzymes which belong to either the classical, extended, or intermediate subfamily [10, 56]. Surprisingly, Ser always precedes Tyr by 12 amino acids in HHDH enzymes whereas the position of this upstream catalytic residue seems to be less conserved in SDR enzymes. For example, 184 of the 201 most similar sequences to HheC from UniProt/Swiss-Prot or PDB (E > 100) possess a catalytic Tyr which is separated by three amino acids from Lys and are thus likely classical, extended, or intermediate SDR enzymes. Interestingly, for 18 of these 184 SDR sequences, the upstream catalytic Ser is not separated by 12 amino acids from Tyr; instead, the distance between Ser and Tyr differs by up to two residues. Therefore, and in contrast to SDR enzymes, all identified HHDHs seem to possess a catalytic triad which is described by the pattern S-x(12)-Y-x(3)-R (Figure 7B) despite overall low sequence identities. Consequently, this catalytic triad motif can serve as seed pattern for PHI-BLAST searches.

In addition to the HHDH catalytic triad pattern, the conserved aromatic Phe or Tyr was utilized to infer another HHDH-specific pattern. As deduced from the sequence logos in Figure 7A, the position of both aromatic HHDH residues corresponds to the central Gly in the T-[AG]-x(3)-G-x-G motif. The latter motif constitutes a variation of the commonly observed nucleotide binding motif and was observed for a selection of the 718 most homologous SDR enzymes. The sequence logo for the corresponding HHDH enzymes revealed that, in contrast to the SDR motif, the pattern T-x(4)-[FY]-x-G can be defined instead, which is only present in HHDH sequences (Figure 7A).

With either pattern as seed for PHI-BLAST searches, sequences of each phylogenetic HHDH type (see below), namely HheA, HheB, HheC, HheD, HheE, HheF, and HheG, were used to query the nr database. As anticipated, these PHI-BLAST searches allowed for a much deeper look into sequence space since it was possible to retrieve sequences with an E value of 10 within the first few hundred hits. In contrast, E values of previous blastp searches did not exceed 0.01 within the first 20,000 results. These results again included SDR enzymes but, more importantly, also all novel HHDHs. Thus, each pattern alone is sufficient to recover the available HHDH sequences independent of the query sequence. Instead of assessing tens of thousands of sequences, it is thus possible to reduce the sequence pool to only a few hundred candidates by using either of the two patterns without compromising the quality of the results. On the contrary, this measure greatly reduces the (computational) alignment effort and might also generate higher quality MSAs due to the exclusion of unrelated (contaminating) sequences.
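The two seed patterns are written in PROSITE syntax, which can be checked locally against candidate sequences before or after a database search. The following minimal sketch converts the subset of PROSITE syntax used here into a regular expression and reports where a pattern occurs. The function names and the toy sequence are hypothetical, and the sketch does not reproduce PHI-BLAST itself, which additionally performs a pattern-anchored similarity search.

```python
import re

# One pattern element: 'x', a single residue, or a residue class such as [FY],
# optionally followed by a repeat count (n) or range (m,n).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern, e.g. 'S-x(12)-Y-x(3)-R', to a regex string."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = match.groups()
        regex = "." if symbol == "x" else symbol
        if low is not None:
            regex += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(regex)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return all 0-based start positions of `pattern` in `sequence`, overlaps included."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]


# Toy sequence (invented): both motifs embedded in filler residues.
toy = "MK" + "TAAAAFAG" + "LLLLL" + "S" + "A" * 12 + "YAAAR"
print(prosite_to_regex("T-x(4)-[FY]-x-G"))
print(find_motif(toy, "T-x(4)-[FY]-x-G"), find_motif(toy, "S-x(12)-Y-x(3)-R"))
```

A lookahead is used in `find_motif` so that overlapping occurrences are also reported, which matters when the flanking residues of one match could themselves start another match.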

Figure 7. Partial MAFFT alignment of previously known (*) and novel HHDH sequences. From the complete alignment (Figure 3), only excerpts are presented which show residues around A) the conserved Phe or Tyr or B) the Ser-Tyr-Arg catalytic triad residues in HHDHs as well as the corresponding residues in two homologous, experimentally verified SDR enzymes, FabG and DHRS4 (conserved residues highlighted in gray). The sequence logos above each of the alignment excerpts visualize the observed amino acid distributions in all HHDHs or in 718 homologous SDR sequences.


To our surprise, these PHI-BLAST searches identified 17 additional sequences in the nr database which represent putative novel HHDHs (Table 3). All these sequences (HheA6 through HheA9, HheD6 through HheD18) possessed the conserved aromatic Phe or Tyr (Figure 7A) together with the catalytic Ser-Tyr-Arg triad (Figure 7B) and originated from different alpha-, beta-, and gammaproteobacteria. Apparently, these sequences had meanwhile been added to the updated GenBank release and were now successfully recovered by our optimized PHI-BLAST queries.

Table 3. Sources and accession numbers of putative novel HHDHs.

HHDH     Organism                                  Accession
HheA6    bacterium Ec32                            CDO61292
HheA7    Sneathiella glossodoripedis               WP_025899379
HheA8    alpha proteobacterium Mf 1.05b.01         WP_029639308
HheA9    alpha proteobacterium MA2                 GAK44072
HheD6    Marinobacter nanhaiticus D15-8W           ENO15189
HheD7    Thauera sp. 27                            ENO82779
HheD8    Thauera aminoaromatica S2                 ENO87252
HheD9    Thauera phenylacetica B4P                 ENO98837
HheD10   Limnohabitans sp. Rim28                   WP_019427705
HheD11   Thiothrix disciformis                     WP_020394200
HheD12   Pseudomonas pelagia                       WP_022962804
HheD13   Betaproteobacteria bacterium MOLA814      ESS13801
HheD14   Gammaproteobacteria bacterium MOLA455     ETN91936
HheD15   Candidatus Competibacter denitrificans    CDI00977
HheD16   Methylibium sp. T29                       EWS52496
HheD17   Curvibacter gracilis                      WP_027476209
HheD18   Curvibacter lanceolatus                   WP_031254602

The sequences HheA6 to HheA9 and HheD6 through HheD18 shared 32% to 64% and 59% to 99% sequence identity with the other experimentally verified A- and D-type HHDHs, respectively. As seen before for the other novel sequences, none of the associated sequence records indicated any relation to HHDH enzymes. Although the enzymatic activities of these putative novel HHDH sequences have not been verified, all 17 sequences very likely exhibit typical HHDH activity since their sequence identities to experimentally verified enzymes are well within the range observed for the novel HHDHs discovered earlier in this study. Currently, these putative novel HHDH sequences are under investigation for their HHDH activities. No further HHDH sequence could be identified by increasing the E cut-off value (up to 1000) or by using any of the other HHDH sequences as query.

Although each pattern alone greatly reduces the total number of sequences to be assessed for the complementary sequence feature, SDR sequences were still included in all of the results. For the specific retrieval of only HHDH sequences, it is possible to combine both patterns to identify sequences which must agree in both of the aforementioned sequence characteristics. As both patterns are separated by 93 to 131 residues in the experimentally verified HHDHs, both motifs can be combined in the pattern T-x(4)-[FY]-x-G-x(93,131)-S-x(12)-Y-x(3)-R. Using the combined pattern with HheA, HheB, HheC, HheD, HheE, HheF, or HheG as PHI-BLAST queries, only the previously identified (Table 1) as well as the putative novel HHDHs (Table 3) are retrieved as relevant hits (E < 0.001). Among the results recovered with the combined pattern searches, other sequences had E values of at least 0.54 to 10, depending on the query sequence, and were often annotated as large membrane proteins (>1000 amino acids). As these other sequences largely exceeded the conventional lengths observed for SDR enzymes (<350 amino acids), they were not considered to represent enzymes with HHDH activity. Neither varying the gap length between 0 and 250 residues nor using any of the other HHDH sequences as query retrieved any additional sequence with both HHDH sequence features within the boundaries of relevant results. Despite the apparent impact that positional sequence information has on homology search results, subsequent PSI-BLAST searches did not result in the identification of any further HHDH sequences although all identified HHDH sequences were included during PSSM construction. Querying the updated env_nr database with separate or combined seed patterns did not result in the recovery of additional putative novel HHDH sequences.

In summary, the use of the restrictive HHDH-specific sequence patterns T-x(4)-[FY]-x-G or S-x(12)-Y-x(3)-R as well as their combination simplified the overall identification routine. In addition to the 20 earlier identified HHDHs, another 17 putative novel HHDH sequences were identified which likely possess HHDH activities (Table 3).

Phylogenetic classification of the HHDH enzyme family

Since we suspected that our diverse HHDH sequences might be challenging for reliable phylogenetic inference, we decided to use the FastME minimum evolution as well as the PhyML maximum likelihood tree building algorithms as both ranked highest in a recent benchmark of phylogenetic tree building methods [214]. As phylogenetic tree reconstruction can also suffer from substantial bias caused by the underlying MSA, two different algorithms, MAFFT [204] and PRANK+F [215], were used for MSA construction since both outperform other algorithms in recent benchmark studies [205, 206, 214]. As it has been suggested that gaps in MSAs carry substantial phylogenetic signal [206, 216], unaligned columns were not removed from any of the resulting MSAs. Overall, all generated phylogenetic trees were very similar independent of the underlying MSA or tree reconstruction algorithm, but the corresponding bootstrap values were higher for the PhyML phylogram on the basis of the PRANK+F MSA (Figure 8) than for the other trees (Figure 9).

In all trees, the previously known enzymes HheA, HheA2, and HheC clustered in a major clade together with the novel HHDHs HheA3 through HheA9. This clade of A-/C-type enzymes diverged early from the major clade of the remaining HHDHs. In this second major clade, the known HheB and HheB2 were grouped together with five novel B-type HHDHs of metagenomic origin (HheB3 through HheB7). In contrast, the remaining novel enzymes formed distinct additional phylogenetic branches expanding the previous classification of known HHDHs into A-, B-, and C-type enzymes [9]. In consequence, we propose to cluster the novel HHDHs in a total of four additional phylogenetic clades encompassing members of the D-, E-, F-, and G-type enzymes. For example, the five experimentally verified as well as the 13 putative novel D-type enzymes clustered together in one clade with shared ancestry to the B-type enzymes. Another clade was formed by the enzymes HheE through HheE5, which diverged earlier from the lineage of B- and D-type enzymes. Except for the clade of E-type enzymes, minor differences in topology and bootstrap support were observed between all four trees concerning each clade's individual members. Thus, it was difficult to conclusively analyze their true internal phylogenetic relation. Nevertheless, for the A-, B-, C-, D-, and E-type enzymes, the outlined phylogenetic grouping was consistently observed in all four trees with high bootstrap support.

Figure 8. Phylogram of previously known (*) and novel HHDH enzymes. The shown PhyML tree was constructed on the basis of a PRANK+F MSA which included the experimentally verified, homologous SDR enzymes DHRS4 and FabG as outgroup for rooting (percentages give bootstrap support at indicated nodes).

The branch points of HheF and HheG, however, varied depending on the MSA or tree building algorithm. In both FastME trees, for example, HheF branched after segregation of D-type HHDHs but prior to the branch point of B-type enzymes with reasonable bootstrap support (54% and 76%). The PhyML trees, on the other hand, placed the branch point of HheF before the segregation of B- and D-type enzymes with overall higher bootstrap confidence (86% and 90%). Consistently for all trees, low bootstrap confidence was observed for the placement of HheG (36% to 55%). In both trees built on the PRANK+F MSA, HheG branched prior to any other HHDH clade. In contrast, the FastME/MAFFT tree specified the HheG branch point prior to the major clade of A-/C-type enzymes, while HheG diverged within the second major clade of B- through F-type enzymes in the PhyML/MAFFT tree (Figure 9). Despite these inconsistencies, and since neither HheF nor HheG was placed within any other clade, we decided that HheF and HheG should become archetypes of their own phylogenetic subtypes, keeping in mind that, with the data at hand, their actual branch points cannot be determined with absolute certainty.

Figure 9. Alternative phylogenetic trees. Phylogenetic inference was based on A) PRANK+F MSA with FastME, B) MAFFT MSA with FastME, and C) MAFFT MSA with PhyML tree reconstruction. All trees were rooted with DHRS4 and FabG as outgroups (percentages give bootstrap support at indicated nodes).

Despite these minor variations and uncertainties, the overall classification into six different phylogenetic HHDH subtypes was observed for the majority of bootstrap trees independent of the MSA or tree building algorithm. This suggests that, in principle, the HHDH enzyme family is reliably represented by the phylogram in Figure 8.
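How a bootstrap percentage of the kind quoted above can be read is sketched below in Python. This is a minimal illustration under stated assumptions, not the PhyML or FastME implementation: rooted trees are encoded as nested tuples, the helper names `leaf_sets` and `bootstrap_support` are hypothetical, and the four replicate topologies with leaves a to d are invented toy data rather than results from this study.

```python
def leaf_sets(tree):
    """Return (leaves, clades) for a rooted tree.

    A tree is either a leaf name (str) or a tuple of subtrees.
    `clades` holds the leaf set below every internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, clades = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades |= child_clades
    clades.add(leaves)
    return leaves, clades


def bootstrap_support(clade, replicate_trees):
    """Percentage of replicate trees in which `clade` forms a monophyletic group."""
    target = frozenset(clade)
    hits = sum(target in leaf_sets(tree)[1] for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)


# Invented toy replicates (assumption): four rooted topologies on leaves a-d.
replicates = [
    (("a", "b"), ("c", "d")),
    ((("a", "b"), "c"), "d"),
    (("a", "c"), ("b", "d")),
    (("a", "b"), ("c", "d")),
]
support_ab = bootstrap_support({"a", "b"}, replicates)
```

Real bootstrap analyses resample alignment columns and usually compare bipartitions of unrooted trees; the rooted simplification here only shows why a clade recovered in most replicates receives a high percentage.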

Phylogenetic placement of HHDHs in relation to homologous SDRs

To elucidate the phylogenetic origins of HHDH enzymes, minimum evolution trees were constructed including all identified HHDH sequences as well as a broad selection of SDR sequences. In addition to all identified HHDHs (Tables 1 and 3), 718 unique homologous sequences were used for FastME tree building. These sequences originated from organisms of all three domains of life and possessed the typical Ser-Tyr-Lys catalytic residues found in SDR enzymes. Besides sequences with only putative SDR activity, a well-studied human SDR enzyme, the dehydrogenase/reductase SDR family member 4 (DHRS4) [217–219], as well as experimentally verified 3-ketoacyl-acyl carrier protein reductases from Vibrio harveyi (FabG) [220] and Burkholderia pseudomallei [221] were included. The resulting tree revealed that HHDH enzymes form a monophyletic clade which did not contain any SDR sequences (Figure 10). Bootstrap analysis confirmed the proposed branching for the majority of HHDHs (HheA through HheF) in 95% of the bootstrap replicas. As already observed for the phylogenetic classification of the HHDH enzyme family (see above), branching of HheG showed only low bootstrap support (42%). Still, the observed monophyletic clustering of HHDHs was confirmed for the majority of replica trees.

Figure 10. FastME tree constructed from a MAFFT MSA of all 42 HHDH (red) and 718 homologous SDR sequences.

Proposed nomenclature

As a consequence of the growing number of HHDH sequences, we propose to adopt a general naming scheme for genes and enzymes of the HHDH enzyme family on the basis of their clustering to any of the phylogenetic subtypes A through G and, if necessary, additional phylogenetic subtypes. Within each subtype, enzymes are then numbered consecutively according to their time of submission to public sequence databases such as GenBank. This nomenclature has already been implemented throughout this study (Tables 1 and 3) and should help to avoid future conflicts or inconsistencies. A repository of available HHDH enzyme sequences will be maintained online (http://tiny.cc/hhdhs) and will be updated regularly.
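The proposed naming rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `assign_names`, the accession labels, and the submission dates are hypothetical, and the convention that the first member of a subtype carries no number is inferred from names such as HheA, HheA2, and HheA3 used in this study.

```python
from collections import defaultdict
from datetime import date


def assign_names(records):
    """Map accession -> enzyme name.

    records: iterable of (accession, subtype_letter, submission_date).
    Members of a subtype are numbered by ascending submission date;
    the earliest member gets no number (e.g. HheA, then HheA2, HheA3).
    """
    by_subtype = defaultdict(list)
    for accession, subtype, submitted in records:
        by_subtype[subtype].append((submitted, accession))
    names = {}
    for subtype, members in by_subtype.items():
        for index, (_, accession) in enumerate(sorted(members), start=1):
            suffix = "" if index == 1 else str(index)
            names[accession] = f"Hhe{subtype}{suffix}"
    return names


# Hypothetical accessions and dates, for illustration only.
example = [
    ("acc3", "A", date(2013, 5, 1)),
    ("acc1", "A", date(2001, 1, 1)),
    ("acc2", "B", date(2001, 1, 1)),
    ("acc4", "A", date(2010, 1, 1)),
]
example_names = assign_names(example)
```

Sorting by submission date keeps existing names stable when new sequences are deposited later, since new members only ever receive the next free number.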

Discussion

The exponentially growing deposition of sequence information in public databases currently outpaces efforts to biochemically characterize novel biocatalysts [222]. Moreover, the annotation of a large portion of sequence records lacks proper information on their true activity or, in the worst case, bears no functional information at all [223]. For this reason, the accuracy of automated protein function annotation methods is crucially important [224]. To aid the in silico assignment of enzyme function, specific enzyme family fingerprints can facilitate the identification of novel biocatalysts. In order to expand the current short list of HHDH biocatalysts, we extracted HHDH-specific sequence motifs which, for the first time, allowed true HHDHs to be discerned from the large majority of homologous SDR sequences. First, the consensus motif S-x(12)-Y-x(3)-R represents the typical HHDH catalytic triad present in the five previously known and 37 novel HHDHs. This consensus motif is more precise than the less specific pattern S-x(7,17)-Y-x(3)-R, which has been used in the past, however without any reports of specific enzyme sequences or their HHDH activities [54]. Additionally, the motif T-x(4)-[FY]-x-G was deduced, which specifies conserved residues in the nucleophile-binding pocket architecture of HHDH enzymes that align with residues of the SDR nucleotide-binding motif. The combination of both features is highly effective in discriminating HHDHs from SDR enzymes, whereas the use of the HHDH catalytic triad pattern alone in PHI-BLAST searches also retrieved sequences which only possessed the typical T-G-x(3)-[GA]-x-G nucleotide-binding motif found in SDR enzymes. Surprisingly, among the sequences with an HHDH catalytic triad, no amino acid other than the central Phe/Tyr typical for HHDHs or Gly/Ala typical for SDRs was observed at the corresponding position of the nucleophile-binding motif.
This limited diversity was unexpected as Fox and coworkers observed at least Leu or Ile as non-detrimental substitutions at position F12 in HheC mutants [47]. In wild-type HHDHs, the aromatic rings of Phe or Tyr might provide an evolutionary benefit by stabilizing the cleaved anion during epoxide ring formation and by positioning the incoming anion during epoxide ring opening. Due to the still relatively small number of identified HHDH sequences, however, variations of both motifs might be present in hitherto undescribed enzymes with HHDH activities. With the combination of both restrictive HHDH-specific sequence patterns, it was possible to precisely identify true HHDH sequences present in public databases among the vast number of similar SDR sequences. These analyses represent a critically important effort to functionally characterize the enormous number of predicted genes that have unknown or incorrectly annotated function [223, 225].

Sequence data of metagenomic origin in particular can be a viable source of novel biocatalysts, but our results show that annotated sequences should only be utilized after careful inspection. Eight out of the nine HHDH sequences from environmental DNA sources were apparently annotated with likely incorrect start codons. Evidently, the automated gene annotation algorithms in use require further development and optimization, especially for challenging environmental sequence data which lack host-specific translation initiation signals such as 16S rRNA sequences.

Of all novel (Table 1) and putative novel (Table 3) HHDHs, only the coding sequence of HheA4 appeared to be part of a degradative operon since it was flanked by an epoxide hydrolase gene upstream and a glycerol kinase gene downstream. The concerted activity of all three enzymes allows for the utilization of compounds such as 3-chloro-1,2-propanediol via glycidol and eventually glycerol, a pathway which has been proposed for the HHDH-containing Arthrobacter sp. strain AD2 and Agrobacterium radiobacter AD1 [16]. For the latter strain, the epoxide hydrolase gene echA was cloned [202] and the resulting enzyme exhibited only 31% sequence identity with the epoxide hydrolase from Arthrobacter sp. JBH1, a value similar to the identity observed for the respective HHDHs (33%).

Regarding the substrate specificity of each novel HHDH towards the small aliphatic substrate 1 in comparison to the larger aromatic substrate 2, a first conclusion might be drawn from Figure 6. Most HHDHs converted both tested chloroalcohols, but enzymes of the D- and E-type exhibited a relative preference for the smaller aliphatic substrate 1 over the aromatic substrate 2. In contrast, a larger portion of A- and B-type HHDHs showed a significantly higher relative activity in the conversion of aromatic chloroalcohol 2 than of aliphatic chloroalcohol 1. Only for HheB4, HheB6, and HheF were the overall observed activities on 1 and 2 too low to draw any conclusion on their substrate preferences. In the case of HheB4 and HheB6, only very low amounts of soluble protein were obtained upon heterologous expression in E. coli, explaining their low activities (Figure 5). In the case of HheF, however, a significant band of soluble protein can be observed on SDS-PAGE. Despite its rather low activity on chloroalcohol 1, HheF exhibited high activity in reactions using bromoalcohol 3 (data not shown). Hence, HheF seems to prefer bromo- over chloro-substituted haloalcohol substrates.

Instead of the traditional classification of HHDHs based on their biochemical and sequence similarities [9], classification of HHDH family enzymes was determined by phylogenetic methods in this study. This approach is especially advantageous in view of exponentially growing public databases in which more and more members of this enzyme class will certainly become available over time. Moreover, this objective means of classification has been used, for example, for the unrelated but similarly diverse haloalkane dehalogenase enzyme family [226]. While our phylogenetic classification efforts were consistent with the previous HHDH classification [9], several novel HHDHs did not cluster with previous subtypes and therefore had to be grouped into four additional enzyme subtypes. Due to the still low number of HHDH enzymes, however, phylogenetic relationships of the HHDH enzyme family might be affected by future discoveries of additional family members.

Previously, it has been discussed that HHDH enzymes segregated very early from SDR enzymes due to generally observed low sequence identities [54, 58]. Indeed, our phylogenetic analyses suggest that HHDH enzymes are only distantly related even to their most homologous SDR enzymes. As measured by branch lengths from the root of SDR enzymes, the phylogenetic distance for any HHDH enzyme is larger than 0.7 amino acid exchanges per residue (Figure 10). To put this distance into perspective: a recent analysis of DHRS4 and its paralog, the dehydrogenase/reductase SDR family member 2 (DHRS2), concluded that both human genes diverged from each other before formation of the mammalian clade [227]. Here, the phylogenetic distance accumulated since the divergence of both SDR enzymes is only 0.2 amino acid exchanges per residue (not shown). Hence, the large phylogenetic distances to any SDR homolog that we have observed for all known and newly identified HHDHs indicate that previous assumptions correctly reflect the evolutionary path of HHDH enzymes.

Conclusions

Public sequence databases hold an immense treasure of biotechnologically relevant enzymes. In the past, many biotechnologically important enzymes have been successfully identified through in silico enzyme discovery. In contrast to enzyme classes which can be found in many different organisms, all of the few previously known HHDH enzymes have so far been found in very few species obtained only after microbial enrichment techniques [188]. HHDHs constitute only a minute fraction of the SDR enzyme superfamily and, on average, only one HHDH enzyme sequence can be found among more than 10⁶ of GenBank's non-redundant protein sequences. Despite these facts, we could show that it is still feasible to extract such rare enzymes from sequence databases after thorough sequence analysis and identification of exclusive sequence motifs. With the help of the conserved sequence fingerprints T-x(4)-[FY]-x-G and S-x(12)-Y-x(3)-R, identifying novel HHDH enzymes will in future be much faster and more effective than with the more time-consuming classical microbiology and molecular biology approaches. Especially in view of advancing sequencing technologies, effective in silico methods for the discovery of novel enzymes will become more and more important. Ultimately, we are convinced that the novel halohydrin dehalogenases will enable interesting applications enriching the synthetic chemist's toolbox.
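As an illustration of how the two fingerprints could be applied to plain protein sequences outside of PHI-BLAST, the following Python sketch translates the PROSITE-style patterns into regular expressions and tests a sequence for both. It rests on several assumptions: the helper names `prosite_to_regex` and `has_hhdh_fingerprint` are hypothetical, only the simple pattern elements used in this chapter are supported, the catalytic triad motif is required downstream of the pocket motif without any limit on the gap length, and the two test sequences are artificial strings, not real enzymes.

```python
import re

_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern such as 'T-x(4)-[FY]-x-G'."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        token = "." if residue == "x" else residue  # x = any residue
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


NUCLEOPHILE_POCKET = re.compile(prosite_to_regex("T-x(4)-[FY]-x-G"))
CATALYTIC_TRIAD = re.compile(prosite_to_regex("S-x(12)-Y-x(3)-R"))


def has_hhdh_fingerprint(sequence: str) -> bool:
    """True if the pocket motif occurs and a triad motif follows downstream (assumed order)."""
    pocket = NUCLEOPHILE_POCKET.search(sequence)
    if pocket is None:
        return False
    return CATALYTIC_TRIAD.search(sequence, pocket.end()) is not None


# Artificial toy sequences (assumption), padded with alanine.
hhdh_like = "M" + "A" * 5 + "TAAAAFAG" + "A" * 20 + "S" + "A" * 12 + "Y" + "AAA" + "R" + "A" * 5
sdr_like = "M" + "A" * 5 + "TGAAAGAG" + "A" * 20 + "S" + "A" * 12 + "Y" + "AAA" + "K" + "A" * 5
```

A regular expression scan of this kind ignores alignment context and E values, so it can only serve as a quick pre-filter ahead of the alignment-based inspection described in this chapter.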

Materials and methods

Database mining. The sequences of HheA, HheB, and HheC (Table 1) were used as queries for blastp searches [208] of the nr database of GenBank (release 195) [228]. For each query, 20,000 sequences were retrieved and used together with HheA, HheB, and HheC for the construction of multiple sequence alignments (MSAs) with MAFFT (FFT-NS-2) [229]. In the resulting alignments, sequences were dismissed if they possessed the typical Ser-Tyr-Lys catalytic residues of SDR enzymes which aligned with the Ser-Tyr-Arg catalytic triads of HheA/HheA2 (S134-Y148-R151), HheB/HheB2 (S127-Y139-R143), and HheC (S132-Y145-R149). After removal of those putative SDR sequences, the reduced pool of remaining sequences was re-aligned using MAFFT (FFT-NS-2). Afterwards, only sequences with a catalytic triad of Ser-Tyr-Arg, as present in known HHDH enzymes, were selected for construction of a new MAFFT (L-INS-i) [230] MSA. From the latter alignment, sequences were only considered to be putative novel HHDHs if they possessed an aromatic Phe or Tyr which aligned with F12, Y27, and F12 from HheA, HheB, and HheC, respectively. All putative novel HHDH sequences were used as queries in subsequent search routines which consisted of blastp searches, construction of MSAs, and their inspection for the presence of the typical HHDH catalytic triad in combination with the conserved aromatic residue close to the N-terminus as outlined above. This strategy was continued until no additional putative HHDH sequences could be identified in the nr database with the protocol specified. Similarly, the env_nr database of GenBank (release 195) was searched for putative novel HHDH sequences by using the known and putative novel HHDH sequences as blastp queries. Again, up to 20,000 sequences per query were retrieved with blastp and an expect threshold (E) of 10. Before MSA construction for identification of conserved HHDH sites as outlined above, env_nr sequences shorter than 180 residues (80% of HheE5) were removed from the environmental sequence pool. For each HHDH of metagenomic origin, the complete nucleotide record was taxonomically classified via the Naïve Bayes Classification (NBC) tool webserver [231].

Gene synthesis and cloning. Prior to gene synthesis, the deposited protein sequences of all putative novel HHDHs were first inspected for the presence of an alternative ATG start codon which was also preceded by a putative Shine-Dalgarno (SD) sequence downstream of the annotated transcription start. If the resulting shorter protein sequences also contained the conserved aromatic Phe or Tyr residue (see above), these corrected protein sequences were used as the basis for gene synthesis, activity tests, and phylogenetic analysis. Synthetic genes were ordered from Life Technologies (Darmstadt, Germany) after back translation of the curated protein sequences and codon optimization for E. coli with GeneOptimizer [232]. The synthetic genes were excised from the received plasmids by restriction with NdeI and either HindIII or XhoI, followed by ligation with T4 DNA ligase (all DNA modifying enzymes from New England Biolabs, Frankfurt, Germany) into linearized expression vector pET-28a (Merck, Darmstadt, Germany). After transformation into E. coli DH5α (Life Technologies), recombinant plasmid DNA was prepared with the NucleoSpin Plasmid kit (Macherey Nagel, Düren, Germany) and was sent for sequencing at GATC Biotech (Konstanz, Germany).

Expression and activity assays. Expression plasmids containing the different HHDH genes were transformed into either E. coli BL21(DE3) (Life Technologies) or C43(DE3) (Lucigen Corporation, Middleton, WI, USA) for heterologous protein expression. Cultures of 20 mL TB medium containing 50 mg L⁻¹ kanamycin and 0.2 mM IPTG were each inoculated with 10% (v/v) of a respective E. coli pre-culture and incubated at 20, 30, or 37°C for 7 to 24 h. Afterwards, cultures were centrifuged and pellets were stored at -20°C. For reference, the known enzymes HheA2, HheB2, and HheC [9] were recombinantly expressed in E. coli Top10 (Life Technologies) from vector pBAD (Life Technologies) in TB medium with 100 mg L⁻¹ ampicillin and 0.02% (w/v) L-arabinose. To prepare cell-free extracts (CFEs), cell pellets were resuspended in 1.2 mL 25 mM Tris-SO₄ buffer, pH 7.5, and disrupted by sonication. SDS-PAGE and Western blot analyses of CFEs were performed to analyze heterologous expression of the different HHDHs. In SDS-PAGE, about 10 µg of total protein present in CFEs was separated in 12% polyacrylamide gels for 45 min at 200 V. In Western blots, His-tagged proteins were detected using the Ni-NTA HRP Conjugate (Qiagen, Hilden, Germany) and the Pierce ECL Western blotting substrate (Thermo Fisher Scientific, Rockford, IL, USA) according to the manufacturers' instructions.

After centrifugation, 200 µL of each CFE was added to 600 µL 25 mM Tris-SO₄ buffer, pH 7.5, containing one of the substrates 1,3-dichloro-2-propanol (1), 2-chloro-phenylethanol (2), or 1,3-dibromo-2-propanol (3) at 10 mM final concentration for activity assays. Reactions were incubated at 30°C and 50 µL samples were taken after 5, 20, and 60 min of incubation to monitor dehalogenase activity using the halide release assay as described elsewhere [177]. In addition, after 60 min, the remaining 650 µL of each reaction was extracted once with 600 µL methyl tert-butyl ether containing 0.1% (v/v) dodecane as internal standard. Organic extracts were dried over magnesium sulfate and analyzed on a GC2010 gas chromatograph (Shimadzu, Duisburg, Germany) equipped with a Supreme 5ms column (CS Chromatography Service, Langerwehe, Germany). GC analysis of reactions containing substrates 1 and 3 was carried out with a temperature gradient starting at 40°C for 1 min, heating at 10°C min⁻¹ to 120°C, and afterwards at 20°C min⁻¹ to 300°C. For reactions containing substrate 2, the temperature gradient started at 80°C for 1 min, heating at 10°C min⁻¹ to 160°C and finally at 20°C min⁻¹ to 300°C. Substrates eluted at 6.6 min (1), 8.6 min (2), and 9.4 min (3), whereas the corresponding products were detected at 3.7 min (epichlorohydrin, 4), 5.8 min (styrene oxide, 5), and 4.8 min (epibromohydrin, 6), respectively. Product formation was monitored based on relative peak areas and quantified with the help of standard curves. All chemicals were obtained from Sigma-Aldrich (Steinheim, Germany) at the highest available purity.

HHDH fingerprinting. Sequence logos were generated with WebLogo [233] on the basis of a MAFFT MSA either with all HHDH or with 718 homologous SDR sequences (see below). For HheC, the precomputed BLAST link (BLINK) at the National Center for Biotechnology Information website was used to collect sequences with an E cutoff of 100 from the UniProtKB/Swiss-Prot [234] and PDB [235] databases. Patterns T-x(4)-[FY]-x-G, S-x(12)-Y-x(3)-R, or combinations thereof were used as seeds in PHI-BLAST [213] searches of the GenBank nr database (release 202).
Here, HheA, HheB, HheC, HheD, HheE, HheF, and HheG were used as queries with an E threshold of 10 and the resulting alignments were inspected for sequences with both conserved HHDH sequence features as outlined above. From the PHI-BLAST results of the combined pattern searches, PSI-BLAST searches [208] were initiated after selecting all identified HHDHs for building the position-specific scoring matrix (PSSM).

Phylogenetic analysis. MAFFT (L-INS-i) and webPRANK [236] MSAs with all previously known and novel HHDH sequences were used for minimum evolution and maximum likelihood tree reconstruction. In detail, minimum evolution trees were generated by FastME (version 2.07) [237] with NNI and SPR tree topology refinement from a PROTDIST distance matrix (program included in the PHYLIP package, version 3.695) [238] using the JTT substitution matrix. For maximum likelihood trees, PhyML (version 3.1) [239] was used with the WAG substitution matrix, NNI and SPR tree refinement, and 5 random starting trees. Trees were rooted with the help of the SDR enzymes DHRS4 (UniProt ID: Q9BTZ2) and FabG (UniProt ID: P55336), which were included in MSAs and tree reconstruction as outgroup. The reliability of phylogenetic trees was assessed by bootstrap analyses with 100 replicas for PhyML trees or 1000 replicas for FastME trees. For phylogenetic placement of HHDH sequences with respect to other SDR sequences, the 100 most homologous SDR sequences each were collected from BLINK results for a diverse selection of HHDHs (HheA, HheA3, HheA5, HheB, HheB5, HheC, HheD, HheD2, HheD5, HheE, HheE2, HheE3, HheE5, HheF, and HheG). In total, 718 SDR sequences which possessed a Ser-Tyr-Lys catalytic triad were used for the construction of a FastME minimum evolution tree based on a MAFFT MSA as outlined above, and the reliability of the resulting tree was assessed with 100 bootstrap replicas.
