close
close

Regulatory sequence-based discovery of anti-defense genes in archaeal viruses

A consensus regulatory sequence precedes early expressed viral genes

To date, only four Acrs, AcrID1, AcrIIIB1, AcrIIIB2, AcrIII-1 and a putative Aca, Aca8, have been identified in archaeal viruses, and all are sparsely distributed (Supplementary Fig. 1)26,28,33,34. The acrID1, acrIIIB1, acrIIIB2 and aca8 genes are surrounded by several small, mostly single genes that lack functional annotation (Fig. 1A)27. Consequently, the guilt-by-association approach is barely applicable to Acr identification in archaeal viruses due to the small number of known Acrs and the preponderance of uncharacterized proteins (Supplementary Fig. 1).

Fig. 1: A regulatory sequence encompassing a strong promoter precedes Acr-related genes.
figure 1

A Illustration of all Icerudivirus SIRV2 genes at both termini including those encoding AcrIIIB1 (blue), AcrID1 homologs (red), Aca8 (yellow), virus-associated pyramid (vap, brown) and hypothetical proteins (gray). Arrow blocks with black lines represent SIRV2 early genes as determined based on their expression pattern (Supplementary Fig. 2A). Dashed arrows indicate the presence of a strong regulatory sequence upstream of the gene. B Consensus motif derived from MEME analysis of the regulatory sequences preceding 11 of 13 SIRV2 early genes. C Illustration of the general methodology employed here to identify early-genes/anti-defense genes, described in detail in the text. D MEME-RSAT prediction of potential ADGs in representative archaeal viral genomes SIFV, Hoswirudivirus ARV2, Mexirudivirus SMRV1 and SMV2. E Consensus sequence motifs from selected individual archaeal viruses.

In an effort to develop an alternative approach to identify Acrs, we made two intriguing observations. First, SIRV2 aca8 (gp01/gp54) and both known Acr genes, acrID1 (gp03) and acrIIIB1 (gp48) are expressed immediately post infection of Sulfolobus islandicus LAL14/1 (Supplementary Fig. 2A)35; second, most of the small hypothetical genes at the SIRV2 genomic termini, including acrID1 and acrIIIB1, share a highly similar regulatory sequence, in a sharp contrast to the very low sequence conservation of the nucleotide sequences of the genes themselves and limited similarity among the amino acid sequences of the encoded proteins (Fig. 1B). The identified consensus regulatory sequence encompasses a core promoter region, with a TATA-box (TTTAWATA) downstream of a TFB recognition element (BRE), a purine rich sequence reported previously as the strongest in binding the archaeal transcription factor TFB (the archaeal homolog of eukaryotic transcription factor TFIIB)36,37. Such promoter sequence is also found in aca8, albeit within the 5’ end coding sequence of the predicted ORF (83 aa). Multiple sequence alignment of Aca8 homologs showed that the actual translation initiation site is located 84 nucleotides downstream of the currently annotated start codon of gp01/gp54 (Supplementary Fig. 2B). Upon re-analysis of the SIRV2 transcriptome, we found that the shorter version of the gp01 transcript encoding 55 aa is the dominant species and only a negligible amount of reads corresponding to the longer transcript was detected in the late stages of viral infection38. Thus, the high-level transcription of aca8, acrID1, acrIIIB1 and an additional 8 genes among the 13 early genes of SIRV2 appears to be driven from the highly conserved promoter within the consensus regulatory sequence (Fig. 1B). The subsequent repression of the early genes that is observed 1 hour post infection is likely to be due to binding of the predicted wHTH anti-CRISPR associated protein Aca839 to the regulatory sequence (Supplementary Fig. 2C) (Fig. 1A), similar to the previously described bacterial Acas31,40. Moreover, clustering with acrs/aca suggests that the other early transcribed small genes also encode anti-defense proteins. Taken together, these observations prompted us to use the conserved regulatory sequences to predict new anti-defense genes.

Identification of putative ADGs guided by conserved regulatory sequences

We designed a pipeline for the identification of ADGs in genomes of archaeal viruses (Fig. 1C). Briefly, a 100 bp sequence upstream of the start codon was retrieved from viral single genes, i.e., genes that are not predicted to belong to an operon. Sequences from the same viral genome were aligned using MEME-suite followed by a matrix scan with RSAT to identify genes carrying highly conserved regulatory sequences with BRE and TATA-box characteristic of a strong promoter (see Methods). Only genes that met both of the following criteria were selected as putative ADGs: (1) the presence of a predicted strong promoter; (2) sharing a putative regulatory sequence surrounding the strong promoter in the same viral genome.

We first applied the analysis to aca8-carrying rudiviral genomes and predicted 127 ADGs from 17 rudiviruses. Subsequently using the same criteria, we screened the genomes of other members of Rudiviridae, i.e, those lacking aca8, all members of Lipothrixviridae, Bicaudaviridae and other unclassified viruses41,42,43,44,45,46, leading to the identification of an additional 105 putative ADGs. Among the 232 predicted ADGs, 175 showed no detectable sequence similarity to known archaeal Acrs or Aca8, or any other known proteins (Supplementary Data 1). As anticipated with this strategy, most genes encoding AcrIIIB1 and AcrID1 homologs were identified as ADGs in several viral genomes, clustered with other ADGs, as exemplified by Sulfolobus islandicus filamentous virus (SIFV) and Acidianus rod-shaped virus (Hoswirudivirus ARV2) (Fig. 1D). Notably, we predicted ADGs in viruses, such as Sulfolobales Mexican rudivirus 1 (Mexirudivirus SMRV1), that encode no homologs of known Acrs (Fig. 1D, E and Supplementary Fig. 3). Analysis of gene location in archaeal viruses shows that rudiviruses encode (putative) ADGs at both ends of their linear genomes, whereas lipothrixviral ADGs are predominantly located close to one end of the linear genomes (Fig. 1A, D), and ADGs of bicaudaviruses such as Sulfolobus monocaudaviruses SMV2, SMV3 and SMV4 are located randomly within their circular genomes (Supplementary Fig. 4A).

Similarly, we identified early gene regulatory sequences among 19 putative ADGs (including one aca8) within 6 Sulfolobales and Thermofilum MAGs (Metagenome-Assembled Genomes) (Fig. 2, Supplementary Data 1). Among these, a 535 aa ADG (MCI4409744.1) is a fusion of AcrIIIB1 and an unknown domain, with homologs in other MAGs, SMV2 and SIRV isolate V3 (Supplementary Fig. 5). MCI4409744.1 could be one of the largest Acrs, perhaps, endowed with activities against multiple host defense systems.

Fig. 2: ADGs in metagenome contigs.
figure 2

Sulfolobales and Thermoprotei contigs, likely to be of viral or plasmid origin, encode the predicted ADGs (shown in green). Two different regulatory sequences were identified from four Thermoprotei contigs and are depicted with dashed arrows upstream of the associated ADGs, in orange and in light blue, respectively. The corresponding consensus regulatory sequence is indicated within a frame of the same color.

Archaeal viruses lacking Aca8 homologs have diverse regulatory sequences associated with their putative ADGs (Fig. 1D, E and 2). Although the core promoter, composed of the TATA-box and the BRE element, is highly similar among the consensus sequences, the upstream and downstream regions vary significantly between viruses, which is especially evident within Mexirudivirus SMRV1 and SMV2 motifs (Fig. 1E, Supplementary Fig. 3). The diversity of the regulatory sequences and the absence of an Aca8 homolog together imply that these viruses encode different regulatory proteins. As an example, the consensus regulatory sequence identified in SMV2 and the Sulfolobales archaeon isolate (MAG: JAKMAB010000038.1) each contains inverted repeats overlapping the BRE element, likely a binding site for a transcriptional repressor (Fig. 1E and Fig. 2). Furthermore, both SMV2 and the metagenome sequence encode a small protein (SMV2 gp37 and L7H13_norf1, respectively) which is the only predicted ADG shared by these two genomes (Supplementary Data 1). A coiled-coil structure predicted by AlphaFold2, together with the presence of leucine residues spaced at specific intervals point to SMV2 gp37 homologs as leucine zipper proteins, likely binding the inverted repeats to repress transcription of ADGs (Supplementary Fig. 4B).

Some of the ADGs have homologous genes that are not associated with the conserved regulatory sequences, which are therefore not included as ADGs despite amino acid sequence homology with ADG proteins. Next, we calculated motif prevalence, that is, the fraction of homologous genes possessing the regulatory sequence, for all known and newly identified ADGs (Supplementary Data 1). The motif prevalence was 0.75, 0.51, and 0.13 for genes encoding AcrIIIB1, AcrID1 and AcrIII-1 homologs, respectively, and 1 for genes encoding Aca8 homologs, i.e., all aca8 genes in individual viruses are preceded by the corresponding virus-specific regulatory sequences. While AcrIIIB1 homologs are distributed among many members of two viral families, Lipothrixviridae and Rudiviridae, mostly as a single copy per genome (Supplementary Fig. 1), 26 of the 35 identified AcrID1 homologs are encoded in close proximity by only four members of Rudiviridae: Icerudivirus SIRV1 (SIRV1), Icerudivirus SIRV1 variant XX, SIRV2, and Icerudivirus SIRV3. This suggests a much higher level of gene duplication and possible functional diversification of AcrID1 which explains the difference in motif prevalence between the AcrIIIB1 and AcrID1 genes. The near lack of association of the ADG-motif with acrIII-1, which has one of the lowest motif prevalence among all anti-defense genes, correlates well with the inability of AcrIII-1 (SIRV2 gp37) to function as an Acr in a natural setting47.

Thus, we developed a new method for predicting ADGs in genomes of archaeal viruses. Using this approach, we identified 251 proteins, including known Acrs, as ADG or ADG-associated genes of which 193 were novel ADGs from 37 archaeal viruses and 6 MAGs. A comparison of the protein sequences encoded by the predicted ADGs to protein sequence databases yielded no homologs for most of them which is not surprising given the generally rapid evolution of anti-defense proteins. Nonetheless, in several of the ADG encoded proteins, helix-turn-helix (HTH) or ribbon-helix-helix (RHH) domains were detected suggesting that their mode of action involves DNA binding (Supplementary Data 1).

Early gene regulatory motif in Fuselloviruses

The high conservation of the early gene regulatory sequence motivated us to search for similar sequences among temperate archaeal viruses and host genomes. First, we noticed a high sequence identity between the viral early gene promoter and that of the CRISPR-Cas subtype I-A interference gene cluster among Sulfolobales that we studied previously (Supplementary Fig. 6). A RSAT matrix search within the S. islandicus LAL14/1 genome identified identical TATA-box and BRE-element in host housekeeping genes encoding the chromatin protein Cren7, S-layer protein SlaA, chaperonin GroEL, a transcriptional regulator, and 8 other host genes. All these genes are known or have the potential to be highly expressed (Supplementary Data 2)35,48, reinforcing the hypothesis that the ADG motifs (early gene motifs) contain a strong promoter.

Fuselloviridae is a family of temperate archaeal viruses with 20 members, SSV1 to SSV22. SSV genes generally form operons and were found to be transcribed into six dominant transcripts and two weak transcripts, referred to as T1-T849. Previously, the BRE sequence of the T6 operon, when placed upstream of TATA-box of the rRNA gene, has been shown to increase rRNA transcription in vitro by about 8 fold, approximately to the level of transcription observed with the T6 promoter36. The preferred binding site for TFB contains a consensus sequence of G at −6 and A at −3 positions upstream of the TATA-box start position36. We identified a consensus sequence with characteristics of a strong promoter among the regulatory sequences of T6 operons from all of the 20 SSVs (Fig. 3A). Together, these features suggest that ADGs are under the control of strong promoters that drive their high expression.

Fig. 3: Conservation of an early gene motif in Fuselloviruses.
figure 3

A Conservation of TATA-box and BRE element within the regulatory sequence of T6 transcript of 20 Sulfolobus spindle-shaped viruses. B Organization and diversity of the predicted ADGs in the loci encoded by T6 transcript of 20 sequenced SSVs. Homologs are color-coded and their function is expected.

From the 20 SSVs we identified 165 potential ADGs. All four acrID1 homologs in SSVs that were identified earlier are part of the T6 operons in SSV2, SSV3, and SSV6 (Fig. 3B)28. Clustering with known acr genes and the high expression level of T6 transcripts during viral infection strongly support the anti-defense function of the novel SSV ADGs.

In total, in all archaeal virus genomes analyzed (57 individual viral genomes and 6 metagenome-assembled genomes), we predicted 354 novel ADGs apart from the previously known Acrs and Acas (Supplemental Table 1). The majority of the ADGs (251) were encoded by single genes, i.e. each with its own promoter and regulatory sequence. Including previously characterized ADGs, all viruses analyzed here carried between 3 and 12 ADGs (Supplementary Fig. 7). Based on amino acid sequence similarity, the ADGs were classified into 116 protein families, ADG.01-ADG.89 encoded as single genes, mADG.01-10 from MAGs, and ADGSSV.01-ADG.SSV.17 in SSV operons. Among the 116 protein families, 42 contained 2 or more members and 74 were singletons. For the 42 non-singleton ADG families, most had an alignment coverage >= 80% with only 8 families falling below this threshold. For most of the families, the amino acid identity (pid) among the within ranged between 20 and 100%, but 11 families included members with a lower percentage identity. To put these findings into context, the AcrID1 family has a minimum coverage of 84.4% and a minimum pid of 7.3%, whereas the AcrIIIB1 family has a minimum coverage of 84.2% and a minimum pid of 20.2%. ADG families with extensive variation in coverage might include members with different domain architectures that could be involved in different anti-defense function.

A CRISPR-Cas subtype I-A inhibitor in SIFV2 virus

To validate our approach to ADG prediction, we attempted to experimentally identify an inhibitor(s) of subtype I-A CRISPR-Cas system (AcrIA) from the pool of the predicted ADGs (Supplementary Data 1). CRISPR-Cas I-A systems are widespread in archaea, but no AcrIA inhibiting CRISPR interference has been reported. We sought to identify such an Acr and, for this purpose, chose a member of the Lipothrixviridae family, Sulfolobus islandicus filamentous virus 2 (SIFV2) (KX467643). SIFV2 is a lytic virus that has not been extensively characterized. It encompasses a cluster of six genes associated with the strong early promoter within the consensus regulatory sequence (see below). Given the almost ubiquitous presence of CRISPR-Cas I-A in Sulfolobus, we hypothesize that SIFV2 encodes an AcrIA although in silico analysis revealed only an acrIIIB1 homolog (Fig. 4A).

Fig. 4: CRISPR-Cas subtype I-A inhibitor in SIFV2.
figure 4

A Predicted ADGs (top panel) and consensus early gene motif (bottom panel) in SIFV2. B Growth curves of S. islandicus LAL14/1 Δcas6(I-D) carrying either an empty plasmid (left panel) or a plasmid encoding SIFV2 gp15 (right panel). Cultures were either not infected (uninfected) or infected with the parental virus (SIRV2M), ΔacrIA (SIRV2MΔgp45gp47), ΔacrIIIB1 (SIRV2MΔgp48) and ΔacrIAΔacrIIIB1 (SIRV2MΔgp45gp48). Data shown are mean of three biological replicates, represented as mean ± SD. Source data are provided as a Source Data file. C Virus titer of infected cultures from B at 72 h post viral infection. PFU/ml: plaque forming units per ml of the culture supernatants. Data shown are mean of three biological replicates, represented as mean value ± SD. A two-tailed unpaired t-test of plaque forming data was used to calculate P values; *P < 0.05, ns – not significant; P = 0.7898, 0.0153, 0.4115, and 0.3373 (from left to right). Source data are provided as a Source Data file. D Spot assay of serially diluted virus samples on S. islandicus LAL14/1 Δcas6(I-D) carrying either an empty plasmid or a plasmid encoding SIFV2 gp15. Dilution fold is indicated on top. The image is representative of five independent experiments. Source data are provided as a Source Data file.

To screen the Acr activity of the six SIFV2 ORFs, gp09, gp10, gp14, gp15, gp16, and gp17, we cloned them individually into the SulfolobusE. coli shuttle vector pEXA under the control of the weak arabinose promoter, ParaS250. The plasmids were then transformed individually into the strain S. islandicus LAL 14/1 Δcas6(I-D) (hereafter referred to as Δcas6(I-D)) where deletion of the I-D cas6 gene resulted in inactivation of subtype III-B cmr-γ and subtype I-D whereas subtype I-A and subtype III-B cmr-ɑ remained functional33,51. The transformants were infected with four viruses with varied susceptibility to the subtype I-A and subtype III-B cmr-ɑ CRISPR-Cas systems. Among the four viruses, SIRV2M (parental strain) is resistant to both subtype III-B and subtype I-A CRISPR targeting28,33 and its knockout strain SIRV2MΔgp48acrIIIB1) lost resistance to type III-B targeting33. The deletion of a fragment encoding three genes (SIRV2 gp45-gp47, adjacent to the AcrIIIB1-coding gene gp48) resulted in a virus susceptible to subtype I-A CRISPR targeting which was therefore termed SIRV2MΔgp45-47acrIA)52. The fourth strain SIRV2MΔgp45-48acrIAΔacrIIIB1) lacks all four genes and is susceptible to both subtype III-B and subtype I-A CRISPR-Cas targeting.

A preliminary screening employing the weak arabinose promoter showed that among the six tested genes, only gp15 caused mild host growth retardation upon infection with ΔacrIA mutant, indicative of CRISPR-Cas inhibition and ensuing virus propagation (Supplementary Fig. 8). Subsequently, the Acr activity was further demonstrated using a strain expressing gp15 from a strong arabinose promoter, Δcas6(I-D) pgp15araS-SD. In comparison to the empty vector, pgp15araS-SD restored the infectivity of ΔacrIA as shown by the significant growth inhibition of the host. However, pgp15araS-SD failed to restore the infectivity of ΔacrIAΔacrIIIB1 which, in comparison to ΔacrIA, lacks acrIIIB1. Therefore, the inhibitory activity of gp15 is restricted to subtype I-A (Fig. 4B). Virus titers from the cultures were quantified to verify that growth retardation was indicative of virus propagation (Fig. 4C). Indeed, pgp15araS-SD enabled the propagation of ΔacrIA reaching a level similar to that of the parental virus. To further validate these results, we performed spot assays with serial dilutions of the four viruses on a lawn of host carrying either the empty vector or pgp15araS-SD. Except for the parental virus, propagation of the viruses was inhibited in the strain carrying the empty vector (Fig. 4D, left panel). In the strain carrying pgp15araS-SD (Fig. 4D, right panel), the infectivity of ΔacrIA was specifically restored by about four orders of magnitude. Taken together, these results demonstrate that SIFV2 gp15 is an inhibitor of the subtype I-A CRISPR-Cas system.

A viral ADG inhibits a host toxin-antitoxin system

While most of the 116 ADG families contain exclusively viral genes, ADG.17 (arCOG10132) and ADG.51 (arCOG03737) are not only conserved among viruses infecting diverse hosts, but also are homologous to some host proteins (Supplementary Data 1, Fig. 5A). Homologs of ADG.17 are encoded in the genomes of Usarudiviruses SIRV7, SIRV8, SIRV9 and SIRV10, and ATSV, SMV1, SMV3, SMV4 and show high similarity to a small protein encoded in several members of the order Sulfolobales (Fig. 5A and Supplementary Fig. 9A)44,53,54. The ADG.17 homologs, identified by PSI-BLAST and confirmed by AlphaFold255 structural modeling, are encoded in defense islands or integrated plasmids in archaeal genomes (Fig. 5A). A comparison of structural models suggests that these proteins are antitoxins of the Phd (prevents host death) family (Fig. 5B). These genes form convergent gene pairs with genes encoding toxins of the Doc (death on curing) family, known to be kinases inactivating elongation factor Tu56 (Fig. 5A). Phd and Doc jointly comprise a type II toxin-antitoxin (TA) system which has been well studied in bacteria and phages, but not in archaea57,58. As such, ADG.17 appears to be the antitoxin for the Doc toxin in archaea, which remains to be studied experimentally.

Fig. 5: ADG.17 (arCOG10132 domain protein) inhibits host Phd-Doc toxin-antitoxin system.
figure 5

A Conservation of arCOG10132 domain antitoxin (blue arrows) among archaeal viruses (single gene) and hosts (as part of a two component gene system, Phd-Doc). Doc genes are indicated by red arrows. B Alphafold2 predicted structure of the viral antitoxin homolog SMV1 gp44. C Structure of SMV1 gp44 (shown in yellow) superimposed on the predicted structure of Phd-Doc (SIL_0730-SIL_0731, shown in deep teal and green respectively). Residue Asp39 conserved among the antitoxin homologs, Lys81 and Arg82 conserved within the toxin homologs. D Transformation efficiency in S. islandicus of plasmids encoding Doc toxin (SiL_0731) or Phd antitoxin (SiL_0730), viral homolog of the Phd antitoxin (SMV1 gp44) or combinations of toxin and antitoxin including mutants carrying nonsense mutations generating premature stop codons (*). Data shown are mean of four biological replicates, represented as mean value (center line) ± SD. Logarithm of zero is not possible to plot, hence two values are not shown. A two-tailed unpaired t-test of the transformation data was used to calculate P values in relation to the transformation data of SiL_0731; ***P < 0.001, ns – not significant; P values are <0.0001 (SMV1gp44), <0.0001 (SiL_0730), <0.0001 (SMV1gp44/SiL_0731), <0.0001 (SiL_0730/SiL_0731), 0.1335 (SMV1gp44*/SiL_0731), 0.0671 (SiL_0730*/SiL_0731), <0.0001 (SMV1gp44/SiL_0731*) and <0.0001 (SiL_0730/SiL_0731*). Source data are provided as a Source Data file. E Pulldown of the untagged toxin (SIL_0731, lanes 4-8) with histidine tagged antitoxins, SMV1 gp44 (lanes 5-6) and SiL_0730 (lanes 7-8) as bait. Lane 1 – protein ladder. Uninduced and IPTG-induced SiL_0731 cultures are shown in lane 2 and lane 3, respectively. Representative image of two independent experiments, source data provided as a Source Data file.

To further validate our approach for ADG identification, we characterized one of the viral ADG.17 family proteins, SMV1 gp44. First, we performed detailed analysis of the toxin encoded in the host genome, SiL_0731, and identified a characteristic Fic domain motif HXFX(D/E)(A/G)N(G/K)R, an essential adenylation component59 (Supplementary Fig. 9B). The structures of the host toxin/antitoxin pair SiL_0730/SiL_0731 and SMV1 gp44/SiL_0731 complexes were modeled using the protein structure and complex prediction program, AlphaFold255. Considering the amino acids conserved between the homologs, we can conclude that inhibition of toxin adenylation by both the cognate antitoxin and the virally encoded homolog occurs through residue D39. This residue interacts with the toxin residues K81 and R82 that are part of the Fic domain motif (Fig. 5B, C and Supplementary Data 3).

Next, we constructed plasmids encoding the S. islandicus LAL14/1 Phd-Doc pair SiL_0730/SiL_0731, or either of the two genes, under the transcriptional control of identical arabinose promoters. A similar plasmid with SiL_0730 (Phd) substituted for SMV1 gp44 was also constructed. As expected, the plasmid encoding the toxin SiL_0731 alone showed a much lower (3 – 4 orders of magnitude) transformation efficiency compared to that of the plasmid encoding both SiL_0730 and SiL_0731. However, the plasmid carrying SiL_0731 and SMV1 gp44 showed a transformation efficiency comparable to that of the plasmid carrying the toxin-antitoxin pair, suggesting an inhibitory effect of the viral small protein on SiL_0730. Subsequently, to identify the nature of the antitoxin (RNA or protein), we introduced point mutations causing premature translation termination of SMV1 gp44 or SiL_0730. This resulted in a drop in transformation efficiency to levels similar to that of the plasmid encoding the toxin alone, indicating that a protein product of SMV1 gp44 or SiL_0730 is necessary to inhibit the toxicity of SiL_0731 (Fig. 5D).

To investigate a possible direct protein-protein interaction, we performed pull-down assays using histidine tagged SMV1 gp44 or SiL_0730 as the bait, and E. coli cell lysate containing heterologously expressed toxin SiL_0731 as prey (Fig. 5E, panels 2 and 3). The toxin copurified with the antitoxin in both cases (Fig. 5E, panels 5–8) confirming direct interaction between Doc and its antitoxin partners.

Homologs of ADG.51 are encoded by Hoswirudivirus ARV3 (ARV3) and Acidianus two-tailed virus 2 (ATV2), and like ADG.17, show high similarity to a host encoded small protein in Sulfolobales, e.g. SiRe_2374 (Supplementary Fig. 10A). SiRe_2374 is encoded in a putative operon together with SiRe_2373, a fusion protein consisting of an N-terminal AAA ATPase domain and a C-terminal PD-DExK superfamily nuclease domain (Supplementary Fig. 10A, Supplementary Data 4). ADG.51 and its homologs are predicted to be wHTH type transcription factors, AlphaFold analysis and a subsequent DALI revealed similarity to transcriptional regulatory proteins of Multiple antibiotic-resistance Repressor (MarR) family (Supplementary Data 4, Supplementary Fig. 10B)60. The host proteins homologous to ADG.51 consist of two wHTH domains, whereas the viral homologs contain only the N-terminal wHTH domain, with an additional a-helix. Co-localization of a DNA binding protein and a nuclease as part of a two-gene cassette is typical of type II toxin-antitoxin systems. Hence, we can hypothesize that SiRe_2374/SiRe_2373-like host gene pairs might be a novel type of anti-viral immunity and viruses encode ADG.51 proteins to antagonize the immunity.

Taken together, these results indicate that SMV1 gp44 and its homologs in other viruses can act as antitoxins, Phd, to inhibit a host toxin, Doc. Upon infection, the viral Phd antitoxin could efficiently inhibit a possible abortive infection system, Phd-Doc, encoded by their Sulfolobales hosts. Furthermore, homologs of another host protein, co-expressed with AAA+ ATPase/endonuclease domain containing protein and highly conserved among Sulfolobales, is found in ARV3 and ATV2 possibly acting as an antagonist of host anti-viral immunity.

Prompted by these findings we searched for other proteins encoded by viruses of Sulfolobales with significant similarity to host proteins, focusing on small proteins as potential antitoxins (see Methods for details). This search revealed 114 protein families with at least one hit to Sulfolobales genomes including ADG.17 and ADG.51 (Supplementary Data 5). We further examined their genomic context and identified two additional protein families that are consistently encoded next to other small proteins and potentially form toxin-antitoxin gene pairs (Supplementary Fig. 11). One of these families (arCOG07934) is represented in Usarudivirus SIRV8 (YP_009362697.1) and Usarudivirus SIRV9 (YP_009362574.1) and is distantly related to AbrB family of DNA binding proteins (Supplementary Fig. 11A), which function as antitoxins in many known toxin-antitoxin systems61,62. Furthermore, in several host genomes, proteins of this family are encoded next to RelE toxins (Supplementary Fig. 11A). Notably, knockout of arCOG07934 homolog (M164_2845) was found to be lethal in Sulfolobus islandicus63, most likely, due to the activation of the corresponding toxin. Thus, arCOG07934 proteins are confidently predicted to function as antitoxins of RelE toxins. The second example involves proteins of arCOG08091 which are encoded in SIRV1 and SIRV2, and in several Sulfolobales genomes encoded next to proteins from arCOG07288 (Supplementary Fig. 11B). Both proteins are specific for Sulfolobales and do not share any sequence or structural similarity with known proteins. Thus, we hypothesize that arCOG07288-arCOG08091 genes comprise a TA system in which arCOG07288 is the toxin and arCOG08091 is the antitoxin. Alphafold2 modeling of the complex between these proteins suggests that the N-terminal region of the predicted antitoxin forms a beta strand antiparallel to and packing against the C-terminal beta strand of the toxin (Supplementary Fig. 11B). Accordingly, the virus-encoded antitoxin homologs are predicted to inhibit the toxin during infection when the host antitoxin is degraded.