Molecular evolutionary analysis of the SARS-CoV-2 through the mutation analysis of Spike, Envelope and RdRp proteins

COVID-19 pandemic, caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), was declared in 2020 by the World Health Organization. New mutations have been identified, leading to various variants of this virus, including Alpha, Beta, Gamma, Delta, and Omicron, which are classified as variants of concern (VOCs) and have raised considerable concerns for global public health. Such constant spread and changes in the genome of the virus require continuous monitoring. This research focuses on the evolution of SARS-CoV-2 through a detailed presentation of the viral genome, protein structure and interpretation, with the presentation of phylogenetic characteristics and patterns. We obtained the sequence data from the European region focusing on the S, E, and RdRp proteins from the publicly available NCBI database. We next used the MEGA11 package to generate the multiple sequence alignments and create phylogenetic trees. The SWISS-MODEL server was connected to the Protein Data Bank to obtained tertiary structure images of all the proteins presented in the paper. Stability studies of obtained mutations were performed via MUpro online tool. The results indicate a substantial impact of the Omicron variant relative to others, particularly concerning the alterations and mutations observed in the spike (S) protein, which is crucial in the infection process.


Introduction
At the end of 2019, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) rapidly spread globally, resulting in two million deaths within the first 12 months.During this period, a significant health crisis emerged, leading to widespread adverse consequences on a global scale (Kapoor et al., 2023;Yang and Rao, 2021).In the last twenty years, three coronaviruses (CoVs) have been identified, and the SARS-CoV-2 pandemic has had especially detrimental consequences (Sallard et al., 2021).The reservoir for the coronavirus evolution is a group of host mammals consisting of about 500 species (Cong et al., 2020;Sallard et al., 2021;Zhao et al., 2022).Important for the molecular evolution of these viruses, it is worth noting that the genomes of all members of the Coronavirus family are relatively long (Gomari et al., 2023).The first sequence of the SARS-CoV-2 virus was reported in GenBank on 11th January 2020 (MN908947.3)(Rahimi et al., 2021).GenBank (https://www.ncbi.nlm.nih.gov/sars-cov-2/) has collected 8,723,875 nucleotide sequences (accessed 8th February 2024).Genome sequencing has been systematically implemented since the onset of the pandemic and remains critically important due to the continuous emergence of new variants and strains.Consistent with this, proper data collection, processing, and adequate genome analysis methods are required (Kandeel et al., 2021).The phylogenetic approach enables the differentiation and identification of a range of phenomena, including different viral variants (Attwood et al., 2022).The classification of SARS-CoV-2 into Variants of Interest (VOIs) and Variants of Concern (VOCs) provides a straightforward method for differentiation.VOIs are primarily characterized by mutations that alter the virus's properties (Figure 1).VOCs are generally monitored more closely because they significantly change the severity of the virus, its spread, and, therefore, the effectiveness of vaccines (Souza et al., 2022).The evolution and progression of a pandemic are influenced by the interplay between the pathogen, the host, and the environment in which they (co-)exist (Pinato et al., 2022).In this research, we studied the SARS-CoV-2 describing the entire genome with the analysis of structural, non-structural and accessory proteins using in silico methods and detailed literature search.For phylogenetic analysis, we selected two structural proteins, Spike and Envelope proteins, as well as RdRp, a non-structural protein with a key role in virus replication.We chose the Spike protein because it has the highest mutation rate, resulting in the questionable effectiveness of vaccines that was discussed in the scientific community repeatedly with each new VOC emergence (Akkız, 2022;Hamdy et al., 2023).RdRp was chosen because many PCR tests target this gene, so changes can be reflected in the occurrence of false negative tests (Valadan et al., 2022).Studies related to viral pathogenesis, infection and function of E protein are scarce, when compared to other proteins that are in more detail researched and explained (Zhou et al., 2023).Additionally, this gene is not known for a high mutation rate, thus representing the lower end of the mutation rate spectrum in the SARS-CoV-2 genome.This comprehensive genomic approach can enhance our understanding of the evolutionary processes driven by various mutations, as such knowledge is applicable to pandemic monitoring, the design of detection tests, and pharmacological applications.

Material and methods
The data filters used in the NCBI Virus database search were set as: the names of proteins (S, E and RdRp), geographic region: Europe, isolation source: oronasopharynx, and nucleotide completeness: protein.For the Pango lineage filter, we chose VOCs, with the addition of XBB.1.5(nicknamed Kraken) that was contributing to increased number of reported cases, causing 49.1% of COVID-19 cases in January 2023 (Abdolreza et al., 2024;Scarpa et al., 2023), as well as BQ.1 or Cerberus, the BA.5 sublineage, due to its wide prevalence, especially in January 2023 (Scarpa et al., 2023).Sample collection date was considered as the reporting date of the genome.Samples were chosen from the European region from the three periods: 1st September 2020 -1st April 2021 31st August 2021 -1st April 2022 31st August 2022 -1st April 2023 We originally designed these periods because of the dynamics of spread of VOCs and peak of infections, in addition to the autumn and winter periods also being characteristic for flu and other respiratory infections.
A total of 121 nucleotide sequences were obtained for E protein, including the RefSeq Wuhan sequence, while for the analysis of S protein, 138 nucleotide sequences (including the RefSeq from Wuhan) were used.In collecting data on the RdRp molecule, due to specificities of the polypeptide processing and generation of mature protein, we used amino acid sequences and obtained 42 available samples, including the RefSeq entry.Genomes or sequences downloaded in FASTA file format were used in the Molecular Evolutionary Genetics Analysis Version 11 (MEGA11) package for alignment using default parameters (Kumar et al., 2018) for MSA with the ClustalW implementation (Gap Opening penalty 10.00 and Gap Extension Penalty 0.20 for multiple alignment).Protein sequences were aligned with the reference SARS-CoV-2 protein sequence from Wuhan (Wuhan -Hu-1 genome, GenBank accession NC_045512), using the E protein information on https://www.ncbi.nlm.nih.gov/protein/YP_009724392.1,and S protein on https://www.ncbi.nlm.nih.gov/protein/179631859.Considering the manual selection of sequences and the absence of frequent or longer gaps in the viral gene or protein sequences, changes in alignment parameters did not produce significant differences in terms of genetic distances calculated for the MSA.When designing the phylogenetic tree, we set the bootstrap value to 1000 (Hall, 2013), using the neighbor-joining (NJ) method to design a cladogram.We have deployed UPGMA tree building method, as another distance-based alternative to the NJ method and no significant differences in tree topology were observed.In addition, more computationally intensive methods, including maximum parsimony and maximum likelihood were implemented, producing congruent topology to the one presented here.Combined with bootstrap values over 70%, these results confirmed well-supported NJ tree topology, which remained the method of choice considering its simplicity, speed and low computational intensity.The tree in Newick format from the MEGA11 program was further edited and visualized by iTOL (https://itol.embl.de/)(Letunic and Bork, 2016).In order to perform protein structure visualization, the coordinates were downloaded directly from the Protein Data Bank (PDB), connected to the SWISS-MODEL server instead of performing de novo modeling of the protein structure, considering widely available high-quality imaging and annotation data for the SARS-CoV-2 proteome.We used UCSF Chimera to visualize selected molecules to show the structure of the proteins that make up the virus.The following structures of SARS-CoV-2 three-dimensional molecules with their PDB numbers were retrieved from the RCSB Protein Data Bank (https://www.rcsb.org/) in PDB format (Berman et al., 2000): Envelope protein (PDB ID: 7K3G), Spike protein (PDB ID: 7QUS), Membrane protein (PDB ID: 7VGS), N protein (PDB ID: 6YUN), Nsp1 (PDB ID: 8AZ9), Nsp2 (PDB ID: 7MSW), Nsp3 (PDB ID: 6WX4), Nsp4 (PDB ID: 7LZY), Nsp5 dimer (PDB ID: 6M2N), Nsp7 (PDB ID: 7LHQ), Nsp8 (PDB ID: 6YHU), Nsp9 (PDB ID: 6W4B), Nsp10 (PDB ID: 8OV4), Nsp12 (PDB ID: 6YYT), Nsp13 (PDB ID: 7NNG), Nsp14 (PDB ID:7TW9), Nsp15 (PDB ID: 6WLC), Nsp16 (PDB ID: 8OV4), and Protein complex Nsp10 and Nsp16 (PDB ID: 8OV4).All of them were visualized by cartoon presentation using UCSF Chimera 1.16 (Pettersen et al., 2004).In order to obtain information on the effect of detected genetic variants on the protein stability, we used online tool MUpro (Cheng et al., 2006), whereby plain reference protein sequence of each S, E and RdRp protein was entered.We indicating all previously detected variable positions and the exact amino acid change, and obtained the results on the effect of specified changes.

ORF1ab and its polyproteins
The nonstructural proteins (Figure 2) of the virus (Nsp1-16) are created after proteolysis of the polypeptides pp1a and pp1ab by viral protease (Frazier et al., 2022).
Non-structural proteins of SARS-CoV-2 play a pivotal role by being involved in viral replication processes and replication transcription complex (Yan et al., 2020; Supplementary Table 1), and through their role in suppression of immune system or feedback of the host organism (Prasada et al. 2023).

Structural proteins of SARS-CoV-2
Genes that encode for structural proteins (Figure 3) are found on the 3'-UTR, along with eight accessory genes (Naqvi et al., 2020).SARS-CoV-2 Spike protein is in the group of fusion viral proteins (Santopolo et al., 2021) that has a vital role in the infection phase (Brintnell et al., 2021) and consists of two functional units labeled S1 and S2 (Kumavath et al., 2021).This protein plays a key role in viral entry into the host cells (Ghorbani et al., 2021).Glycosylation, phosphorylation, and acylation are some of the post-translational modifications (PTM) usually found on the Spike protein, following successful host infection (Cheng et al., 2023).ACE2 is the primary receptor for Spike protein; their joining enables endocytosis or virus fusion inside the cell.In addition to the ACE2 receptor, other host proteins can serve as a receptor for coupling S proteins to the host, such as Neurophilin-1 (Kumavath et al., 2021;Yevsieieva et al., 2023).The RBD located on the Spike protein enables the recognition of hACE2, ensuring the penetration of the virus into the host (Yevsieieva et al., 2023).Within the RBD is a characteristic N501Y mutation that facilitates binding to the host, as well as E484K that was identified in Europe and America (Akkız, 2022).
ORF4 gene encodes for E protein (Yoshimoto, 2020), that contains from 76 to 109 amino acids, ranging from 8.4 to 12 kDa in size (Schoeman and Fielding, 2019;Zhou et al., 2023).E protein is an integral and short hydrophilic protein (Schoeman and Fielding, 2019), involved in the life cycle of SARS-CoV-2 from several perspectives (Cao et al., 2021).E protein belongs to the group of viroporins, that are regulating microenvironment of the host cell, including the pH and ion concentration (Bai et al., 2022;Cao et al., 2021;Schoeman and Fielding, 2019).It is also involved in viral assembly, release and pathogenesis (Yoshimoto, 2020).Through experimental efforts, it was reported that E protein contains parts subject to glycosylation (N48 and N66 parts) and it is known that through this process, the Spike protein deftly resists "obstacles" and thereby preserves infectivity.The M protein shapes the viral appearance and is the most abundant viral protein (Schoeman and Fielding, 2019).It is encoded by ORF5 gene (Yoshimoto et al., 2020), contains 222 amino acids (Bai et al. 2022), and plays a role in binding to the nucleocapsid (Khan et al., 2020) and determining the shape of the viral envelope (Bai et al., 2022).Viral particle assembly is enhanced after succinylated N and M proteins.ORF9b gene encodes for Nucleocapsid phosphoprotein (N protein) (Cheng et al., 2023;Redondo et al., 2021), which is crucial for capsid formation (Schoeman and Fielding, 2019), in addition to being an antigen for virus detection (Rak et al., 2023).It comprises 419 amino acids (Bai et al., 2022) and is an antagonist of the host immune response (Yang and Rao, 2021).Back in 2020, it was already known that N parts might be detected in infected cells.This protein has a decisive role in the transfer of material to the progeny of the virus (Cong et al., 2020), but also promotes hyperinflammation and protects the viral genome (Bai et al., 2022) packed into a helical ribonucleoprotein formed by the N protein (Low et al., 2022).The N protein is less prone to mutations and its higher sequence conservation make it an excellent diagnostic target (Yan et al., 2022).Beta and Omicron strains contain a mutation on the N protein, a P13L/S substitution, which is unique to these strains.It is believed that this mutation, as well as P13L, ERS31-33del, D63G, R203K/M, G204R, and D377Y occurred because the virus escaped the immune response of the host (Rak et al., 2023).Stabilizing mutations characteristic of the N protein are S194L, D103Y, P13L, S197L, M234I, and S188L (Jacob et al., 2021).
The last third of the viral genome codes for structural and accessory proteins (Wilson et al., 2022), involved in viral pathogenesis (Redondo et al., 2021).The accessory proteins include ORF3a, ORF3b, ORF6, ORF7a, ORF7b, ORF8, ORF9b, ORF9c, and ORF10 (Zhou et al., 2022) Wolf et al., 2023).By 21st January, 2024, there were 774,394,829 confirmed cases of COVID-19 (https://covid19.who.int/),compared to 27th January 2023, when 752,517,552 cases of COVID-19 were confirmed, alluding that the changes are still present and that the statistics is of great importance in monitoring the virus even after the official declaration of the end of the pandemic.One of the reasons for such accelerated viral mutation accumulation is the lack of the viral genome proofreading activity of the virus, as well as potential effects on the overall protein stability following the appearance of the initial mutations (Papanikolaou et al., 2022).In this study, we analyzed the SARS-CoV-2 sequences of the S, E, and RdRp proteins, observed mutations and their phylogenetic relationships, as well as their effect on the protein stability.The results indicate evolutionary changes and provide information relevant for future monitoring and related research.MUpro analysis of mutation effect showed that the majority of observed changes act towards decreasing the protein stability (Table 1).Mutations were detected using the MEGA11 package, and their correctness was verified through literature review, followed by comparisons and commentary as appropriate.

Spike protein
In this study, we detected 95 Spike glycoprotein mutations (Supplementary Table 2).The available evidence suggests that significant Spike protein mutations include E484K, K417T/N, N501Y, and D614G (Giovanetti et al. 2021;Mansbach et al. 2021).The N501Y mutation, located on the RBD, enhances the coupling of the ACE2 receptor and undoubtedly facilitates its entry into the host (John et al., 2021).This mutation is present in all VOCs except the Delta variant (Aleem, 2022), reducing vaccine efficacy (Saifi et al., 2022).Further evidence supports that N501Y and D614G mutations combined increase viral replication rate (Akkız, 2022).N501Y also corroborates the highest binding affinity of RBD in all VOCs, followed by E484K (Shahhosseini et al., 2021).
Convergent substitutions (R346T, K444T, L452R, N460K, and F486V) are characteristic of the BQ.1 variant (Ito et al., 2023).We found the mutation K444T in BQ.1, BQ.1.1,BQ.1.1.18,as well as R346T and L452R in Delta samples (Supplementary Table 2).The S375F mutation characterizes the Omicron BA.1 variant and thus affects the speed of Omicron spread due to increased binding affinity to hACE2.Mutations S371L, S373P, and S375F were not detected in earlier variants before the appearance of BA.1 (Kimura et al. 2022).In this study, we detected S371L in Omicron samples, in addition to the rare mutation Q493R which we found in Omicron and Gamma S protein sequences.This position is critical because of its role in interaction with ACE2, that is, increased binding affinity to hACE2 in case of mutant amino acid being present (Alkhatib et al., 2022).Specific mutants circulating worldwide (such as V367F) may affect splicing (Gusev et al. 2022; however, we did not detect this mutation. According to Varghese and Stanislaus (2023), N501Y, K417N, D614G, and T478K are the base mutations characteristic of S protein Omicron variants.Our results indicated similarity compared to their findings regarding Omicron mutations characteristic for S protein.Delta Spike mutations we detected, in line with the literature, are L452R and D614G, along with the Gamma mutations K417T, E484K, N501Y, and D614G.In addition, the P681H characteristic for the Alpha variant, was found in the Gamma variant in our research.Deletions that affect the viral dynamics by increased infectivity are primarily characteristic of the S1 non-RBD region, namely: ΔN69/ΔV70/Δ156/Δ157/ΔΥ144/ΔL242/ΔA243/Δ L244 (Papanikolaou et al., 2022).Our evidence supports that mentioned mutations are present in our data except for ΔL242 and ΔA243.Further evidence (see Supplementary Table 2), supports that Y144del was detected in Omicron lineages.The analysis included protein sequences of the S protein from 137 samples, along with the reference genome from Wuhan (n=138) downloaded from the NCBI database, whereby the phylogenetic tree was rooted at the RefSeq from Wuhan.Our goal was to highlight the differences between the VOCs and their distribution within the phylogenetic tree, revealing clusters that emerged due to various mutations, particularly numerous in the S protein.Notably, Omicron forms a monophyletic clade among the other variants in this analysis.The genomic sequence of the SARS-CoV-2 S protein is of significant interest due to its functional importance and its high susceptibility to mutations, which greatly influence the mosaic properties of the virus.

Envelope protein
We reported four mutations (Supplementary Table 3) in the sequences included in the analysis.Variant P71L was present in a total of nine samples of the Beta lineage, while V62F was observed in the B.1.617.2 lineage.According to the global research, T9I and P71L mutations were identified as the most frequent E protein changes, followed by V62F, L21F/V and V58F (Abavisani et al. 2022).In our study, we did not detect L21F/V and V58F, while S50I was detected in our results, despite being not so often cited as an important E protein mutation.(red).Final tree was visualized using iTOL (Letunic & Bork, 2016).
Figure 5. Phylogenetic tree of the Envelope protein constructed by the NJ method with 1000 bootstrap iterations, colored by the encompassing VOCs, Alpha, Gamma and Delta in dark purple, Beta in lilac, and Omicron in yellow.The final tree was visualized using iTOL (Letunic & Bork, 2016).Figure 6.NJ-based phylogenetic tree of RdRp, supported by 1000 bootstrap iterations.The tree is rooted with RefSeq entry (red highlight).The regions colored in green and blue indicate the samples where most mutations were found, including M196R, P323L, G671S, and C645G.Final tree was visualized using iTOL (Letunic and Bork, 2016).
The phylogenetic analysis of the E protein revealed that the Omicron variant has the highest total number of mutations compared to the other VOCs.

RdRp
The leading and characteristic mutations of RdRp reported for Omicron are A1892T, I189V, P314L, K38R, T492I, and V57V (Bansal and Kumar, 2022).Subsequent changes in RNA editing and RNA replication errors are the main reasons for the emergence of further mutations (Wang et al., 2023), such as P323L caused by the 14408C>T mutation on RdRp, which results in high mutation rates throughout the rest of the viral genome, despite being predicted to increase the protein stability.In addition, mutations A97V and A185V can alter the secondary structure of a protein (Rahimi et al., 2021).It was also reported that P323L and A185V could eventually affect the secondary structure of the Nsp12 protein (Vilar and Isom, 2021).The critical mutation for the B.1 lineage is P323L (P4715L) (Jacob et al., 2021), which is also the only RdRp mutation frequently reported in the literature which we also identified in this study.Three other mutation that we identified (Figure 6, Supplementary Table 4) are not reported in literature, but are predicted to decrease the protein stability (Table 1).

Discussion
Evolutionary changes and the overall development of the virus are monitored by the viral genome research (Souza et al., 2022).Omicron (B.1.1.529)contains a major part of its modifying mutations in the RBD region of S glycoprotein (Han et al., 2022).The Omicron variant was reported to include 32 known mutations on the Spike protein, such as 69-70 del, K417N, T478K, N501Y, and P681R (Akkız, 2022); 15 substitutions are also reported in another work (Han et al., 2022); and this variant is otherwise known for containing increased number of mutations when compared to previous variants (Akkız, 2022).The appearance of the XBB.1.5'Kraken' variant, suppressed previously common variants, mainly due to its increased affinity towards the hACE2 receptor, marked by the F486P mutation on the Spike protein (Parums, 2023).However, the results of the binding analysis of the XBB.1.5variant showed that it is not a highly threatening variant for global health, compared to the previous variants (Scarpa et al., 2023).EG.5 (Eris) was first reported by the WHO on 17th February 2023, and was declared VOI on 8th August 2023.Eris is a subvariant of Omicron, and during July and August of the same year, a more significant rise in the number of new infections was recorded mainly due to this variant.The uniqueness of the Eris subvariant is reflected in the F456L mutation (Parums, 2023).In addition, new subvariant of Eris, EG.5.1 was described with new S protein mutation Q52H (Bilasy et al., 2024).Later, in September 2023, a new variant, BA.2.86 (or Pirola), was recommended for monitoring and was described as more resistant to monoclonal antibodies (mAbs) having ~35 mutations distinct from XBB.1.5(Qu et al., 2023).BA.2.86 was also described as stable mutational variant, with higher affinity for receptor interactions, giving it a potential for increased severity and transmissibility (Roy, 2023).It was noted that the combination of N501Y, E484K, and K417N/T mutations could significantly affect the resistance and sensitivity to the currently designed therapy to control the infection (Akkız, 2022).All future approaches in the research of the new mutations are important because any failure can affect the research of new variants, leading to misinterpreting the findings or encountering unwanted challenges (Burns et al., 2022).Based on protein research, scientists can design novel drugs or proceed towards targeted therapy (van de Leemput and Han, 2021).Sharing data in science is crucial, especially during pandemics, as it forms the foundation for all health-related measures both locally and globally (Pan et al., 2021).S, E, and RdRp are essential proteins; their mutations have high information content, and our work aimed to present the diversity of identified mutations.E protein is known to be highly conserved in terms of its nucleotide and protein sequence; contrary, the S protein accumulates mutations much faster, while RdRp is an essential and complex protein.We did not include the Nsps in the analysis because the mutations are of low frequency and are persistent; they do not offer much information in terms of drug design, epidemics, and immunology.Also, accessory proteins need to be more researched, since available information is still speculative and under further testing and validation.Although E and RdRp proteins are known to mutate poorly, their mutations and the detection of new mutations significantly affects the detection of the SARS-CoV-2 virus since these two genes are common targets of PCR diagnostic tests.A rapid and accurate diagnostic strategy is the key link for managing the pandemic and monitoring the movement of the virus, which is a very important implication for future pandemics as well (Bilasy et al., 2024).

Conclusion
The results from our study provide evidence of the rapid evolution of the SARS-CoV-2 virus, particularly in certain genomic regions, which is significant for predicting and comparing virus variants.This approach is also used in forecasting the course of a pandemic, developing optimal immunization and treatment strategies, and monitoring animal reservoirs of the virus (Cheng et al., 2023).Our findings confirm that during the vaccination period, selective pressure favored the viral evolution, representing an adaptive response (Duerr et al., 2023;Rando et al., 2023).This led to a significant increase in the number of new variants and mutations detected.Consequently, while symptoms have become less severe, the virus continues to modify its genetic code to maintain its presence in the host population for an extended period.

Figure 2 .
Figure 2. Non-structural proteins of SARS-CoV-2 were obtained through the SWISS-MODEL project, whereby we utilized protein sequences and monitored the GMQE score.Visualization was performed using UCSF Chimera.The PDB ID of the protein in the image corresponds to the closest template used for modeling by connecting the SWISS-MODEL work directly to PDB.

Figure 3 .
Figure 3. Structural proteins of SARS-CoV-2 were generated using the SWISS-MODEL, whereby the PDB templates with the best GMQE score were used for modeling.

Figure 4 .
Figure 4. Distance-based NJ phylogenetic tree of the Spike protein sequences composed of all VOCs representing the evolutionary course of SARS-CoV-2 spread.Each of the variants is represented as a subclade.Tree was rooted at the RefSeq from Wuhan highlighted as red branch.The following branches are presenting different VOCs: Alpha (green), Beta (pale orange), Gamma (light blue), Delta (lilac) and Omicron(red).Final tree was visualized using iTOL(Letunic & Bork, 2016).