Project 5
Project 5: Identification of genetic regulatory transcriptional networks in smooth muscle cell plasticity
Tobias Marschall (HHU), Mete Civelek (UVA), and Clint Miller (UVA)
Background and preliminary work: Coronary artery disease (CAD) is the leading cause of death in the Western world 1. Heritability estimates for CAD range from 40% to 70%, suggesting a strong genetic contribution to disease pathology 2. Smooth muscle cells (SMCs), which make up the medial layer of arteries, play key roles in the integrity of the vessel wall, blood pressure regulation, and atherosclerosis initiation and development. Studies in mice and humans demonstrate that SMCs can undergo extensive phenotypic changes in atherosclerosis, for example acquiring characteristics of macrophages, and there is compelling evidence that SMCs can play beneficial or detrimental roles in lesion pathogenesis depending on the nature of their phenotypic plasticity 3-5. Thus, identifying the genetic determinants of SMC gene expression is crucial for understanding the biological significance of CAD-associated genetic variants functioning in SMCs and their phenotypic transitions. More specifically, we hypothesize that genetic variability, and structural variants (SVs) in particular, are major determinants of SMC gene expression and cellular transition states and thereby modulate the potential for plastic remodeling.
Genome-wide association studies (GWAS) have identified 321 loci for increased CAD risk 6-8, and we previously showed that 94% of CAD-associated genetic variants are in non-coding regions 9. While this suggests that sequence variations that affect gene expression account for a substantial fraction of the genetic susceptibility to CAD 10, the precise molecular mechanisms remain mostly unknown. Identifying these mechanisms could point to new treatments, but determining causal variation from GWAS signals is challenging, particularly for non-coding variation. To address this, systematic gene expression studies performed in cells and tissues relevant to CAD in human populations are needed to pinpoint the regulatory mechanisms of disease susceptibility 11.
Besides the challenge of obtaining appropriate samples in which functional studies can be performed, the hunt for causal variants is also hampered because most GWAS are restricted to single nucleotide polymorphisms (SNPs), while other variants remain unascertained. This particularly applies to structural variants (SVs), defined as genomic differences of 50 base pairs and larger. A typical human genome harbors around 28,000 SVs 12 and, even though they are far fewer in number, SVs account for more variable base pairs per genome than SNPs by virtue of their length 13. Dr. Marschall’s laboratory has extensive expertise in algorithm development for the analysis of SVs and has pioneered methods for haplotype-resolved genome assembly using long-read sequencing data as a tool for SV characterization 14. Interestingly, a sizeable subset of SVs discovered by this approach can be reliably genotyped from short-read data using our method PanGenie 15, in particular when combined with new pangenome-reference resources we developed as part of the Human Pangenome Reference Consortium (HPRC) 16, 17. Applying this method, we discovered 1,525 SV-associated expression quantitative trait loci (SV-eQTLs), including a 1,069 bp deletion with lead SV-eQTL status targeting the LIPI gene in lymphoblastoid cell lines, a GWAS disease locus for cardiac failure 18. Such initial findings highlight the need for and the potential of taking SVs into account when studying cardiovascular diseases. However, this study was restricted to the Geuvadis data set, which provides gene expression data from lymphoblastoid cell lines 19. Hence, there is a need to expand such studies to investigate the associations between SVs and gene expression in tissues relevant to CAD.
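The 50 bp size threshold above can be illustrated with a minimal sketch. It assumes sequence-resolved, biallelic alleles given as plain strings; the function name and the "substitution" label are illustrative choices, not part of any tool named in this proposal, and SV types without explicit sequence (such as inversions) are not covered.

```python
# Illustrative sketch only: classify a sequence-resolved variant by the length
# difference between its reference and alternate alleles.
SV_MIN_LENGTH = 50  # SV size threshold stated in the text (50 bp and larger)


def classify_variant(ref: str, alt: str) -> str:
    """Return 'SNP', 'substitution', 'indel', or 'SV' for one allele pair."""
    if len(ref) == len(alt):
        # Equal-length alleles: a single-base change is a SNP; longer
        # equal-length changes get a generic label (hypothetical choice).
        return "SNP" if len(ref) == 1 else "substitution"
    size = abs(len(ref) - len(alt))
    return "SV" if size >= SV_MIN_LENGTH else "indel"


# Toy alleles (not real data): a 1,069 bp deletion written with an anchor base.
deletion_ref = "A" + "C" * 1069
deletion_alt = "A"
deletion_class = classify_variant(deletion_ref, deletion_alt)
```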
In preliminary studies, the Civelek lab isolated SMCs from the ascending aortas of a cohort of 151 multi-ethnic human heart transplant donors. To the best of our knowledge, this is the largest collection of vascular SMCs and a unique resource available to us. We used RNA-seq to measure the expression levels of ~18,000 protein-coding and long non-coding genes in cell culture conditions representing quiescent and proliferative states. We obtained the genotypes of 6.3 million SNPs using genome-wide genotyping followed by imputation. We performed association mapping and identified ~6,500 loci associated with gene expression. We then focused on the loci associated with CAD risk and identified 85 genes whose expression was associated with the CAD loci 20. These genes are predicted to play a role in SMCs and increase CAD risk, but deeper investigation of the causal genetic variants has been limited by the resolution of the microarray platform used for genotyping. In particular, SVs and indel variants have not been considered in any of these analyses.
Hypothesis: This project is motivated by the central hypothesis that the genetic risk of CAD is modulated through indels and structural variants. More specifically, we will test the following hypotheses:
We hypothesize that germline indel variants and SVs can be “lead eQTLs”, that is, they can be more strongly associated with gene expression at CAD loci in SMCs than previously identified SNP-eQTLs.
We hypothesize that, in these cases, the SVs are the best candidates for causal variants, and we will test this by analyzing epigenetic assays and performing functional CRISPR-based validations.
We hypothesize that SVs act specifically at enhancers/promoters of CAD genes in SMCs, and we will test whether they modulate the expression of transcription factors (TFs) with regulatory roles in SMCs.
Work program: For the cohort of 151 multi-ethnic human heart transplant donors available at the Civelek lab, we will isolate DNA and perform whole-genome sequencing (WGS) on the Illumina NovaSeq platform to 30x coverage at the West German Genome Center (WGGC), of which HHU is a part. Primary data analysis will be run by the Marschall lab and will include standard processing workflows for small variant calling as well as specialized pangenome-based analyses of structural variation. Specifically, we will run PanGenie 15 for genome-wide genotyping of structural variation and Locityper 21. We will use PanGenie in conjunction with the latest pangenome reference available from the HPRC (where Dr. Marschall is part of the Steering Committee), and expect to be able to detect more than 20,000 SVs per sample 16. Locityper complements this analysis by providing access to complex but medically relevant loci through targeted genotyping from WGS data 21. We will run haplotype phasing based on the resulting integrated call sets using SHAPEIT 22, 23. Phasing the sequencing data will allow us to identify which genetic variants are physically linked on the same chromosome copy and have therefore been inherited together from one of the individual’s parents. It thus provides a way to reconstruct the full sequence of the two haplotypes (maternal and paternal) of each chromosome. To validate the accuracy of our approach, we will perform long-read sequencing (HiFi) on the PacBio Revio platform to 30x coverage for six samples at the Biomedical Research Center (BMFZ) at HHU, which is part of the WGGC, followed by genome assembly with hifiasm 24.
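To make the role of phasing concrete, the following minimal sketch shows how phased genotypes determine the alleles carried by each of the two haplotypes. It assumes biallelic variants with VCF-style phased genotype strings such as "0|1"; the function name and the toy records are hypothetical, and the labels hap1/hap2 are arbitrary because assigning parental origin requires information beyond the phased calls themselves.

```python
from typing import List, Tuple


def haplotypes_from_phased(
    records: List[Tuple[str, str, str]]
) -> Tuple[List[str], List[str]]:
    """Split phased biallelic genotypes into the alleles of two haplotypes.

    Each record is (ref_allele, alt_allele, genotype), where genotype is a
    phased VCF-style string such as '0|1' (0 = reference, 1 = alternate).
    """
    hap1: List[str] = []
    hap2: List[str] = []
    for ref, alt, genotype in records:
        if "|" not in genotype:
            raise ValueError(f"genotype {genotype!r} is not phased")
        first, second = genotype.split("|")
        alleles = (ref, alt)
        hap1.append(alleles[int(first)])
        hap2.append(alleles[int(second)])
    return hap1, hap2


# Toy records (not real data): a SNP, a small insertion, and a homozygous SNP.
toy_records = [("A", "G", "0|1"), ("T", "TCC", "1|0"), ("C", "T", "1|1")]
toy_hap1, toy_hap2 = haplotypes_from_phased(toy_records)
```

In this toy case the insertion allele and the reference allele of the first SNP sit on the same haplotype, which is exactly the kind of linkage information that unphased genotypes do not provide.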
For these six samples, long-read transcriptome data (PacBio Iso-Seq), short-read transcriptome data (Illumina NovaSeq), as well as epigenome profiling data (ATAC-seq; H3K27ac, H2BK20ac, and H3K4me3 CUT&RUN) are available in the Civelek lab, allowing for particularly deep characterization of these samples, including the analysis of new isoforms in a haplotype-resolved manner.
The availability of full haplotype context is important because two genetically linked variants, one located in a regulatory region, such as a promoter, and the other in an exon, provide a tool to predict the molecular mechanism of gene expression 25. To identify these mechanisms, we will use data generated by the Civelek lab from cells cultured in two conditions representing the quiescent and proliferative states of SMCs. The distinct molecular mechanisms identified in these two conditions are an important tool for predicting how CAD-associated genetic variants affect SMC plasticity. Many data types are already available from preliminary studies, including RNA-seq, ATAC-seq, and H3K27ac, H3K4me3, and H2BK20ac ChIP-seq. We will focus our attention on loci already identified from these data by the Civelek lab, which include 84 expression QTLs and 164 splice QTLs in SMCs that coincide with risk loci identified in GWAS. Combining the new variant call sets derived from the WGS data generated as part of this project with the available RNA-seq data from SMCs, we will perform eQTL and sQTL analyses that extend to indels and SVs.
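The notion of a lead variant in such an eQTL analysis can be sketched as follows. This is a deliberately simplified illustration, assuming one gene, genotype dosages coded as 0/1/2, and an unadjusted Pearson correlation as the association measure; a real analysis would use a dedicated QTL mapping tool with covariates and multiple-testing correction. All variant names and values are hypothetical toy data.

```python
import math
from typing import Dict, List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def lead_variant(dosages: Dict[str, List[float]], expression: List[float]) -> str:
    """Return the variant whose dosage is most strongly correlated (by |r|)
    with expression. With equal sample sizes, the largest |r| corresponds to
    the smallest p-value in a simple linear regression."""
    return max(dosages, key=lambda v: abs(pearson_r(dosages[v], expression)))


# Hypothetical toy data for one gene across six donors.
toy_expression = [1.0, 2.1, 2.9, 1.1, 2.0, 3.2]
toy_dosages = {
    "snp_example": [0, 1, 2, 0, 1, 1],  # hypothetical SNP in partial LD
    "sv_example": [0, 1, 2, 0, 1, 2],   # hypothetical deletion
}
toy_lead = lead_variant(toy_dosages, toy_expression)
```

Here the SV dosage tracks expression more closely than the SNP dosage, so the SV is selected as the lead variant; this mirrors the situation in which a SNP-eQTL found by array genotyping merely tags an untyped SV.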
Since gene promoters and enhancers are located in open chromatin regions, the open chromatin information from ATAC-seq allows us to interpret the eQTLs and sQTLs we find for indels and SVs. That is, variants overlapping open chromatin regions are of special interest and would be excellent candidates for follow-up studies using functional assays. With haplotype phasing data available, we will also study haplotype-specific regulation. Two physically linked variant alleles, one displaying allelic imbalance in ATAC-seq data and one displaying allelic imbalance in RNA-seq data, indicate a cis-regulatory relationship. Finally, we will use TF binding motifs to identify the TFs whose binding may be affected by the genetic variants that display allelic imbalance in regulatory regions. These results will establish relationships between TFs and the expression of their target genes.
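One common way to quantify allelic imbalance at a heterozygous site is an exact binomial test of the read counts on the two haplotypes against an expected 50:50 ratio. The sketch below is an assumption about how such a test could look, not the method fixed by this proposal; function names, the significance threshold, and the read counts are hypothetical, and the simple implementation is meant for modest read depths only.

```python
from math import comb
from typing import Tuple


def binomial_two_sided_p(hap1_reads: int, hap2_reads: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for the observed haplotype read split.

    Sums the probabilities of all outcomes that are at most as likely as the
    observed one. Intended for modest depths; very deep sites would need a
    numerically more careful implementation.
    """
    n = hap1_reads + hap2_reads
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[hap1_reads]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


def is_imbalanced(hap1_reads: int, hap2_reads: int, alpha: float = 0.05) -> bool:
    """True if the haplotype read split deviates significantly from 50:50."""
    return binomial_two_sided_p(hap1_reads, hap2_reads) < alpha


def suggests_cis_link(
    atac_counts: Tuple[int, int], rna_counts: Tuple[int, int], alpha: float = 0.05
) -> bool:
    """True if both the ATAC-seq site and the linked RNA-seq site are imbalanced.

    Counts are (hap1_reads, hap2_reads) at two phased heterozygous sites.
    """
    return is_imbalanced(*atac_counts, alpha) and is_imbalanced(*rna_counts, alpha)


# Hypothetical read counts at two phased heterozygous sites in one donor.
toy_link = suggests_cis_link(atac_counts=(30, 5), rna_counts=(40, 10))
```

A fuller analysis would additionally check that the imbalance favors the same haplotype in both assays, correct for reference mapping bias, and adjust for multiple testing across sites.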