Computational Genomics Group

TADs in yeast and how you can go around a reviewer (if you are right)

4/20/2017

Just a few weeks ago we published a paper on Genome Urbanization, a concept describing the spatial clustering of genes in the genome of the unicellular eukaryote S. cerevisiae (you can read more about it here). One of the things we put forward in that paper was the existence of discrete topological domains in the yeast genome that strongly resembled the Topologically-Associated Domains (TADs) initially discovered in mammals and since found in most complex eukaryotes. We based our arguments on some rather clear topological boundaries that we were able to observe in HiC contact maps obtained from a widely cited public dataset (Duan et al, 2010). As you may see at the bottom of the Figure below (adapted from Figure 4D of our paper), one can locate boundaries between TAD-like domains even by visual inspection. To do so, we used an insulation approach (as described by Crane et al, Nature 2015) that is largely independent of the local read density. After defining such insulated regions we went on to show that genes that are up-regulated upon topological stress tend to cluster within them.
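For readers curious what the insulation approach boils down to, here is a minimal, hypothetical Python sketch of a Crane-style insulation score. The function name insulation_score and the simple square-window averaging are my own simplifications for illustration, not the exact pipeline used in our paper: the idea is that the score at each bin is the mean contact frequency between bins immediately upstream and downstream, so boundaries show up as local minima.

```python
import numpy as np

def insulation_score(contact_map, window):
    """Mean contact frequency in a square window sliding along the diagonal.

    contact_map : symmetric, binned (and normalized) HiC matrix.
    window      : number of bins considered on each side of a position;
                  with 1 kb bins, window=10 mimics the 10 kb window
                  mentioned in the text.
    Local minima of the returned score are candidate boundaries
    between insulated domains.
    """
    n = contact_map.shape[0]
    scores = np.full(n, np.nan)  # positions too close to the edge stay NaN
    for i in range(window, n - window):
        # Average only the contacts that *cross* position i:
        # upstream bins (i-window .. i-1) vs downstream bins (i+1 .. i+window)
        scores[i] = contact_map[i - window:i, i + 1:i + window + 1].mean()
    return scores
```

Because the score averages a fixed-size window of the normalized map, it reflects the relative depletion of crossing contacts rather than the absolute local read density, which is what makes this family of approaches robust to coverage fluctuations.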

But since TADs had not (at the time) been reported in yeast, one of the main criticisms we received during the review process was directed at this analysis. The reviewer's comment was:
"Authors declare the existence of TAD-like globules in budding yeast. However, these kinds of structures have, so far, never been detected in Saccharomyces cerevisiae. If the authors want to establish the existence of such TAD like structures, they must reinforce their analysis."

At the time, we were eager to get the paper accepted, so in the final version we toned down the original term "TAD-like" to "insulated domains". Figure 4D remained, though, and was further supported by a number of analyses that showed the robustness of the boundaries under different normalization strategies and the lack of LTRs in these regions. In our view, it mattered little whether we got the message of "TADs also exist in yeast" across, as our main interest was to show the tendency for spatial clustering of genes.
HiC contact maps for yeast chromosome IV. Top: Figure 1B from Eser et al. (PNAS, 2017). Bottom: Figure 4D from our Genome Urbanization paper (Tsochatzidou et al, Nucleic Acids Res, 2017). Even though the original HiC datasets are not the same (top: Noble lab 2017; bottom: Noble lab 2010), the maps show great similarity. The locations of the boundaries show significant discrepancies, as the ways to define them differ (see text below).

We were nevertheless right all along, as was only recently shown in a paper by the Noble lab (whose original data we had used in our analysis). In a paper that just came out in PNAS, Eser et al. show that TADs do exist in yeast and that they have some very interesting properties. Eser et al. use a new HiC dataset (it seems you cannot escape the curse of having to RE-do the experiments, even if you were the one to perform them originally) but apply a different method to call the boundaries. Their "coverage score" is interesting as an approach because it is insensitive to the resolution of the obtained boundaries (a problem we had to deal with by arbitrarily choosing a 10kb window) and leads to fewer TADs than we were able to define, although this is likely related to thresholds in the calling process (we used a 5%-percentile approach, while Eser et al. use a local-minimum function). Eser et al. find 41 TADs (we found 85) with a median size more than double the one we found (260kb vs our 100kb). The fact remains that there is significant coincidence between both the maps and the boundaries, as you can see in the Figure above (adapted from Figure 1B of Eser et al, 2017 and Figure 4D of our paper).

What is more important, the properties of the TAD boundaries in Eser et al. match our observations in many respects, as they are shown to be enriched in transcriptional activity (as originally shown by the group of Sergei Razin) and activating histone marks. Above all, Eser et al. report that regions between the defined TADs significantly overlap areas of topoII depletion, which constitutes an immediate link to our finding that genes that tend to be up-regulated by topoII inactivation are enriched in insulated regions (remember, this was our way to call TAD boundaries without using the term "TAD"). Thus, even though one of the main arguments in Eser et al. is that TADs in yeast are related more to DNA replication than to transcription, it seems that you cannot really do away with transcriptional effects in relation to chromatin organization, especially in a unicellular eukaryote genome where the two processes are expected to be more tightly connected.


In closing, we can now be confident that TADs, or TAD-like domains if you will, do exist in yeast and that they are inherently related to both DNA replication and transcription (even if perhaps indirectly). Our observations under topological stress lie at the interface between the two processes, as torsional stress accumulation inevitably affects DNA replication. Even more interesting, in our view, is the fact that the constraints embedded in the organization of genome architecture are reflected in the evolution of gene distribution along chromosomes, but then again we have already discussed this elsewhere. One last point: it is reassuring to see that you can constructively argue with a reviewer (provided the reviewer is reasonable) if your hypothesis is solid and supported by the data, and it is always nice to see you were right in the first place, even though sometimes courtesy towards a reviewer obliges you to be less audacious in the choice of terminology.

Standardization and availability of data come first. Reproducibility second.

4/9/2017

Last week a new paper came out from our group's collaboration with the group of Niki Kretsovali at the IMBB, FORTH. Appearing in Stem Cell Reports, this work by lead author and friend Christiana Hadjimichael shows that Promyelocytic Leukemia Protein (PML) is a key factor in the maintenance of pluripotency in mouse ES cells, exerting its control over a number of pathways, including that of TGF-β, at an early stage. The main point of this work is that it provides a link between a protein that has been extensively studied in a different context (its role in cell cycle and apoptosis) and a developmental process such as cell differentiation.

Our group's involvement in this work was principally related to the analysis of gene expression data, the identification and prioritization of differentially expressed genes and (at the revision stage) the comparison of our data with publicly available gene expression profiles in order to validate the main finding of the paper, that PML knockdown cells have a profile that is similar to differentiated epiblast-like cells. 
Antonis Klonizakis, an undergraduate student from our group, had to go through the mill of finding, analyzing and comparing public gene expression datasets to show that PML knockdown cells resemble a state of primed differentiation, lying intermediate between ES and epiblast cells. 
But the question that prompted this post was exactly this: why should Antonis have to go through the mill to do something that sounded perfectly straightforward in the first place? There are now thousands of available gene expression experiments conducted and reported every year. Why did he then suffer so much to locate just a couple to compare with Christiana's dataset? The answer is that when it comes to data availability and standardization, the situation is far from "straightforward". Starting from the beginning, I supervised most of Antonis' search only to realize that getting the data was much more difficult than we had initially thought.

First of all, there is biological variability that you cannot do away with. Stem cells come in different types, "flavours" and various cell lines that are as different from each other as they are from other cell types. Even then, after having located profiles of the same kind of stem cells we were dealing with, we were faced with problems related to the standardization of data processing. Many (most) papers failed to adequately report their data processing steps, and thus we were unable to reproduce the results they reported by analyzing the raw data ourselves. This may sound like an excuse for irreproducibility, but in my view it is the main reason behind many of the irreproducibility issues in research, which have recently caught the attention even of media such as the Wall Street Journal (as if they didn't have enough Wall Street-related problems to deal with already). Lack of standardization is a major issue for two reasons: first, it makes it very likely that the results you come up with by repeating a series of complicated processing steps (not thoroughly reported in the original paper) will not match the ones reported; second, it makes the whole idea of comparing data so discouraging that in many cases it is preferable to repeat the whole experiment yourself. To non-biomedically-oriented readers this may sound like an incredible waste of time, money and human resources, but it is so commonplace that it was the original suggestion of the reviewers of Christiana's paper: what they asked for was to conduct gene expression analyses in other ESC lines, for which data were surely already available.

Going beyond standardization, the situation becomes even worse when one considers the availability of data. In their search for datasets to compare with, Antonis and Christiana came across papers such as this one, for which the data were not only not standardized but not even reported. That is right! You skim the paper for GEO or ArrayExpress links and find none, you read it carefully, you go through the (rudimentary) supplementary data and still find nothing; then you (in this case I) write to both the corresponding author and the editors of a respectable journal, and you are still waiting for an answer three months later. Such situations may (and should) be unheard of in other contexts but are somehow acceptable in the highly competitive field of biomedicine. To people like us, though, who hope that the accumulation, cross-comparison and validation of data may be a way to acquire new knowledge, all this is particularly disturbing. Not only because it makes our work harder, but because it also makes everyone else's less significant.




Footballomics: A modified Gini Coefficient for Club Performance

4/8/2017

The last time we talked about Footballomics (or the analysis of football data) we discussed a marked disparity in the performance of a certain Premier League club (Liverpool FC) in dealing with top competitors as opposed to lesser opponents. In that post we saw that LFC were doing remarkably well against better teams, while the norm is that most clubs do better when playing inferior teams (as expected) and a few do equally well against all (PL leaders Chelsea are a notable example).
The observed disparity prompted me to attempt to summarise this trend and its fluctuation among clubs with a single value. The difference between top10/bottom10 or even top6/bottom6 may not be very useful, since the margins may vary depending on the skewed point distributions in the league. Some leagues may be tighter, while others (e.g. the Spanish or the German, not to mention the Greek) are one- or two-horse races. A concept that may be useful here is that of the Gini coefficient. To the uninitiated, the Gini coefficient is a measure of statistical dispersion. First introduced by Corrado Gini around the turn of the 20th century, it has earned significant attention at the turn of the 21st, since it can be used to describe distributional disparity, as it has been, repeatedly, in the case of income distributions. In a nutshell, the Gini index tells you whether a certain value is distributed evenly or in a highly skewed manner. Assuming that all N citizens of country X share the GDP equally, each earning GDP/N, the Gini is 0, while in the (much more likely) case that one person earns the entire GDP, leaving 0 to everybody else, the Gini index approaches 1. Real-life Gini coefficients range between 0.30 and 0.70.
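The GDP thought experiment is easy to make concrete. Below is a short, self-contained Python sketch (the helper gini is my own illustration, not code from any of our posts) using the standard sorted-rank identity for the Gini coefficient; note that the one-person-takes-all case gives exactly (N-1)/N, which only approaches 1 as N grows.

```python
import numpy as np

def gini(values):
    """Gini coefficient of a 1-D array of non-negative values.

    0 means the value is shared perfectly evenly; values approaching 1
    mean a single holder has almost everything.
    """
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Sorted-rank identity: G = sum((2i - n - 1) * x_i) / (n * sum(x))
    return float((2 * ranks - n - 1).dot(x) / (n * x.sum()))

# The GDP thought experiment from the text, with N = 4 citizens:
print(gini([25, 25, 25, 25]))  # equal shares -> 0.0
print(gini([0, 0, 0, 100]))    # one takes all -> (N-1)/N = 0.75
```

Feeding such a function a club's points earned against each opponent, instead of citizens' incomes, is all it takes to turn it into a performance-dispersion measure.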

The Question: How can we apply a Gini coefficient in the case of football performance?
If you are interested in knowing more about how a mundane statistic like the Gini index can provide insight into the performance of a club or the structure of a whole league, you may want to read on here.
