Standardization and availability of data come first. Reproducibility second.

4/9/2017

Last week a new paper came out from our group's collaboration with the group of Niki Kretsovali at the IMBB, FORTH. Appearing in Stem Cell Reports, this work by lead author and friend Christiana Hadjimichael shows that Promyelocytic Leukemia Protein (PML) is a key factor in the maintenance of pluripotency in mouse ES cells, exerting control over a number of pathways, including that of TGF-β, at an early stage. The main point of this work is that it provides a link between a protein that has been extensively studied in a different context (its role in cell cycle and apoptosis) and a developmental process such as cell differentiation.

Our group's involvement in this work was principally related to the analysis of gene expression data, the identification and prioritization of differentially expressed genes and (at the revision stage) the comparison of our data with publicly available gene expression profiles in order to validate the main finding of the paper, that PML knockdown cells have a profile that is similar to differentiated epiblast-like cells. Antonis Klonizakis, an undergraduate student from our group, had to go through the mill of finding, analyzing and comparing public gene expression datasets to show that PML knockdown cells resemble a state of primed differentiation, lying intermediate between ES and epiblast cells.

But the question that prompted this post is exactly this: why did Antonis have to go through the mill to do something that sounded perfectly straightforward in the first place? Thousands of gene expression experiments are now conducted and reported every year. Why, then, did he suffer so much to locate just a couple to compare with Christiana's dataset? The answer is that when it comes to data availability and standardization the situation is far from "straightforward". I supervised most of Antonis' search from the beginning, only to realize that getting data was much more difficult than we had initially thought.

First of all, there is biological variability that you can't do away with. Stem cells come in different types, "flavours" and various cell lines that are as different from each other as they are from other cell types. Even after having located profiles of the same kind of stem cells we were dealing with, we were faced with problems that had to do with the standardization of data processing. Many (most) papers failed to adequately report their data processing steps, and thus we were unable to reproduce the results they reported by analyzing the raw data ourselves. This may sound like an excuse for irreproducibility, but in my view it is the main reason behind many of the irreproducibility issues in research, which have recently caught the attention even of media such as the Wall Street Journal (as if they didn't have enough Wall Street-related problems to deal with already). Lack of standardization is a major issue for two reasons. First, it makes it very likely that the results you come up with by repeating a series of complicated processing steps (that are not thoroughly reported in the original paper) will not match the ones reported. Second, it makes the whole idea of comparing data so discouraging that in many cases it is preferable to repeat the whole experiment yourself. To non-biomedically-oriented readers this may sound like an incredible waste of time, money and human resources, but it is so commonplace that it was the original suggestion by the reviewers of Christiana's paper. What they asked for was to conduct gene expression analyses in other ESC lines, for which data were surely available already.

Going beyond standardization, the situation becomes even worse when one considers the availability of data. In their search for datasets to compare with, Antonis and Christiana came up with papers such as this one, for which the data were not only not standardized but not even reported. That is right! You skim the paper for GEO or ArrayExpress links and find none, you read it carefully, you go through the (rudimentary) supplementary data and you still find nothing, then you (in this case I) write to both the corresponding author and the editors of a respectable journal and you are still waiting for an answer three months later. Such situations may (and should) be unheard of in other contexts but are somehow acceptable in the highly competitive field of biomedicine. To people like us, though, who are hoping that the accumulation, cross-comparison and validation of data may be a way to acquire new knowledge, all this is particularly disturbing. Not only because it makes our work harder, but because it also makes everyone else's less significant.
