by Odd Erik Gundersen
The reproducibility crisis is real, and it is not only psychology that has to deal with it. All sciences are affected. The field of AI is not an exception.
In order to recover, one has to accept that there is a problem. This is the first step. Say after me: “The reproducibility crisis is real, also for AI.” You might not be convinced yet, so let me try to convince you.
In 2016, a poll was conducted on Nature’s website, and the results were reported in the journal Nature (Baker, 2016). The poll was a brief online questionnaire in which 1576 researchers participated. 52% of the respondents answered that there is a significant reproducibility crisis going on, while 38% thought there was a slight crisis, which makes me wonder about the term crisis. Could there be such a thing as a slight crisis? Nevertheless, only 3% believed that there is no crisis, and 7% did not know. This means that 90% of those taking the poll believe that there is an ongoing reproducibility crisis.
Other questions were asked as well. Almost 90% of the scientists doing chemistry had failed to reproduce other researchers’ experiments, while the number was just above 60% for respondents belonging to sciences other than those mentioned specifically in the article. For all groups, between 40% and 60% had failed to reproduce their own experiments (!). The respondents rated selective reporting as the factor contributing the most to irreproducible research; other important factors included pressure to publish, low statistical power and poor analysis.
Of course, there are problems related to online polls. Nonresponse bias is one such problem. Not all researchers visit Nature’s website, and among those who do there is a response bias: people feeling strongly about something are more likely to take the poll. The article says nothing about how they made sure that only researchers responded to the poll, so there is a coverage bias as well. At least the sample size is fair, so sampling bias should not be too problematic.
So, how about AI then? As part of the International Conference on Learning Representations (ICLR) in 2018, the Reproducibility Challenge was organized. The challenge was to reproduce papers submitted through the conference’s open review process – while it was ongoing. This allowed the participants of the challenge to easily communicate with the authors of the papers they tried to reproduce. In the end, 98 different researchers participated in the challenge. They were asked more or less the same questions as in the Nature poll.
Before the challenge started, 22% of the participants believed that there was a significant crisis while 49% considered it to be slight. 17% were not sure and 11% thought there was no crisis at all. Interestingly, the participants were asked whether their opinion had changed after participating in the Reproducibility Challenge. 51% stated that their opinion had not changed, 11% were not sure, 8% were less convinced and 30% were more convinced that there is a reproducibility crisis.
The biases of this study are less problematic than those of the Nature website poll. Also, the study shows that most of the AI researchers partaking in the challenge believed there is a significant or slight reproducibility crisis going on, and even more so after trying to actually reproduce the results presented in papers. Joelle Pineau presented these results as part of her ICLR 2018 keynote, which you can easily find on YouTube. If you have 45 minutes to spare, I suggest you give it a try. I found the keynote very interesting.
It is clear that reproducibility is tightly connected to the documentation of experiments. Any physics or chemistry student knows this; they spend their first years at university writing detailed lab reports. Needless to say, sharing is important as well. How else could colleagues know about the research? To evaluate the results, they need to know exactly what was investigated and how the experiments were conducted. The more details the documentation contains, the easier it is for independent researchers to reproduce the results. In itself, good documentation builds trust in the results. It also lowers the barriers for others to actually run the experiment themselves, as more detailed documentation reduces the effort required to conduct the experiment.
Given that reproducibility requires good documentation, it is alarming how poorly top AI research is documented. Sigbjørn Kjensmo and I conducted a study where we reviewed 400 papers from two installments of IJCAI and AAAI, which are considered to be among the most prestigious conferences in our field of AI (Gundersen and Kjensmo, 2018).
Our survey shows that AI research is not well documented. Around 70% of the research published at these conferences is empirical, yet neither hypotheses nor predictions are explicitly stated, even though these are the basis of the scientific method. The same goes for explicitly stating research questions and which research methods were used. Few papers explicitly state the objective of the research, and the problem being solved was stated in less than half of the papers we reviewed.
Both IJCAI and AAAI are general AI conferences where top research in very narrow domains is presented. When writing a paper for a general conference, one should state why the presented research is relevant and important even though everyone in the subfield is fully aware of this. I have read papers presented at these top conferences where I did not understand why the authors conducted the research; they never even hinted at which problem they solved and why it was relevant to me.
Given that AI is a fairly young field of science, the research methodology and analysis methods are still being experimented with. This is one of the factors John P. A. Ioannidis discusses in his famous 2005 paper Why most published research findings are false. He presents several reasons why most research findings are false for most research designs and for most fields. Let me mention a few.
It is no surprise that small sample sizes are a problem, but small effect sizes are as well. Effect size is a measure of how much better one method is when compared to another. This is worth remembering when a new method is only 0.5-1.5% better than the methods it is compared to, as this is in the small effect size range according to Ioannidis.
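To make this concrete, a common way to measure effect size is Cohen’s d: the difference in mean scores divided by the pooled standard deviation. The sketch below uses only the Python standard library, and the accuracy numbers are invented for illustration; they are not from any of the studies discussed here.

```python
import math
import statistics

def cohens_d(scores_a, scores_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = ((n_a - 1) * statistics.variance(scores_a)
                  + (n_b - 1) * statistics.variance(scores_b)) / (n_a + n_b - 2)
    return (statistics.mean(scores_a) - statistics.mean(scores_b)) / math.sqrt(pooled_var)

# Hypothetical accuracies over five runs of a new method and a baseline.
new_method = [0.93, 0.88, 0.91, 0.86, 0.92]  # mean 0.90
baseline = [0.91, 0.87, 0.90, 0.85, 0.92]    # mean 0.89
print(f"d = {cohens_d(new_method, baseline):.2f}")
```

With this much run-to-run variance, a one-percentage-point improvement in mean accuracy corresponds to d of roughly 0.34, which is conventionally considered a small effect.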
Problems are also related to the flexibility of study designs, definitions, analytical modes, and the hotness of the field. Generally, there is little focus on study design in AI, but some examples do exist, such as How evaluation guides AI research by Cohen and Howe (1988) and Cohen’s Empirical methods for artificial intelligence (1995). Do we as a community focus enough on research methods? Is this something we teach our master’s and PhD students?
Definitions are another area where the AI community could improve. Many of us have been involved in research that is described by terms such as context, pervasive computing, ambient intelligence and so on. I am not sure we agree on what these terms mean exactly. For example, Bazire and Brézillon (2005) present a study of 150 definitions of context. According to rumors, Brézillon is still counting, and the number of definitions has at least doubled since 2005. Some would say that the term artificial intelligence is not well-defined itself. Computational intelligence has even been proposed as a better term, but until AI Magazine is renamed CI Magazine, I think I will stick to AI.
The number of analytical modes in modern machine learning experiments is huge. The models being evaluated might have millions of parameters, and the space of possible hyperparameter settings is vast. Do our experiments find actual patterns that generalize, or are we just searching for hyperparameters that exploit random patterns existing in both the training and test sets?
When it comes to hotness, few fields can compete with AI these days. Just look at the number of papers submitted to the top conferences. Almost 8000 papers were submitted to AAAI 2019 and more than 4700 to IJCAI 2019; both conferences had a record number of submissions that year. It is a great time to be an AI professional, both in industry and academia, as research grants and project funding are abundant. According to Ioannidis, this affects research results: competition makes it more important to pursue and disseminate the most impressive positive results first, and when this happens, the focus on research methodology might slip. The competitiveness is not hard to relate to, at least for anyone doing research in deep learning.
Another part of documenting an experiment is the data. There has been a focus on data sharing since the UCI machine learning repository was created in 1987 by David Aha and fellow graduate students. Sharing data makes it possible for others to reproduce the experiments and to test their own ideas on standard data sets. In our survey, Kjensmo and I found that around half of the papers share the data used for conducting the experiment. We did not assess how many of them used standard data sets and how many released new ones, though.
Some argue that standard data sets that are free for everyone lead researchers to focus on making small incremental improvements to methods that solve the same problem, so the research gets very narrow. I acknowledge this sentiment. However, it does not mean that we should keep our data sets private, just that we should share the data we work on. This is not always possible, though. Privacy and competitive advantage are just two of the reasons sharing data is hard. However, all is not lost. According to Pineau et al. (2020), introducing a voluntary reproducibility checklist at NeurIPS and ICML increased the share of submissions that include code to around 75%.
Anyway, the AI community’s focus on open data sets has produced results, at least when compared to sharing code, which has received less attention. Even in the age of open source software, only 8% of the papers shared code, compared to the 56% that shared data, according to our study. Code repositories such as GitHub simplify sharing, and they are used by most developers, AI researchers included. Why are the numbers so low for open source experiments compared to open data?
For most of the experiments that are conducted in AI research, everything that is needed to run the experiment is available on computers. In theory, this should make reproducibility much easier. However, running experiments completely on computers does not solve everything when it comes to reproducibility. Henderson et al. (2018) discuss problems with reproducing results in deep reinforcement learning related to hyperparameters, random seeds and even which implementation of a baseline algorithm is used for comparison. Nagarajan et al. (2019) show how hard it is to get a deterministic algorithm to run on a GPU. Even when succeeding on one computer, the results are completely different, but still deterministic, on another computer.
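One small mitigation is to seed every source of randomness an experiment touches, and to document those seeds. A minimal sketch, using only the standard library; the NumPy and PyTorch lines that a real deep learning experiment would typically also need are left as comments, since they require those packages:

```python
import os
import random

def set_seed(seed: int) -> None:
    """Seed the random number generators an experiment depends on."""
    random.seed(seed)
    # PYTHONHASHSEED only takes effect if set before the interpreter starts,
    # e.g. in the launch script, but recording it here documents the intent.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # numpy.random.seed(seed)
    # torch.manual_seed(seed)
    # torch.use_deterministic_algorithms(True)  # may raise for some GPU ops

set_seed(42)
run_a = [random.random() for _ in range(3)]
set_seed(42)
run_b = [random.random() for _ in range(3)]
print(run_a == run_b)  # the same seed reproduces the same draws
```

As Henderson et al. (2018) and Nagarajan et al. (2019) show, seeding alone does not guarantee identical results across libraries or hardware, but reporting the seeds is a cheap first step.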
Floating-point calculation is a science of its own, and it causes a lot of pain for those of us who depend on millions of such calculations when running our experiments. Hong et al. (2013) found that changing operating systems, compilers and hardware led to the same variations in weather simulations as changing the initial conditions. Even getting code published by others to run is hard. Collberg and Proebsting (2016) tried to run the code of 402 experimental papers. They were successful in 32.3% of the cases without communicating with the authors. The number increased to 48.3% after communicating with the authors.
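A root cause of such variation is that floating-point addition is not associative, so anything that changes the order of operations, such as a parallel reduction on a GPU or a different compiler optimization, can change the result. A minimal illustration with IEEE 754 doubles in Python:

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 0.0 + 1.0 -> 1.0
right = a + (b + c)  # the 1.0 is absorbed: -1e16 + 1.0 rounds back to -1e16
print(left, right)   # 1.0 0.0
```

Repeated across the millions of additions in a single training step, such rounding differences accumulate, which is why bit-identical results across hardware and compilers are so hard to achieve.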
In order to recover, we have to accept that we have a problem. This is the first step. Say after me: “The reproducibility crisis is real, also for AI.”
Now, we can take the next step towards recovery.
- Baker, M. (2016). Is there a reproducibility crisis? A Nature survey lifts the lid on how researchers view the ‘crisis’ rocking science and what they think will help. Nature, 533(7604), 452-455.
- Bazire, M., & Brézillon, P. (2005). Understanding context before using it. In International and Interdisciplinary Conference on Modeling and Using Context (pp. 29-40). Springer, Berlin, Heidelberg.
- Cohen, P. R. (1995). Empirical methods for artificial intelligence (Vol. 139). Cambridge, MA: MIT press.
- Cohen, P. R., & Howe, A. E. (1988). How evaluation guides AI research: The message still counts more than the medium. AI magazine, 9(4), 35-35.
- Collberg, C., & Proebsting, T. A. (2016). Repeatability in computer systems research. Communications of the ACM, 59(3), 62-69.
- Gundersen, O. E., & Kjensmo, S. (2018). State of the art: Reproducibility in artificial intelligence. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018, April). Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Hong, S. Y., Koo, M. S., Jang, J., Esther Kim, J. E., Park, H., Joh, M. S., … & Oh, T. J. (2013). An evaluation of the software system dependency of a global atmospheric model. Monthly Weather Review, 141(11), 4165-4172.
- Ioannidis, J. P. (2005). Why most published research findings are false. PLoS medicine, 2(8), e124.
- Nagarajan, P., Warnell, G., & Stone, P. (2019). The Impact of Nondeterminism on Reproducibility in Deep Reinforcement Learning. Presented at the 2019 Workshop on Reproducible AI.
- Pineau, J., Vincent-Lamarre, P., Sinha, K., Larivière, V., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Larochelle, H. (2020). Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program).