Ten Quick Tips for Deep Learning in Biology

Benjamin D. Lee; Alexander J. Titus; Kun-Hsing Yu; Marc G. Chevrette; Paul Allen Stewart; Evan M. Cofer; Sebastian Raschka; Finlay Maguire; Benjamin J. Lengerich; Alexandr A. Kalinin; Anthony Gitter; Casey S. Greene; Simina M. Boca; Timothy J. Triche, Jr.; Thiago Britto-Borges; Elana J. Fertig; Michael D. Kessler; Alexandra J. Lee; Beth Signal; Juan Jose Carmona

Please note the current author order is chronological and does not reflect the final order.

Introduction

Machine learning is a modern approach to problem-solving and task automation. In particular, machine learning is concerned with the development and application of algorithms that can recognize patterns in data and use them for predictive modeling, as opposed to having domain experts develop rules for prediction tasks manually. Artificial neural networks are a particular class of machine learning algorithms and models that evolved into what we now describe as “deep learning”. Deep learning encompasses neural networks with many layers (hence the term “deep”) and the algorithms that make them perform well. These neural networks comprise artificial neurons arranged into layers and are modeled after the human brain, even though the building blocks and learning algorithms may differ [1]. Each layer receives input from previous layers (the first of which represents the input data) and then transmits a transformed version of its input to the subsequent layer. The process of “training” a neural network is the tuning of the layers’ weights to minimize a cost or loss function that serves as a surrogate of the prediction error. The loss function is differentiable so that the weights can be automatically updated to attempt to reduce the loss. Given the computational advances made in the last decade, deep learning can now be applied to massive data sets and in innumerable contexts. In many circumstances, it can learn more complex relationships and make more accurate predictions than other methods. As a result, deep learning has become its own subfield of machine learning. In the context of biological research, it has been increasingly used to derive novel insights from high-dimensional biological data [2]. For example, deep learning has been used to predict protein-drug binding kinetics [3], to identify the lab-of-origin of synthetic DNA [4], and to uncover the facial phenotypes of genetic disorders [5].

To make the biological applications of deep learning more accessible to scientists who have some experience with machine learning, we solicited input from a community of researchers with varied biological and deep learning interests. These individuals collaboratively contributed to this manuscript’s writing using the GitHub version control platform [6] and the Manubot manuscript generation toolset [7]. The goal was to articulate a practical, accessible, and concise set of guidelines and suggestions to follow when using deep learning (Figure 1). For readers who are new to machine learning, we recommend reviewing general machine learning principles [8] before getting started with deep learning.

In the course of our discussions, several themes became clear: the importance of understanding and applying machine learning fundamentals as a baseline for utilizing deep learning, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by deep learning, among others. The major similarities between deep learning and traditional computational methods also became apparent. Although deep learning is a distinct subfield of machine learning, it is still a subfield. It is subject to the many limitations inherent to machine learning, and most best practices for machine learning [9,10] also apply to deep learning. As with all computational methods, deep learning should be applied in a systematic manner that is reproducible and rigorously tested. Ultimately, the tips we collate range from high-level guidance to best practices for implementation. It is our hope that they will provide actionable, deep learning-specific instruction for both new and experienced deep learning practitioners. By making deep learning more accessible for use in biological research, we aim to improve the overall usage and reporting quality of deep learning in the literature and enable increasing numbers of researchers to effectively and accurately use these state-of-the-art techniques.

Tip 1: Decide whether deep learning is appropriate for your problem

In recent years, the number of projects and publications implementing deep learning in biology has risen tremendously [11,12,13]. Given deep learning’s usefulness across a range of scientific questions and data modalities, it may seem as though it is a panacea for nearly all modeling problems. Neural networks that underpin deep learning models are, in fact, universal function approximators and are therefore theoretically capable of learning the functions that relate almost any input and output variables [14,15]. However, deep learning is not suited to every modeling situation. The primary limiting factors are the training demands of neural network models, which require significant amounts of data, computing power, and programming as well as modeling expertise.

In the areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of high-quality data may be available. However, areas of biology that rely on manual data collection may not possess enough data to train and apply deep learning models effectively. Though there are methods that try to increase the amount of training data, such as data augmentation (in which existing data is slightly manipulated in an attempt to yield “new” samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [16], these methods cannot overcome substantial data shortages.
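
As one illustration of data augmentation for sequence data, the sketch below adds the reverse complement of each DNA sequence as an extra training example. This particular transformation is our assumption rather than something prescribed above, the helper names are hypothetical, and it is only appropriate when the label does not depend on strand. Augmented copies should be added to the training partition only.

```python
# Hypothetical helpers for illustration; not taken from any specific library.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence.

    Characters outside A/C/G/T (for example N) are left unchanged.
    """
    return sequence.translate(COMPLEMENT)[::-1]


def augment_with_reverse_complements(examples):
    """Return the original (sequence, label) pairs plus one
    reverse-complemented copy of each pair.

    Assumes the label is strand-independent, which does not hold for
    every prediction task.
    """
    augmented = list(examples)
    augmented.extend(
        (reverse_complement(sequence), label) for sequence, label in examples
    )
    return augmented


# Toy training examples (made-up sequences and labels).
training_examples = [("ACGTTG", 1), ("GGGCAT", 0)]
augmented_examples = augment_with_reverse_complements(training_examples)
```

Doubling the dataset in this way does not add independent observations, which is one reason augmentation cannot substitute for a substantially larger dataset.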

In the fields of computer vision and natural language processing, deep neural networks are routinely trained on sample sizes ranging from hundreds of thousands to millions of training examples. Datasets of this size are not available in many biological contexts. Still, it has been found that under certain circumstances, deep learning can be considered for datasets with at least one hundred samples per class [17]. However, it is best suited for datasets that contain orders of magnitude more samples.

Training deep learning models often requires extensive computing infrastructure and patience to achieve state-of-the-art performance [18]. In some deep learning contexts, such as generating human-like text, state-of-the-art models have over one hundred billion parameters [19] and require very costly and time-consuming training procedures [20]. These types of large language models are being used in biology to learn representations of protein sequences [21,22,23]. Even though most deep learning applications in biology rarely require this much training, they can still require computational resources beyond those available on consumer-grade devices such as laptops or office desktops. Specialized hardware such as discrete graphics processing units (GPUs) and custom deep learning accelerators can dramatically reduce the time and cost required to train models [13]. Still, this hardware is not universally accessible, and cloud-based rentals add cost and complexity. These specialized hardware solutions are likely to be more broadly available as deep learning becomes more popular. For example, recent-generation iPhones already have such hardware. In contrast to the large-scale computational demands of deep learning, traditional machine learning models can often be trained on laptops (or even on a $5 computer [24]) in seconds to minutes. Therefore, due to this enormous disparity in resource demand alone, traditional machine learning approaches may be desirable in various biological applications.

Beyond requiring more data and computational capacity, building and training deep learning models often requires more expertise than training traditional machine learning models. There are currently several popular programming frameworks for deep learning, such as TensorFlow [25] and PyTorch [26]. While these tools allow users to create and deploy entirely novel model architectures, this flexibility combined with the rapid development of the deep learning field has resulted in large and complex frameworks that can be daunting to new users. For readers new to software development but experienced in biology, gaining computational skills while interfacing with such complex industrial-grade tools can be a prohibitive challenge. Conversely, traditional machine learning methods are generally more straightforward to apply and are also accessible through popular frameworks [27]. There are currently more tools for automating the model selection and training process for traditional machine learning models than for deep learning models. For example, automated machine learning (AutoML) tools, such as TPOT [28] and Turi Create [29], are able to test and optimize multiple machine learning models automatically, and can allow users to achieve competitive performance with only a few lines of code. There are efforts underway to extend these and other automation frameworks to reduce the expertise required to build and use deep learning models. For example, TPOT, Turi Create, and AutoKeras [30] are already capable of abstracting away much of the programming required for “standard” deep learning tasks, and high-level interfaces such as Keras [31] and Fastai [32] make it increasingly straightforward to design and test custom deep learning architectures. In the future, projects such as these are likely to make deep learning accessible to a much wider swath of researchers.

Despite these limitations, deep learning is strongly indicated over traditional machine learning for specific research questions and problems. In general, these include problems that feature hidden patterns across the data, complex relationships, and interrelated variables. Problems in computer vision and natural language processing often exhibit these very features, which helps explain why these areas were some of the first to experience significant breakthroughs during the recent deep learning revolution [33]. As long as large amounts of accurate and labeled data are available, applications to areas of life sciences with related data characteristics, such as genetic medicine [34], radiology [35], microscopy [36], and pharmacovigilance [37], are similarly likely to benefit from deep learning techniques. For example, Ferreira et al. used deep learning to recognize individual birds from images [38] despite this problem being very difficult historically. By combining automatic data collection using radio-frequency identification tags with data augmentation and transfer learning, the authors were able to use deep learning to achieve 90% accuracy across several species. Another research area where deep learning excels is generative modeling, where new samples are created based on the training data [39]. For example, deep learning can generate realistic compendia of gene expression samples [40]. One other area of machine learning that has been revolutionized by deep learning is reinforcement learning, which is concerned with training agents to interact with an environment [41]. Reinforcement learning has been applied to design druglike small molecules [42]. Overall, initial evaluation as to whether similar problems (including analogous ones in other domains) have been solved successfully using deep learning can inform researchers about the potential for deep learning to address their needs.

On the other hand, depending on the amount and type of data available and the nature of the problem set, deep learning may not always outperform conventional methods. As an illustration, Rajkomar et al. [43] found that simpler baseline models achieved performance comparable with deep learning in several clinical prediction tasks using electronic health records. Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [44]. The researchers found that while well-tuned deep learning approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform deep learning as the dataset’s noise increases. Similarly, Chen et al. [45] tested deep learning and a variety of traditional machine learning methods such as logistic regression and random forests on five different clinical datasets. They found that traditional methods matched or exceeded the accuracy of the deep learning model in all cases despite requiring an order of magnitude less training time.

In conclusion, deep learning should only be used after a robust consideration of its strengths and weaknesses for the problem at hand. After choosing deep learning as a potential solution, practitioners should still consider traditional methods as performance baselines and use the scientific method to compare the performance of deep learning to that of traditional methods.

Tip 2: Use traditional methods to establish performance baselines

Deep learning requires practitioners to consider a larger number and variety of tuning parameters (that is, algorithmic settings) than more traditional machine learning methods. These settings are often called hyperparameters. Their extensiveness can make it easy to fall into the trap of performing an unnecessarily convoluted analysis. Hence, before applying deep learning to a given problem, we highly recommend implementing simpler models with fewer hyperparameters at the beginning of each study. Such models include logistic regression, random forests, k-nearest neighbors, Naive Bayes, and support vector machines. They can help to establish baseline performance expectations and the difficulty of a specific prediction problem. While performance baselines available from existing literature can also serve as helpful guides, an implementation of a simpler model that uses the same software framework as planned for deep learning can greatly help with assessing the correctness of data processing steps, performance evaluation pipelines, resource requirement estimates, and computational performance estimates. Furthermore, in some cases, it can even be useful to combine simpler baseline models with deep neural networks, as such hybrid models can improve generalization performance, model interpretability, and confidence estimation [46,47].

Another potential pitfall arises from comparing the performance of baseline conventional models trained with default settings with the performance of deep learning models that have undergone rigorous tuning and optimization. Since conventional off-the-shelf machine learning algorithms (for example, support vector machines and random forests) are also likely to benefit from hyperparameter tuning, such incongruity prevents the comparison of equally optimized models and can lead to false conclusions about model efficacy. Hu and Greene [48] discuss this under the umbrella of what they call the “Continental Breakfast Included” effect. They describe how the unequal tuning of hyperparameters across different learning algorithms can especially skew evaluation when the performance of an algorithm varies substantially with modest changes to its hyperparameters. Therefore, practitioners should tune the settings of both traditional machine learning and deep learning-based methods before making claims about relative performance differences. Performance comparisons among machine learning and deep learning models are only informative when the models are equally well optimized.
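
The following minimal sketch illustrates the idea of tuning a baseline rather than reporting it with a single default setting. It uses only the Python standard library so that it runs anywhere; in practice, an established machine learning library would be the more usual choice. The synthetic two-class data, the function names, and the candidate values of k are all assumptions made for illustration.

```python
import random
from collections import Counter


def majority_class_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in eval_labels) / len(eval_labels)


def knn_predict(train_x, train_y, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation points that k-nearest neighbors labels correctly."""
    correct = sum(
        knn_predict(train_x, train_y, x, k) == y for x, y in zip(eval_x, eval_y)
    )
    return correct / len(eval_y)


def make_synthetic_data(n_samples, rng):
    """Two Gaussian clouds in two dimensions (synthetic, for illustration only)."""
    features, labels = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        center = 2.0 * label
        features.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
        labels.append(label)
    return features, labels


rng = random.Random(0)
train_x, train_y = make_synthetic_data(200, rng)
tune_x, tune_y = make_synthetic_data(100, rng)

# Trivial reference point: how well does "always predict the majority" do?
baseline_accuracy = majority_class_accuracy(train_y, tune_y)

# Tune the baseline's hyperparameter (k) on the tuning set.
knn_results = {
    k: knn_accuracy(train_x, train_y, tune_x, tune_y, k) for k in (1, 5, 15)
}
best_k = max(knn_results, key=knn_results.get)
```

Any deep learning model evaluated later would then be compared against `knn_results[best_k]`, not against an untuned baseline.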

To sum this tip up, practitioners are encouraged to create and fully tune several traditional models and standard pipelines before implementing a deep learning model.

Tip 3: Understand the complexities of training deep neural networks

Correctly training deep neural networks is non-trivial. There are many different options and potential pitfalls at every stage. To get good results, one must often train networks across a wide range of different hyperparameter settings. Such training can be made more difficult by the demanding nature of these deep networks, which often require extensive time investments into tuning and computing infrastructure to achieve state-of-the-art performance [18]. Furthermore, this experimentation is often noisy, necessitating increased repetition and exacerbating the organizational challenges inherent to deep learning. On the whole, all code, random seeds, parameters, and results must be carefully corralled using general coding standards and best practices (for example, version control [49] and continuous integration [50]) to be reproducible and interpretable [51,52]. For application-based research, this organization is also fundamental to the efficient sharing of research work and the ability to keep models up to date as new data becomes available.

One specific reproducibility pitfall that is often missed in applying deep learning is the default use of non-deterministic algorithms by CUDA/cuDNN backends when using GPUs. That is, the CUDA/cuDNN libraries that facilitate the parallelized computing that powers state-of-the-art deep learning often use algorithms by default that produce different outcomes from run to run. Therefore, achieving reproducibility in this context requires explicitly specifying the use of deterministic algorithms (which are typically available within deep learning libraries). This is distinct from the setting of random seeds, which typically achieves reproducibility by controlling pseudorandom deterministic procedures such as shuffling and initialization [53].
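
As a hedged sketch of what this can look like in practice, the helper below sets random seeds and, if PyTorch happens to be installed, also requests deterministic algorithms. The function name is hypothetical, NumPy and PyTorch are treated as optional, and the exact settings needed may differ between library versions, so the relevant documentation should be consulted.

```python
import os
import random

# NumPy and PyTorch are optional so that this sketch runs without them.
try:
    import numpy as np
except ImportError:
    np = None
try:
    import torch
except ImportError:
    torch = None


def make_run_reproducible(seed: int) -> None:
    """Seed pseudorandom generators and request deterministic GPU algorithms.

    Hypothetical helper for illustration. Seeding controls shuffling and
    initialization; the determinism flags are a separate, additional step.
    """
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)
        # cuBLAS setting that PyTorch documents as required for deterministic
        # behavior on recent CUDA versions; ideally set before CUDA is first used.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
        # Disable cuDNN auto-tuning, which can select different algorithms per run.
        torch.backends.cudnn.benchmark = False
        if hasattr(torch, "use_deterministic_algorithms"):
            # Operations without a deterministic implementation will raise an
            # error instead of silently running non-deterministically.
            torch.use_deterministic_algorithms(True)


make_run_reproducible(42)
```

Deterministic algorithms can be slower than their default counterparts, so it may be reasonable to enable them for final, reported runs and record that choice alongside the seed.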

Similar to the suggestions above about starting with simpler models, try to start with a relatively small network and then increase the size and complexity as needed. This can help prevent practitioners from wasting significant time and resources on running highly complex models that feature numerous unresolved problems. Again, beware of the choices made implicitly (that is, by default settings) by deep learning libraries. These seemingly trivial details, such as the selection of optimization algorithm, can have significant effects on model performance. For example, adaptive methods often lead to faster convergence during training but may lead to worse generalization performance on independent datasets [54]. These nuanced elements are easy to overlook, but it is critical to consider them carefully and to evaluate their potential impact.

In short, use smaller and simpler networks to enable faster prototyping, follow general software development best practices to maximize reproducibility, and check software documentation to understand default choices.

Tip 4: Know your data and your question

Having a well defined scientific question and a clear analysis plan is crucial for carrying out a successful deep learning project. Just like it would be inadvisable to set foot in a laboratory and begin experiments without having a defined endpoint, a deep learning project should not be undertaken without defined goals. Foremost, it is important to assess if a dataset exists that can answer the biological question of interest using a deep learning-based approach. If so, obtaining this data (and associated metadata) and reviewing the study protocol should be pursued as early on in the project as possible. This can help to ensure that data is as expected and can prevent the wasted time and effort that occur when issues are discovered later on in the analytic process. For example, a publication or resource might purportedly offer an appropriate dataset that is found to be inadequate upon acquisition. The data may be unstructured when it is supposed to be structured, crucial metadata such as sample stratification might be missing, or the usable sample size may be different than expected. Any of these data issues might limit a researcher’s ability to use deep learning to address the biological question at hand or might otherwise require adjustment before it can be used. Data collection should also be carefully documented, or a data collection protocol should be created and specified in the project documentation.

Information about the resources used, download dates, and dataset versions is critical to preserve. Doing so will help to minimize operational confusion and will increase the reproducibility of the analysis. Best practices for reproducibility also include sharing the collected dataset and metadata upon publication of the study, ideally in a public dataset repository if there are no ethical or privacy concerns and no copyright restrictions. While recommended and recognized dataset repositories may differ across disciplines, a list of general dataset repositories includes the Dryad repository [55] (https://datadryad.org/), Figshare [56] (https://figshare.com), Zenodo [57] (https://zenodo.org), and the Open Science Framework [58] (https://osf.io). In addition, Gundersen et al. [59] provide useful checklists summarizing best data sharing practices for reproducible research and open science.
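
One lightweight way to preserve this information is to write a small provenance record next to each downloaded file. The sketch below is an assumption about how that might be done: the function names, the record fields, and the sidecar file naming are hypothetical, and the URL and version string in the demonstration are placeholders.

```python
import hashlib
import json
import tempfile
from datetime import date
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file without loading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(data_path, source_url, version, retrieved_on=None):
    """Write a JSON sidecar file describing where a dataset file came from.

    Hypothetical helper for illustration; the field names are our own choice.
    """
    data_path = Path(data_path)
    record = {
        "file": data_path.name,
        "source_url": source_url,
        "version": version,
        "retrieved_on": (retrieved_on or date.today()).isoformat(),
        "sha256": sha256_of_file(data_path),
        "size_bytes": data_path.stat().st_size,
    }
    sidecar = data_path.with_name(data_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return record


# Demonstration on a throwaway file; the URL and version are placeholders.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "expression.tsv"
    demo_file.write_bytes(b"gene\tvalue\nTP53\t1.5\n")
    demo_record = write_provenance(
        demo_file,
        "https://example.org/dataset",
        "v1.0",
        retrieved_on=date(2020, 1, 2),
    )
    sidecar_exists = demo_file.with_name("expression.tsv.provenance.json").exists()
```

The checksum makes it possible to verify later that an analysis was run on exactly the file that was originally downloaded.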

Once the dataset is obtained, it is important to learn why and how the data were collected before beginning analysis. The standardized metadata that exists in many fields can help with this (for example, see [60]). If at all possible, we recommend consulting with a subject matter expert who has experience with the type of data being used. Doing so will minimize guesswork and is likely to increase the success rate of a deep learning project. For example, one might presume that data collected to test the impact of an intervention derives from a randomized controlled trial. However, this is not always the case, as ethical or practical concerns often necessitate an observational study design that features prospectively or retrospectively collected data. In order to ensure similar distributions of important characteristics across study groups in the absence of randomization, such a study may have selected individuals in a fashion that best matches attributes such as age, gender, or weight. Passively collected datasets can have their own peculiarities, and other study designs can include samples that originate from the same study site, the oversampling of ethnic groups or zip codes, or sample processing differences. Such information is critical to accurate data analysis, and so it is imperative that practitioners learn about study design assumptions and data specificities prior to performing modeling. Other study design considerations that should not be overlooked include knowing whether a study involves biological or technical replicates or both. For example, the existence in a dataset of samples collected from the same individuals at different time points can have significant effects on analyses that make assumptions about sample size and independence (that is, non-independence can lower the effective sample size). 
Another potential issue is the existence of systematic biases, which can be induced by confounding variables and can lead to artifacts or so-called “batch effects.” Consequently, models may learn to rely on the correlations that these systematic biases underpin, even though they are irrelevant to the scientific context of the study. This can lead to misguided predictions and misleading conclusions [61]. Unsupervised learning and other exploratory analyses can help identify such biases in these datasets before applying a deep learning model.
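
When a dataset contains several samples from the same individual, as discussed above, one common precaution is to keep all of an individual's samples on the same side of any data split. The sketch below shows this idea using only the standard library; the function name and the made-up individual identifiers are assumptions for illustration.

```python
import random


def split_by_group(sample_groups, test_fraction=0.2, seed=0):
    """Split sample indices so that no group appears in both partitions.

    sample_groups[i] is the group (for example, the individual) that
    sample i belongs to. The fraction applies to groups, not samples,
    so the resulting sample proportions are only approximate.
    """
    groups = sorted(set(sample_groups))
    random.Random(seed).shuffle(groups)
    n_test_groups = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test_groups])
    train_indices = [
        i for i, group in enumerate(sample_groups) if group not in test_groups
    ]
    test_indices = [
        i for i, group in enumerate(sample_groups) if group in test_groups
    ]
    return train_indices, test_indices


# Made-up identifiers: several individuals contribute more than one sample.
individual_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
train_idx, test_idx = split_by_group(individual_ids, test_fraction=0.33, seed=1)

train_individuals = {individual_ids[i] for i in train_idx}
test_individuals = {individual_ids[i] for i in test_idx}
```

The same grouping idea can be applied to other sources of non-independence, such as study site or processing batch, if those are recorded in the metadata.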

Overall, practitioners should thoroughly study their data and understand its context and peculiarities before moving on to performing deep learning.

Tip 5: Choose an appropriate data representation and neural network architecture

Neural network architecture refers to the number and types of layers in the network and how they are connected. While certain best practices have been established by the research community [62], architecture design choices remain largely problem-specific and empirical, requiring extensive experimentation. Furthermore, as deep learning is a quickly evolving field, many recommendations are short-lived and are frequently replaced by newer insights supported by recent empirical results. This is further complicated by the fact that many recommendations do not generalize well across different problems and datasets. Therefore, choosing how to represent data and design an architecture is closer to an art than a science. That said, there are some general principles to follow when experimenting.

First and foremost, use your knowledge of the available data and your question to inform your data representation and architectural design choices. For example, if the dataset is an array of measurements with no natural ordering of inputs (such as gene expression data), multilayer perceptrons may be effective. These are the most basic type of neural network. They are able to learn complex non-linear relationships across the input data despite their relative simplicity. Similarly, if the dataset consists of images, convolutional neural networks (CNNs) are a good choice because they emphasize local structures and adjacency within the data. CNNs may also be a good choice for learning on sequences. Recent empirical evidence suggests that they can outperform canonical sequence learning techniques such as recurrent neural networks and the closely related long short-term memory networks in some cases [63]. Transformers are powerful sequence models [64] but require extensive data and computing power to train from scratch. Accessible high-level overviews of different neural network architectures are provided in [65] and [66].

Deep learning models typically benefit from increasing the amount of labeled training data. Large amounts of data help to avoid overfitting and increase the likelihood of achieving top performance on a given task. If there is not enough data available to train a well-performing model, consider using transfer learning. In transfer learning, a model whose weights were generated by training on another dataset is used as the starting point for training [67]. Transfer learning is most useful when the pre-training and target datasets are of similar nature [67]. For this reason, it is important to search for similar datasets that are already available. However, even when similar datasets do not exist, transferring features can still improve model performance compared with random feature initialization. For example, Rajkomar et al. showed advantages of ImageNet-pretraining [68] for a model that is applied to grayscale medical image classification [69]. Pre-trained models can be obtained from public repositories, such as Kipoi [70] for genomics or Hugging Face [71] for biomedical text [72], protein sequences [22], and chemicals [73]. Moreover, learned features could be helpful even when a pre-training task is different from a target task [74].

Recently, the concept of self-supervised learning, which is closely related to pre-training and transfer learning, has seen an increase in popularity [75]. Self-supervised learning leverages large amounts of unlabeled data and uses naturally available information as labels for supervised learning. Thus, self-supervised learning is sometimes also described as autonomous supervised learning. Using self-supervised learning, a model can be pre-trained on a related task before it is trained on the target task. Another related approach is multi-task learning, which simultaneously trains a network for multiple separate tasks that share features. In fact, multi-task learning can be used separately or even in combination with transfer learning [76].

This tip can be distilled into two main action points: first, base the network’s architecture on knowledge of the problem and, second, take advantage of similar existing data or pre-trained deep learning models.

Tip 6: Tune your hyperparameters extensively and systematically

Given at least one hidden layer, a non-linear activation function, and a large number of hidden units, multi-layer neural networks can approximate arbitrary continuous functions that relate input and output variables [15,77]. Deeper architectures that feature additional hidden layers and an increasing number of overall hidden units and learnable weight parameters (the so-called increasing “capacity” of neural networks) allow for solving increasingly complex problems. However, this increased capacity results in many more parameters to fit and hyperparameters to tune, which can pose additional challenges during model training. In general, one should expect to systematically evaluate the impact of numerous hyperparameters when applying deep neural networks to new data or challenges. Hyperparameters typically manifest as choices of optimization algorithms, loss function, learning rate, activation functions, number of hidden layers and hidden units, size of the training batches, and weight initialization schemes. Moreover, additional hyperparameters are introduced by common techniques that facilitate training deeper architectures. These include regularization penalties, dropout [78], and batch normalization [79], which can reduce the effect of the so-called vanishing or exploding gradient problem when working with deep neural networks.

This wide array of potential parameters can make it difficult to evaluate the extent to which neural network methods are well suited to solving a task. It can be unclear to practitioners whether previous successful applications were the result of interactions between unique data attributes and specific hyperparameter settings. A lack of clarity on how extensive arrays of hyperparameters were tested or chosen can hamper method developers as they attempt to compare techniques. This effect also has implications for those seeking to use existing deep learning methods, as performance estimates from deep neural networks are often provided after tuning. The implication is that attaining performance numbers that match those reported in publications is likely to require significant effort towards time-consuming hyperparameter optimization. Strategies for tuning hyperparameters include exhaustive grid search, random search, Bayesian optimization, and other specialized techniques. Tools such as Keras Tuner (https://keras-team.github.io/keras-tuner/) and Ray Tune (https://docs.ray.io/en/latest/tune/index.html) support best practices for hyperparameter optimization.
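
To make one of these strategies concrete, the sketch below implements a bare-bones random search. The search space, the function names, and especially the objective are assumptions: `tuning_loss` is a stand-in formula where a real project would train a model with the given configuration and return its loss on the tuning set.

```python
import math
import random


def sample_configuration(rng):
    """Draw one hyperparameter configuration from a hypothetical search space."""
    return {
        # Learning rates are sampled log-uniformly between 1e-5 and 1e-1.
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "hidden_units": rng.choice([32, 64, 128, 256]),
        "dropout": rng.uniform(0.0, 0.5),
    }


def tuning_loss(config):
    """Stand-in objective for illustration only.

    In a real analysis this would train the network with `config` and
    return the loss measured on the tuning (validation) set.
    """
    return (
        (math.log10(config["learning_rate"]) + 3) ** 2
        + (config["dropout"] - 0.2) ** 2
        + 64 / config["hidden_units"]
    )


def random_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one.

    Every trial is kept so that the full search can be reported.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_configuration(rng)
        trials.append((tuning_loss(config), config))
    best_loss, best_config = min(trials, key=lambda trial: trial[0])
    return best_config, best_loss, trials


best_config, best_loss, trials = random_search(n_trials=20, seed=0)
```

Keeping the seed, the search space, and the complete list of trials makes it straightforward to report both the selected hyperparameters and how they were found.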

To get the best performance from your model, be sure to systematically optimize your hyperparameters on your training dataset. Report both the selected hyperparameters and the hyperparameter optimization strategy.

Tip 7: Address deep neural networks’ increased tendency to overfit the dataset

Overfitting is a challenge inherent to machine learning in general and is one of the most significant challenges you’ll face when applying deep learning specifically. Overfitting occurs when a model fits patterns in the training data so closely that it includes non-generalizable noise or non-scientifically relevant perturbations in the relationships it learns. In other words, the model fits patterns that are overly specific to the data it is training on rather than learning general relationships that hold across similar datasets. This subtle distinction is made clearer by seeing what happens when a model is tested on data to which it was not exposed during training: just as a student who memorizes exam materials struggles to correctly answer questions for which they have not studied, a machine learning model that has overfit to its training data will perform poorly on unseen test data. Deep learning models are particularly susceptible to overfitting due to their relatively large number of parameters and associated representational capacity. Just as some students may have greater potential for memorization, deep learning models seem more prone to overfitting than machine learning models with fewer parameters. However, having a large number of parameters does not always imply that a neural network will overfit [80].

In general, one of the most effective ways to combat overfitting is to detect it in the first place. One way to do this is to split the main dataset being worked on into three independent parts: a training set, a tuning set (also commonly called a validation set in the machine learning literature), and a test set. These three partitions allow you to optimize models by iterating between model learning on the training set and hyperparameter evaluation on the tuning set without affecting the final model assessment on the test set. A researcher can compare the model’s performance on the training and tuning data to assess how overfit (i.e. non-generalizable) the model is. The data used for testing should be “locked away” and used only once to evaluate the final model after all training and tuning steps are completed. This type of approach is necessary for evaluating the generalizability of models without the biases that can arise from learning and testing on the same data [81,82]. While a slight drop in performance from the training set to the test set is normal, a significant drop is a clear sign of overfitting. See Figure 2 for a visual demonstration of an overfit model that performs poorly on test data.
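The three-way partition described above can be sketched as follows. The 70/15/15 proportions, the function name `three_way_split`, and the sample count are illustrative assumptions, not recommendations from the text; the right proportions depend on the dataset.

```python
import random


def three_way_split(n_samples, tune_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle sample indices once and cut them into three disjoint sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    n_tune = round(n_samples * tune_fraction)
    test = indices[:n_test]                  # locked away until the very end
    tune = indices[n_test:n_test + n_tune]   # used for hyperparameter evaluation
    train = indices[n_test + n_tune:]        # used for model learning
    return train, tune, test


train_idx, tune_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(tune_idx), len(test_idx))
```

Fixing the seed and storing the resulting indices keeps the test set identical across experiments, so it can remain untouched until the final evaluation.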

There are a variety of techniques to reduce overfitting, including data augmentation and regularization techniques [83,84]. A complementary way to diagnose overfitting, as described by Chuang and Keiser, is to identify the baseline level of memorization that is occurring by training on data that has its labels randomly shuffled [85]. By comparing the model performance with the shuffled data to that achieved with the actual data, a practitioner can identify overfitting as a model that performs no better on real data. Such a result suggests that any predictive capacity is not due to data-driven signal. One important caveat when working with partitioned data is the need to apply transformation and normalization procedures equally to all datasets. The parameters required for such procedures (for example, quantile normalization, a common standardization method when analyzing gene-expression data) should only be derived from the training data, and not from the tuning or test data. Additionally, many conventional metrics for classification (e.g. area under the receiver operating characteristic curve or AUROC) have limited utility in cases of extreme class imbalance [86]. Therefore, model performance should be evaluated with a carefully picked panel of relevant metrics that make minimal assumptions about the composition of the testing data [87].
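The label-shuffling baseline can be illustrated with a toy example. This is a sketch under stated assumptions, not the procedure used in [85]: the data are synthetic, and a 1-nearest-neighbor classifier stands in for a deep model because it memorizes its training set by construction. All function names are hypothetical.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data: one feature with class means 0 and 4."""
    labels = [rng.randint(0, 1) for _ in range(n)]
    features = [rng.gauss(4.0 * y, 1.0) for y in labels]
    return features, labels


def predict_1nn(train_x, train_y, x):
    """Return the label of the training sample closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def accuracy(train_x, train_y, eval_x, eval_y):
    """Fraction of evaluation samples that are classified correctly."""
    hits = sum(
        predict_1nn(train_x, train_y, x) == y for x, y in zip(eval_x, eval_y)
    )
    return hits / len(eval_y)


rng = random.Random(0)
train_x, train_y = make_data(300, rng)
tune_x, tune_y = make_data(300, rng)

# Baseline: identical features, but the training labels are randomly permuted.
shuffled_y = train_y[:]
rng.shuffle(shuffled_y)

real_acc = accuracy(train_x, train_y, tune_x, tune_y)
shuffled_acc = accuracy(train_x, shuffled_y, tune_x, tune_y)
print(f"real labels: {real_acc:.2f}, shuffled labels: {shuffled_acc:.2f}")
```

In this toy setting the model trained on real labels clearly outperforms the shuffled-label baseline on the tuning set. If the two numbers were close, the apparent performance would not reflect signal in the data.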

When working with biological and medical data, one must also carefully consider potential sources of bias and/or non-independence when defining training and test sets. For example, a deep learning model for pneumonia detection in chest X-rays appeared to perform well within the hospitals providing the training data, but then failed to generalize to other hospitals [88]. This resulted from the deep learning model picking up on signal related to which hospital the images were from and represents a type of artifact or “batch effect” that practitioners must be vigilant towards. When dealing with sequence data, holding out test data that are evolutionarily related or that share structural homology to the training data can result in overfitting that is hard to detect due to the inherent relatedness of the partitioned data [89]. In such situations, simply holding out test data selected from a random partition of the training data can be insufficient. The best remedy for identifying confounding variables is to know your data and to test models on truly independent data.
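One way to reduce this kind of leakage is to hold out whole groups (for example, hospitals or sequence families) rather than individual samples. The sketch below assumes that a group label is available for every sample; the function name `group_split` and the hospital labels are hypothetical.

```python
import random


def group_split(groups, test_fraction=0.25, seed=0):
    """Assign whole groups to either the training set or the test set."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test_groups = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical example: 12 images collected at 4 hospitals.
hospital = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"]
train_idx, test_idx = group_split(hospital)
train_hospitals = {hospital[i] for i in train_idx}
test_hospitals = {hospital[i] for i in test_idx}
print("train:", sorted(train_hospitals), "test:", sorted(test_hospitals))
```

Because no hospital contributes to both partitions, a model that merely learns hospital-specific artifacts should score poorly on the test set instead of appearing to generalize.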

In essence, practitioners should split data into training, tuning, and single-use testing sets to assess the performance of the model on data that can provide a reliable estimate of its generalization performance. Furthermore, be cognizant of the danger of skewed or biased data artificially inflating performance.

Tip 8: Deep learning models can be made more transparent

While model interpretability is a broad concept, in much of the machine learning literature it refers to the ability to identify the discriminative features that influence or sway the predictions. In certain cases, the goal behind interpretation is to understand the underlying data generating processes and biological mechanisms [90]. In other cases, the goal is to understand why a model made the prediction that it did for a specific example or set of examples. Machine learning models vary widely in terms of interpretability: some are fully transparent while others are considered “black-boxes” that make predictions with little ability to examine why. Logistic regression and decision tree models are generally considered interpretable [91]. In contrast, deep neural networks are often considered among the most difficult to interpret naively because they can have many parameters and non-linear relationships.

Knowing which of the input variables influences a model’s predictions, and potentially in what ways, can help with the application or extrapolation of machine learning models. This is particularly important in biomedicine. Subsequent decision making often requires human input, and models are employed with the hope of better understanding why relationships exist in the first place. Furthermore, while prediction rules can be derived from high-throughput molecular datasets, most affordable clinical tests still rely on lower-dimensional measurements of a limited number of biomarkers. Therefore, it is often unclear how to translate the predictive capacity of deep learning models that encompass non-linear relationships between countless input variables into clinically digestible terms. As a result, selecting which biomarkers to use for decision making remains an important modeling and interpretation challenge. In fact, many authors attribute a lower uptake of deep learning tools in healthcare to interpretability challenges [92,93]. Nonetheless, strategies to interpret both machine learning and deep learning models are rapidly emerging, and the literature on the topic is growing exponentially [94]. Instead of recommending specific methods for either deep learning-specific or general-purpose model interpretation, we suggest consulting a freely available and continually updated textbook [95].
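Purely to illustrate what "identifying which input variables influence predictions" can look like in practice, and not as a recommendation of any particular method, the following sketch computes a permutation-based importance score. The `model` function is a hypothetical stand-in for a trained network, and the data are synthetic.

```python
import random


def model(row):
    """Hypothetical trained model: uses feature 0 and ignores feature 1."""
    return 2.0 * row[0]


def mse(rows, targets):
    """Mean squared error of the model on the given samples."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, feature, rng):
    """Increase in error after shuffling one feature column across samples."""
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [
        r[:feature] + [value] + r[feature + 1:] for r, value in zip(rows, column)
    ]
    return mse(permuted, targets) - mse(rows, targets)


rng = random.Random(0)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
targets = [2.0 * r[0] + rng.gauss(0, 0.1) for r in rows]

importances = [permutation_importance(rows, targets, f, rng) for f in (0, 1)]
print(f"feature 0: {importances[0]:.3f}, feature 1: {importances[1]:.3f}")
```

Shuffling the feature that the model relies on degrades its predictions, while shuffling the ignored feature leaves the error unchanged. Scores like these describe what the model uses, not what causes the outcome.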

Tip 9: Don’t over-interpret predictions

After training an accurate deep learning model, it is natural to want to use it to deduce relationships and inform scientific findings. However, be careful to interpret the model’s predictions correctly. Given that deep learning models can be difficult to interpret intuitively, there is often a temptation to over-interpret the predictions in indulgent or inaccurate ways. In accordance with the classic statistical saying “correlation doesn’t imply causation,” predictions by deep learning models rarely provide causal relationships. Accurately predicting an outcome does not mean that a causal mechanism has been learned, even when predictions are extremely accurate. In a poignant example, authors evaluated the capacities of several models to predict the probability of death for patients with pneumonia admitted to an intensive care unit [96,97]. The neural network model achieved the best predictive accuracy. However, after fitting a rule-based model to understand the relationships inherent to their data better, the authors discovered that the hospital data implied the rule “$\text{HasAsthma}(x) \Rightarrow \text{LowerRisk}(x)$.” This rule contradicts medical understanding, as having asthma does not make pneumonia better! Nonetheless, the data supported this rule, as pneumonia patients with a history of asthma tended to receive more aggressive care. The neural network had, therefore, also learned to make predictions according to this rule despite the fact that it has nothing to do with causality or mechanism. Therefore, it would have been disastrous to guide treatment decisions according to the predictions of the neural network, even though the neural network had high predictive accuracy.
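A small simulation can make the asthma example concrete. Every number below is invented for illustration and is not taken from the cited study [96,97]: the sketch simply assumes that asthma raises the risk of death, that aggressive care lowers it, and that patients with asthma are much more likely to receive aggressive care.

```python
import random

rng = random.Random(0)
records = []
for _ in range(200_000):
    asthma = rng.random() < 0.2
    # Assumption: patients with asthma usually receive aggressive care.
    aggressive_care = rng.random() < (0.9 if asthma else 0.1)
    # Assumption: asthma adds risk, aggressive care removes more than that.
    risk = 0.10 + (0.05 if asthma else 0.0) - (0.08 if aggressive_care else 0.0)
    died = rng.random() < risk
    records.append((asthma, aggressive_care, died))


def death_rate(rows):
    """Observed fraction of patients who died."""
    return sum(died for _, _, died in rows) / len(rows)


with_asthma = [r for r in records if r[0]]
without_asthma = [r for r in records if not r[0]]
print(f"all patients: asthma {death_rate(with_asthma):.3f}, "
      f"no asthma {death_rate(without_asthma):.3f}")

# Comparing patients who received the same level of care reverses the pattern.
stratified = {}
for care in (False, True):
    rate_asthma = death_rate([r for r in with_asthma if r[1] == care])
    rate_other = death_rate([r for r in without_asthma if r[1] == care])
    stratified[care] = (rate_asthma, rate_other)
    print(f"aggressive_care={care}: asthma {rate_asthma:.3f}, "
          f"no asthma {rate_other:.3f}")
```

In the pooled data, asthma is associated with lower mortality, so a purely predictive model would be rewarded for learning that association. Within each level of care the association reverses, which is the pattern that matters for treatment decisions.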

Avoid over-interpreting deep learning models by viewing them for what they are: complex statistical models trained on high dimensional data. If causal inference is desired, special techniques for causal inference are required [98].

Tip 10: Actively consider the ethical implications of your work

While deep learning continues to be a powerful, transformative tool within life sciences research—spanning basic biology and pre-clinical science to varied translational approaches and clinical studies—it is important to comment on ethical considerations. For instance, despite the fact that deep learning methods are helping to increase medical efficiency through improved diagnostic capability and risk assessment, certain biases may be inadvertently introduced into models related to patient age, race, and gender [99]. Deep learning practitioners may make use of datasets not representative of diverse populations and patient characteristics [100], thereby contributing to these problems.

Therefore, it is important to think thoroughly and cautiously about deep learning applications and their potential impact on persons and society, remaining mindful of possible harms, injuries, injustices, and other types of wrongdoing. At a minimum, practitioners must ensure that, wherever relevant, their life sciences projects are fully compliant with local research governance and approval policies, legal requirements, Institutional Review Board policies, and any other relevant bodies and standards. Moreover, we offer below three tangible, action-oriented recommendations to further empower and enrich deep learning researchers.

First, just as it is a best practice to keep a project-specific or programming-related issue tracker detailing known bugs and other technical issues, practitioners should get into the habit of keeping an active ethics register. In this register, ethical concerns can be raised, recorded, and resolved, exactly as software problems are triaged and fixed. Because projects using deep learning usually rely on writing code, an ethics register can be a part of the issue tracker in the version control system for the software itself. By colocating the two, practitioners can operationalize the concept that ethical problems are “bugs” that must be resolved, not nice-to-haves that can be considered at some indefinite point in the future. For practitioners intending to distribute trained models, having an ethics register can also facilitate creating a model card [101], a short document specifying the domains in which the model’s performance was validated (for example, which model organism was used), how the performance was benchmarked, and known limitations and concerns. Second, to help foster a conscious ethics-oriented mindset, researchers should consider expanding journal clubs to include scholarly and popular articles detailing real-world ethics issues relevant to their scientific fields. This will help researchers to think more holistically and judiciously about their work and its implications. Third, we encourage individual- and team-level participation in professional societies [102] and other types of organizations [103] and events [104] related to the domains of AI and data ethics as well as bioethics. This will encourage a sense of community and intellectual engagement, keeping practitioners abreast of cutting-edge insights and emerging professional standards.

Furthermore, practitioners may encounter datasets that cannot be shared, such as ones for which there would be significant ethical or legal issues associated with their release [105]. Examples of such data include classified or confidential data, biological data related to trade secrets, medical records, or other personally identifiable information [106]. While deep learning models can capture information-rich abstractions of multiple features of the data during the training process, these features may be more prone to leak the data that they were trained on if the model is shared or allowed to be queried with arbitrary inputs [107,108]. In other words, the complex relationships learned about the input data can potentially be used to infer characteristics about the original dataset. The strengths that imbue deep learning with its great predictive capacity also raise the level of risk surrounding data privacy. Therefore, while there is tremendous promise for deep learning techniques to extract information that cannot readily be captured by traditional methods [109], it is imperative not to share models trained on sensitive data. This also holds true for certain traditional machine learning methods that learn by capturing specific details of the full training data (for example, k-nearest neighbors models).

Techniques to train deep neural networks without sharing unencrypted access to data are being advanced through implementations of homomorphic encryption, which serves to enable equivalent prediction on data that is encrypted end-to-end [110,111]. Privacy-preserving techniques [112], such as differential privacy [113,114,115], can help to mitigate risks as long as the assumptions underlying these techniques are met. These methods provide a path towards a future where trained models and their predictions can be shared, but more software development and theoretical advances will be required to make these techniques easy to apply correctly in many settings. Unless you use these techniques, do not share the weights or provide arbitrary access to the predictions of models trained on sensitive data.

Conclusion

Collectively, our manuscript is focused on the promotion of practical tips distilled from cutting-edge insights and evolving professional standards to advance the efficient and optimal application of deep learning within research. It is evident that some of our points (see Tips 7, 8, 9, and 10) are intimately linked to safeguarding against key risks: for example, introduction/perpetuation of bias, overinterpretation/misinterpretation of models, poor generalizability, and potential for harm unto others—which can have a mix of ethical, legal, and social implications. If leveraged in ethical and responsible ways, deep learning techniques have the potential to add value within a diverse array of research and healthcare contexts, as these techniques have already shown remarkable capacity to meet or exceed the performance of human effort or older algorithms across fields and subdisciplines. Beyond merely achieving good predictive performance in certain tasks, deep learning has the potential to uncover high-impact biological and clinical insights, fundamentally driving research discoveries and delivery of new products to market. Yet, to realize its full potential, deep learning must be approached by all with genuine thoughtfulness, caution, and responsibility.

Through the tips and recommendations provided within this manuscript, we hope to encourage a prudent, vigilant community of computational practitioners, experimental biologists, and clinical scientists: colleagues who, before excitedly stitching together lines of code and datasets, first pause to think, dialogue, plan, and discern how their work might have far-reaching consequences with ethical dimensions. This holistic approach will help us to advance accountability, beneficence, and quality in science.

Thus, we aim not only to increase the accessibility of deep learning techniques within the life sciences, but also to improve upon the reproducibility and interpretability of high-quality deep learning research in the literature and scientific community—especially given that published findings, models, and datasets will be leveraged to yield innovative tools, services, and products in the marketplace. Indeed, we hope that these tips will serve as a powerful engine for promoting meaningful discussions, reflections, team learnings, and best practices to drive collaboration that fosters cutting-edge deep learning innovation, sensibly and responsibly.

Acknowledgements

The authors would like to thank Daniel Himmelstein and the developers of Manubot for creating the software that enabled the collaborative composition of this manuscript. We would also like to thank [TODO: insert the names of the contributors who don’t meet the standards for authorship] for their contributions to the discussions that comprised the initial stage of the drafting process.

References

1. Backpropagation and the brain
Timothy P. Lillicrap, Adam Santoro, Luke Marris, Colin J. Akerman, Geoffrey Hinton
Nature Reviews Neuroscience (2020-04-17) https://doi.org/ggsc7t
DOI: 10.1038/s41583-020-0277-3 · PMID: 32303713

2. Opportunities and obstacles for deep learning in biology and medicine
Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P. Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, … Casey S. Greene
Journal of The Royal Society Interface (2018-04-04) https://doi.org/gddkhn
DOI: 10.1098/rsif.2017.0387 · PMID: 29618526 · PMCID: PMC5938574

3. VAMPnets for deep learning of molecular kinetics
Andreas Mardt, Luca Pasquali, Hao Wu, Frank Noé
Nature Communications (2018-01-02) https://doi.org/gcvf62
DOI: 10.1038/s41467-017-02388-1 · PMID: 29295994 · PMCID: PMC5750224

4. Deep learning to predict the lab-of-origin of engineered DNA
Alec A. K. Nielsen, Christopher A. Voigt
Nature Communications (2018-08-07) https://doi.org/gd27sw
DOI: 10.1038/s41467-018-05378-z · PMID: 30087331 · PMCID: PMC6081423

5. Identifying facial phenotypes of genetic disorders using deep learning
Yaron Gurovich, Yair Hanani, Omri Bar, Guy Nadav, Nicole Fleischer, Dekel Gelbman, Lina Basel-Salmon, Peter M. Krawitz, Susanne B. Kamphausen, Martin Zenker, … Karen W. Gripp
Nature Medicine (2019-01-07) https://doi.org/czdm
DOI: 10.1038/s41591-018-0279-0 · PMID: 30617323

6. Benjamin-Lee/deep-rules GitHub repository
Benjamin Lee
GitHub (2018) https://github.com/Benjamin-Lee/deep-rules

7. Open collaborative writing with Manubot
Daniel S. Himmelstein, Vincent Rubinetti, David R. Slochower, Dongbo Hu, Venkat S. Malladi, Casey S. Greene, Anthony Gitter
PLOS Computational Biology (2019-06-24) https://doi.org/c7np
DOI: 10.1371/journal.pcbi.1007128 · PMID: 31233491 · PMCID: PMC6611653

8. Python Machine Learning, Third Edition: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2
Sebastian Raschka, Vahid Mirjalili
PACKT Publishing Limited (2019)
ISBN: 9781789955750

9. Ten quick tips for machine learning in computational biology
Davide Chicco
BioData Mining (2017-12-08) https://doi.org/gdb9wr
DOI: 10.1186/s13040-017-0155-3 · PMID: 29234465 · PMCID: PMC5721660

10. The Secrets of Machine Learning: Ten Things You Wish You Had Known Earlier to be More Effective at Data Analysis
Cynthia Rudin, David Carlson
arXiv (2019-06-06) https://arxiv.org/abs/1906.01998

11. Rise of Deep Learning for Genomic, Proteomic, and Metabolomic Data Integration in Precision Medicine
Dmitry Grapov, Johannes Fahrmann, Kwanjeera Wanichthanarak, Sakda Khoomrung
OMICS: A Journal of Integrative Biology (2018-10) https://doi.org/gfjjgn
DOI: 10.1089/omi.2018.0097 · PMID: 30124358 · PMCID: PMC6207407

12. Deep Learning Techniques: An Overview
Amitha Mathew, P. Amudha, S. Sivakumari
Advances in Intelligent Systems and Computing (2021) https://doi.org/ghjtg6
DOI: 10.1007/978-981-15-3383-9_54

13. Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence
Sebastian Raschka, Joshua Patterson, Corey Nolet
Information (2020-04-04) https://doi.org/ghjtg8
DOI: 10.3390/info11040193

14. Approximation by superpositions of a sigmoidal function
G. Cybenko
Mathematics of Control, Signals, and Systems (1989-12) https://doi.org/dp3968
DOI: 10.1007/bf02551274

15. Approximation capabilities of multilayer feedforward networks
Kurt Hornik
Neural Networks (1991) https://doi.org/dzwxkd
DOI: 10.1016/0893-6080(91)90009-t

16. Data Programming: Creating Large Training Sets, Quickly
Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré
arXiv (2016-05-25) https://arxiv.org/abs/1605.07723v3

17. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?
Junghwan Cho, Kyewook Lee, Ellie Shin, Garry Choy, Synho Do
arXiv (2016-01-11) https://arxiv.org/abs/1511.06348

18. Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel S. Emer
Proceedings of the IEEE (2017-12) https://doi.org/gcnp38
DOI: 10.1109/jproc.2017.2761740

19. Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, … Dario Amodei
arXiv (2020-07-24) https://arxiv.org/abs/2005.14165

20. Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell, Ananya Ganesh, Andrew McCallum
arXiv (2019-06-07) https://arxiv.org/abs/1906.02243

21. ProGen: Language Modeling for Protein Generation
Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, Richard Socher
arXiv (2020-04-08) https://arxiv.org/abs/2004.03497

22. ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, … Burkhard Rost
arXiv (2020-07-22) https://arxiv.org/abs/2007.06225

23. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus
Cold Spring Harbor Laboratory (2020-12-15) https://doi.org/gf2x4p
DOI: 10.1101/622803

24. A Machine Learning Driven IoT Solution for Noise Classification in Smart Cities
Yasser Alsouda, Sabri Pllana, Arianit Kurti
arXiv (2018-09-05) https://arxiv.org/abs/1809.00238

25. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, … Xiaoqiang Zheng
arXiv (2016-03-17) https://arxiv.org/abs/1603.04467

26. PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, … Soumith Chintala
arXiv (2019-12-05) https://arxiv.org/abs/1912.01703

27. Scikit-learn: Machine Learning in Python
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, … Édouard Duchesnay
Journal of Machine Learning Research (2011) http://jmlr.org/papers/v12/pedregosa11a.html

28. Automating Biomedical Data Science Through Tree-Based Pipeline Optimization
Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, Jason H. Moore
Lecture Notes in Computer Science (2016) https://doi.org/ggfptv
DOI: 10.1007/978-3-319-31204-0_9

29. apple/turicreate
Apple
(2021-02-01) https://github.com/apple/turicreate

30. Auto-Keras: An Efficient Neural Architecture Search System
Haifeng Jin, Qingquan Song, Xia Hu
arXiv (2019-03-27) https://arxiv.org/abs/1806.10282

31. Keras: the Python deep learning API https://keras.io/

32. Fastai: A Layered API for Deep Learning
Jeremy Howard, Sylvain Gugger
Information (2020-02-16) https://doi.org/ggmbms
DOI: 10.3390/info11020108

33. ImageNet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Communications of the ACM (2017-05-24) https://doi.org/gbhhxs
DOI: 10.1145/3065386

34. A Deep Learning Approach for Predicting Antidepressant Response in Major Depression Using Clinical and Genetic Biomarkers
Eugene Lin, Po-Hsiu Kuo, Yu-Li Liu, Younger W.-Y. Yu, Albert C. Yang, Shih-Jen Tsai
Frontiers in Psychiatry (2018-07-06) https://doi.org/gdv7r2
DOI: 10.3389/fpsyt.2018.00290 · PMID: 30034349 · PMCID: PMC6043864

35. Deep learning with convolutional neural network in radiology
Koichiro Yasaka, Hiroyuki Akai, Akira Kunimatsu, Shigeru Kiryu, Osamu Abe
Japanese Journal of Radiology (2018-03-01) https://doi.org/ggb3tf
DOI: 10.1007/s11604-018-0726-3 · PMID: 29498017

36. Deep learning microscopy
Yair Rivenson, Zoltán Göröcs, Harun Günaydin, Yibo Zhang, Hongda Wang, Aydogan Ozcan
Optica (2017-11-20) https://doi.org/gf8dhj
DOI: 10.1364/optica.4.001437

37. Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts
Anne Cocos, Alexander G Fiks, Aaron J Masino
Journal of the American Medical Informatics Association (2017-07) https://doi.org/gbp9nj
DOI: 10.1093/jamia/ocw180 · PMID: 28339747 · PMCID: PMC7651964

38. Deep learning‐based methods for individual recognition in small birds
André C. Ferreira, Liliana R. Silva, Francesco Renna, Hanja B. Brandl, Julien P. Renoult, Damien R. Farine, Rita Covas, Claire Doutrelant
Methods in Ecology and Evolution (2020-07-26) https://doi.org/d438
DOI: 10.1111/2041-210x.13436

39. Deep generative models: Survey
Achraf Oussidi, Azeddine Elhassouny
Institute of Electrical and Electronics Engineers (IEEE) (2018-04) https://doi.org/ghjtg7
DOI: 10.1109/isacv.2018.8354080

40. Correcting for experiment-specific variability in expression compendia can remove underlying signals
Alexandra J Lee, YoSon Park, Georgia Doing, Deborah A Hogan, Casey S Greene
GigaScience (2020-11-03) https://doi.org/ghhtpf
DOI: 10.1093/gigascience/giaa117 · PMID: 33140829 · PMCID: PMC7607552

41. Deep Reinforcement Learning that Matters
Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
arXiv (2019-01-31) https://arxiv.org/abs/1709.06560

42. Optimization of Molecules via Deep Reinforcement Learning
Zhenpeng Zhou, Steven Kearnes, Li Li, Richard N. Zare, Patrick Riley
Scientific Reports (2019-07-24) https://doi.org/ggfqc8
DOI: 10.1038/s41598-019-47148-x · PMID: 31341196 · PMCID: PMC6656766

43. Scalable and accurate deep learning with electronic health records
Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J. Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, … Jeffrey Dean
npj Digital Medicine (2018-05-08) https://doi.org/gdqcc8
DOI: 10.1038/s41746-018-0029-1 · PMID: 31304302 · PMCID: PMC6550175

44. Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data
Alexios Koutsoukas, Keith J. Monaghan, Xiaoli Li, Jun Huan
Journal of Cheminformatics (2017-06-28) https://doi.org/gfwv4d
DOI: 10.1186/s13321-017-0226-y · PMID: 29086090 · PMCID: PMC5489441

45. Deep learning and alternative learning strategies for retrospective real-world clinical data
David Chen, Sijia Liu, Paul Kingsbury, Sunghwan Sohn, Curtis B. Storlie, Elizabeth B. Habermann, James M. Naessens, David W. Larson, Hongfang Liu
npj Digital Medicine (2019-05-30) https://doi.org/ghfwhh
DOI: 10.1038/s41746-019-0122-0 · PMID: 31304389 · PMCID: PMC6550223

46. Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning
Nicolas Papernot, Patrick McDaniel
arXiv (2018-03-14) https://arxiv.org/abs/1803.04765

47. To Trust Or Not To Trust A Classifier
Heinrich Jiang, Been Kim, Melody Y. Guan, Maya Gupta
arXiv (2018-10-30) https://arxiv.org/abs/1805.11783

48. Parameter tuning is a key part of dimensionality reduction via deep variational autoencoders for single cell RNA transcriptomics
Qiwen Hu, Casey S. Greene
World Scientific Pub Co Pte Lt (2018-11) https://doi.org/gf5pc7
DOI: 10.1142/9789813279827_0033

49. Ten Simple Rules for Taking Advantage of Git and GitHub
Yasset Perez-Riverol, Laurent Gatto, Rui Wang, Timo Sachsenberg, Julian Uszkoreit, Felipe da Veiga Leprevost, Christian Fufezan, Tobias Ternent, Stephen J. Eglen, Daniel S. Katz, … Juan Antonio Vizcaíno
PLOS Computational Biology (2016-07-14) https://doi.org/gbrb39
DOI: 10.1371/journal.pcbi.1004947 · PMID: 27415786 · PMCID: PMC4945047

50. Reproducibility of computational workflows is automated using continuous analysis
Brett K Beaulieu-Jones, Casey S Greene
Nature Biotechnology (2017-03-13) https://doi.org/f9ttx6
DOI: 10.1038/nbt.3780 · PMID: 28288103 · PMCID: PMC6103790

51. Ten Simple Rules for Reproducible Computational Research
Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, Eivind Hovig
PLoS Computational Biology (2013-10-24) https://doi.org/pjb
DOI: 10.1371/journal.pcbi.1003285 · PMID: 24204232 · PMCID: PMC3812051

52. Ten Simple Rules for Reproducible Research in Jupyter Notebooks
Adam Rule, Amanda Birmingham, Cristal Zuniga, Ilkay Altintas, Shih-Cheng Huang, Rob Knight, Niema Moshiri, Mai H. Nguyen, Sara Brin Rosenthal, Fernando Pérez, Peter W. Rose
arXiv (2018-10-19) https://arxiv.org/abs/1810.08055

53. Deep Learning SDK Documentation
NVIDIA
(2018-11-01) https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility

54. Phonetic Classification and Recognition Using the Multi-Layer Perceptron
Hong Leung, James Glass, Michael Phillips, Victor W. Zue
Advances in Neural Information Processing Systems (1991) https://proceedings.neurips.cc/paper/1990/file/3dd48ab31d016ffcbf3314df2b3cb9ce-Paper.pdf

55. The Dryad Digital Repository: Published evolutionary data as part of the greater data ecosystem
Todd Vision
Nature Precedings (2010-06-30) https://doi.org/ghk2km
DOI: 10.1038/npre.2010.4595.1

56. FigShare
Jatinder Singh
Journal of Pharmacology and Pharmacotherapeutics (2011) https://doi.org/cvqv67
DOI: 10.4103/0976-500x.81919 · PMID: 21772785 · PMCID: PMC3127351

57. Zenodo, an Archive and Publishing Repository: A tale of two herbarium specimen pilot projects
Mathias Dillen, Quentin Groom, Donat Agosti, Lars Nielsen
Biodiversity Information Science and Standards (2019-06-18) https://doi.org/ghk2kn
DOI: 10.3897/biss.3.37080

58. Open Science Framework (OSF)
Erin D. Foster, Ariel Deardorff
Journal of the Medical Library Association (2017-04-04) https://doi.org/gfxvhq
DOI: 10.5195/jmla.2017.88 · PMCID: PMC5370619

59. On Reproducible AI: Towards Reproducible Research, Open Science, and Digital Scholarship in AI Publications
Odd Erik Gundersen, Yolanda Gil, David W. Aha
AI Magazine (2018-09-28) https://doi.org/gfcqt6
DOI: 10.1609/aimag.v39i3.2816

60. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data
Alvis Brazma, Pascal Hingamp, John Quackenbush, Gavin Sherlock, Paul Spellman, Chris Stoeckert, John Aach, Wilhelm Ansorge, Catherine A. Ball, Helen C. Causton, … Martin Vingron
Nature Genetics (2001-12) https://doi.org/ck257n
DOI: 10.1038/ng1201-365 · PMID: 11726920

61. Tackling the widespread and critical impact of batch effects in high-throughput data
Jeffrey T. Leek, Robert B. Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W. Evan Johnson, Donald Geman, Keith Baggerly, Rafael A. Irizarry
Nature Reviews Genetics (2010-09-14) https://doi.org/cfr324
DOI: 10.1038/nrg2825 · PMID: 20838408 · PMCID: PMC3880143

62. Neural Networks: Tricks of the Trade
Lecture Notes in Computer Science
(2012) https://doi.org/gfvtvt
DOI: 10.1007/978-3-642-35289-8

63. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Shaojie Bai, J. Zico Kolter, Vladlen Koltun
arXiv (2018-04-20) https://arxiv.org/abs/1803.01271

64. Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
arXiv (2017-12-07) https://arxiv.org/abs/1706.03762

65. Machine learning and AI-based approaches for bioactive ligand discovery and GPCR-ligand recognition
Sebastian Raschka, Benjamin Kaufman
Methods (2020-08) https://doi.org/ghk2mf
DOI: 10.1016/j.ymeth.2020.06.016 · PMID: 32645448

66. Deep learning
Yann LeCun, Yoshua Bengio, Geoffrey Hinton
Nature (2015-05-27) https://doi.org/bmqp
DOI: 10.1038/nature14539 · PMID: 26017442

67. How transferable are features in deep neural networks?
Jason Yosinski, Jeff Clune, Yoshua Bengio, Hod Lipson
Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (2014-12-08) https://dl.acm.org/doi/abs/10.5555/2969033.2969197

68. ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, … Li Fei-Fei
International Journal of Computer Vision (2015-04-11) https://doi.org/gcgk7w
DOI: 10.1007/s11263-015-0816-y

69. High-Throughput Classification of Radiographs Using Deep Convolutional Neural Networks
Alvin Rajkomar, Sneha Lingam, Andrew G. Taylor, Michael Blum, John Mongan
Journal of Digital Imaging (2016-10-11) https://doi.org/gcgk7v
DOI: 10.1007/s10278-016-9914-9 · PMID: 27730417 · PMCID: PMC5267603

70. Kipoi: accelerating the community exchange and reuse of predictive models for genomics
Žiga Avsec, Roman Kreuzhuber, Johnny Israeli, Nancy Xu, Jun Cheng, Avanti Shrikumar, Abhimanyu Banerjee, Daniel S. Kim, Lara Urban, Anshul Kundaje, … Julien Gagneur
Cold Spring Harbor Laboratory (2018-07-24) https://doi.org/gd24sx
DOI: 10.1101/375345

71. Transformers: State-of-the-Art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, … Alexander Rush
Association for Computational Linguistics (ACL) (2020) https://doi.org/ghs3bd
DOI: 10.18653/v1/2020.emnlp-demos.6

72. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, Hoifung Poon
arXiv (2020-08-24) https://arxiv.org/abs/2007.15779

73. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction
Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar
arXiv (2020-10-26) https://arxiv.org/abs/2010.09885

74. CNN Features Off-the-Shelf: An Astounding Baseline for Recognition
Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, Stefan Carlsson
Institute of Electrical and Electronics Engineers (IEEE) (2014-06) https://doi.org/f3np4s
DOI: 10.1109/cvprw.2014.131

75. Fast and robust segmentation of white blood cell images by self-supervised learning
Xin Zheng, Yong Wang, Guoyou Wang, Jianguo Liu
Micron (2018-04) https://doi.org/gdfh65
DOI: 10.1016/j.micron.2018.01.010 · PMID: 29425969

76. Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis
Wenlu Zhang, Rongjian Li, Tao Zeng, Qian Sun, Sudhir Kumar, Jieping Ye, Shuiwang Ji
IEEE Transactions on Big Data (2020-06-01) https://doi.org/gfvs28
DOI: 10.1109/tbdata.2016.2573280

77. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function
Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, Shimon Schocken
Neural Networks (1993-01) https://doi.org/bjjdg2
DOI: 10.1016/s0893-6080(05)80131-5

78. Dropout: a simple way to prevent neural networks from overfitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
The Journal of Machine Learning Research (2014-01-01) http://dl.acm.org/citation.cfm?id=2670313

79. Batch normalization: accelerating deep network training by reducing internal covariate shift
Sergey Ioffe, Christian Szegedy
Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (2015-07-06) https://dl.acm.org/citation.cfm?id=3045118.3045167

80. Reconciling modern machine-learning practice and the classical bias–variance trade-off
Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal
Proceedings of the National Academy of Sciences (2019-08-06) https://doi.org/gf5dmw
DOI: 10.1073/pnas.1903070116 · PMID: 31341078 · PMCID: PMC6689936

81. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
Sebastian Raschka
arXiv (2020-11-12) https://arxiv.org/abs/1811.12808

82. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms
Thomas G. Dietterich
Neural Computation (1998-10-01) https://doi.org/fqc9w5
DOI: 10.1162/089976698300017197 · PMID: 9744903

83. Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
Journal of Machine Learning Research (2014) http://jmlr.org/papers/v15/srivastava14a.html

84. A simple weight decay can improve generalization
Anders Krogh, John A. Hertz
Proceedings of the 4th International Conference on Neural Information Processing Systems (1991-12-02) http://dl.acm.org/citation.cfm?id=2986916.2987033
ISBN: 9781558602229

85. Adversarial Controls for Scientific Machine Learning
Kangway V. Chuang, Michael J. Keiser
ACS Chemical Biology (2018-10-19) https://doi.org/gfk9mh
DOI: 10.1021/acschembio.8b00881 · PMID: 30336670

86. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets
Takaya Saito, Marc Rehmsmeier
PLOS ONE (2015-03-04) https://www.ncbi.nlm.nih.gov/pubmed/25738806
DOI: 10.1371/journal.pone.0118432 · PMID: 25738806 · PMCID: PMC4349800

87. Comparison of Deep Learning With Multiple Machine Learning Methods and Metrics Using Diverse Drug Discovery Data Sets
Alexandru Korotcov, Valery Tkachenko, Daniel P. Russo, Sean Ekins
Molecular Pharmaceutics (2017-11-13) https://doi.org/gcj4p2
DOI: 10.1021/acs.molpharmaceut.7b00578 · PMID: 29096442 · PMCID: PMC5741413

88. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study
John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, Eric Karl Oermann
PLOS Medicine (2018-11-06) https://doi.org/gfj53h
DOI: 10.1371/journal.pmed.1002683 · PMID: 30399157 · PMCID: PMC6219764

89. Correct machine learning on protein sequences: a peer-reviewing perspective
Ian Walsh, Gianluca Pollastri, Silvio C. E. Tosatto
Briefings in Bioinformatics (2016-09) https://doi.org/f89ms7
DOI: 10.1093/bib/bbv082 · PMID: 26411473

90. Machine Learning to Identify Flexibility Signatures of Class A GPCR Inhibition
Joseph Bemister-Buffington, Alex J. Wolf, Sebastian Raschka, Leslie A. Kuhn
Biomolecules (2020-03-14) https://doi.org/ghm636
DOI: 10.3390/biom10030454 · PMID: 32183371 · PMCID: PMC7175283

91. Automated Inference of Chemical Discriminants of Biological Activity
Sebastian Raschka, Anne M. Scott, Mar Huertas, Weiming Li, Leslie A. Kuhn
Methods in Molecular Biology (2018) https://doi.org/ghk2pg
DOI: 10.1007/978-1-4939-7756-7_16 · PMID: 29594779

92. Deep Learning for Health Informatics
Daniele Ravi, Charence Wong, Fani Deligianni, Melissa Berthelot, Javier Andreu-Perez, Benny Lo, Guang-Zhong Yang
IEEE Journal of Biomedical and Health Informatics (2017-01) https://doi.org/gfgtzx
DOI: 10.1109/jbhi.2016.2636665 · PMID: 28055930

93. Towards trustable machine learning
Nature Biomedical Engineering (2018-10-10) https://doi.org/gfw9cn
DOI: 10.1038/s41551-018-0315-x · PMID: 31015650

94. On Interpretability of Artificial Neural Networks: A Survey
Fenglei Fan, Jinjun Xiong, Mengzhou Li, Ge Wang
arXiv (2020-12-02) https://arxiv.org/abs/2001.02522

95. Interpretable Machine Learning
Christoph Molnar
https://christophm.github.io/interpretable-ml-book/

96. An evaluation of machine-learning methods for predicting pneumonia mortality
Gregory F. Cooper, Constantin F. Aliferis, Richard Ambrosino, John Aronis, Bruce G. Buchanan, Richard Caruana, Michael J. Fine, Clark Glymour, Geoffrey Gordon, Barbara H. Hanusa, … Peter Spirtes
Artificial Intelligence in Medicine (1997-02) https://doi.org/b6vnmd
DOI: 10.1016/s0933-3657(96)00367-3

97. Intelligible Models for HealthCare
Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, Noemie Elhadad
Association for Computing Machinery (ACM) (2015-08-10) https://doi.org/gftgxk
DOI: 10.1145/2783258.2788613

98. When causal inference meets deep learning
Yunan Luo, Jian Peng, Jianzhu Ma
Nature Machine Intelligence (2020-08-12) https://doi.org/ghfwxq
DOI: 10.1038/s42256-020-0218-x

99. Deep Ethical Learning: Taking the Interplay of Human and Artificial Intelligence Seriously
Anita Ho
Hastings Center Report (2019-01) https://doi.org/ggsqtt
DOI: 10.1002/hast.977 · PMID: 30790317

100. The Legal And Ethical Concerns That Arise From Using Complex Predictive Analytics In Health Care
I. Glenn Cohen, Ruben Amarasingham, Anand Shah, Bin Xie, Bernard Lo
Health Affairs (2014-07) https://doi.org/f6dggf
DOI: 10.1377/hlthaff.2014.0048 · PMID: 25006139

101. Model Cards for Model Reporting
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, Timnit Gebru
Association for Computing Machinery (ACM) (2019-01-29) https://doi.org/gftgjg
DOI: 10.1145/3287560.3287596

102. American Society for Bioethics and Humanities https://asbh.org/

103. 10 organizations leading the way in ethical AI — SAGE Ocean | Big Data, New Tech, Social Science (2021-01-12) https://web.archive.org/web/20210112231619/https://ocean.sagepub.com/blog/10-organizations-leading-the-way-in-ethical-ai

104. Artificial Intelligence, Ethics, and Society — Home https://www.aies-conference.com/2021/

105. Ten simple rules for responsible big data research
Matthew Zook, Solon Barocas, danah boyd, Kate Crawford, Emily Keller, Seeta Peña Gangadharan, Alyssa Goodman, Rachelle Hollander, Barbara A. Koenig, Jacob Metcalf, … Frank Pasquale
PLOS Computational Biology (2017-03-30) https://doi.org/gdqfcn
DOI: 10.1371/journal.pcbi.1005399 · PMID: 28358831 · PMCID: PMC5373508

106. Responsible, practical genomic data sharing that accelerates research
James Brian Byrd, Anna C. Greene, Deepashree Venkatesh Prasad, Xiaoqian Jiang, Casey S. Greene
Nature Reviews Genetics (2020-07-21) https://doi.org/gg7c57
DOI: 10.1038/s41576-020-0257-5 · PMID: 32694666

107. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures
Matt Fredrikson, Somesh Jha, Thomas Ristenpart
Association for Computing Machinery (ACM) (2015-10-12) https://doi.org/cwdm
DOI: 10.1145/2810103.2813677

108. Membership Inference Attacks against Machine Learning Models
Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov
arXiv (2017-04-04) https://arxiv.org/abs/1610.05820

109. Convolutional Networks on Graphs for Learning Molecular Fingerprints
David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams
arXiv (2015-11-04) https://arxiv.org/abs/1509.09292

110. SIG-DB: Leveraging homomorphic encryption to securely interrogate privately held genomic databases
Alexander J. Titus, Audrey Flower, Patrick Hagerty, Paul Gamble, Charlie Lewis, Todd Stavish, Kevin P. O’Connell, Greg Shipley, Stephanie M. Rogers
PLOS Computational Biology (2018-09-04) https://doi.org/gd6xd5
DOI: 10.1371/journal.pcbi.1006454 · PMID: 30180163 · PMCID: PMC6138421

111. Towards the AlexNet Moment for Homomorphic Encryption: HCNN, the First Homomorphic CNN on Encrypted Data with GPUs
Ahmad Al Badawi, Jin Chao, Jie Lin, Chan Fook Mun, Jun Jie Sim, Benjamin Hong Meng Tan, Xiao Nan, Khin Mi Mi Aung, Vijay Ramaseshan Chandrasekhar
arXiv (2020-08-20) https://arxiv.org/abs/1811.00778

112. A generic framework for privacy preserving deep learning
Theo Ryffel, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Daniel Rueckert, Jonathan Passerat-Palmbach
arXiv (2018-11-14) https://arxiv.org/abs/1811.04017

113. Deep Learning with Differential Privacy
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, Li Zhang
Association for Computing Machinery (ACM) (2016-10-24) https://doi.org/gcrnp3
DOI: 10.1145/2976749.2978318

114. Privacy-preserving generative deep neural networks support clinical data sharing
Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, Casey S. Greene
Cold Spring Harbor Laboratory (2018-12-20) https://doi.org/gcnzrn
DOI: 10.1101/159756

115. Privacy-Preserving Distributed Deep Learning for Clinical Data
Brett K. Beaulieu-Jones, William Yuan, Samuel G. Finlayson, Zhiwei Steven Wu
arXiv (2018-12-05) https://arxiv.org/abs/1812.01484

Authors