This manuscript (permalink) was automatically generated from Benjamin-Lee/deep-rules@3a5ef62 on January 14, 2021.
Please note the current author order is chronological and does not reflect the final order.
Benjamin D. Lee
0000-0002-7133-8397
·
Benjamin-Lee
In-Q-Tel Labs; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health; Nuffield Department of Medicine, University of Oxford
Alexander J. Titus
0000-0002-0145-9564
·
AlexanderTitus
University of New Hampshire; Bioeconomy.XYZ
Kun-Hsing Yu
0000-0001-9892-8218
·
khyu
Department of Biomedical Informatics, Harvard Medical School
Marc G. Chevrette
0000-0002-7209-0717
·
chevrm
·
wildtypeMC
Wisconsin Institute for Discovery and Department of Plant Pathology, University of Wisconsin-Madison
Paul Allen Stewart
0000-0003-0882-308X
·
pstew
Biostatistics and Bioinformatics Shared Resource, Moffitt Cancer Center
Evan M. Cofer
0000-0003-3877-0433
·
evancofer
Lewis-Sigler Institute for Integrative Genomics, Princeton University; Graduate Program in Quantitative and Computational Biology, Princeton University
Sebastian Raschka
0000-0001-6989-4493
·
rasbt
Department of Statistics, University of Wisconsin-Madison
Finlay Maguire
0000-0002-1203-9514
·
fmaguire
Faculty of Computer Science, Dalhousie University
Benjamin J. Lengerich
0000-0001-8690-9554
·
blengerich
Computer Science Department, Carnegie Mellon University
Alexandr A. Kalinin
0000-0003-4563-3226
·
alxndrkalinin
Department of Computational Medicine and Bioinformatics, University of Michigan
Anthony Gitter
0000-0002-5324-9833
·
agitter
·
anthonygitter
Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison; Morgridge Institute for Research
Casey S. Greene
0000-0001-8713-9213
·
cgreene
Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania
Simina M. Boca
0000-0002-1400-3398
·
SiminaB
Innovation Center for Biomedical Informatics, Georgetown University Medical Center; Department of Oncology, Georgetown University Medical Center; Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center; Cancer Prevention and Control Program, Lombardi Comprehensive Cancer Center
Timothy J. Triche, Jr.
0000-0001-5665-946X
·
ttriche
Center for Epigenetics, Van Andel Research Institute; Department of Translational Genomics, Keck School of Medicine, University of Southern California
Thiago Britto-Borges
0000-0002-6218-4429
·
tbrittoborges
Section of Bioinformatics and Systems Cardiology, Department of Internal Medicine III and Klaus Tschira Institute for Integrative Computational Cardiology, University Hospital Heidelberg
Elana J. Fertig
0000-0003-3204-342X
·
ejfertig
Department of Oncology, Department of Biomedical Engineering, Department of Applied Mathematics and Statistics, Johns Hopkins University
Michael D. Kessler
0000-0003-1258-5221
·
mdkessler
Department of Oncology, Johns Hopkins University
Alexandra J. Lee
0000-0002-0208-3730
·
ajlee21
Genomics and Computational Biology Graduate Program, University of Pennsylvania; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania
Beth Signal
·
betsig
Climate Change Cluster, University of Technology Sydney
Juan Jose Carmona
0000-0002-3029-4658
·
juancarmona
·
jcveritas
Philips Healthcare; Philips Research North America
Machine learning is a modern approach to problem-solving and task automation. In particular, machine learning is concerned with the development and application of algorithms that learn how to recognize patterns in data and utilize these for predictive modeling, as opposed to having domain experts develop rules for prediction tasks manually. Artificial neural networks are a particular class of machine learning algorithms and models that evolved into what we now describe as “deep learning” – that is, neural networks with many layers (and algorithms that make them perform well). These neural networks comprise artificial neurons arranged into layers and are modeled after the human brain, even though the building blocks and learning algorithms may differ [1]. Each layer receives input from previous layers (the first of which represents the input data) and then transmits a weighted version of its input to the subsequent layer. Thus, the process of “training” a neural network is the tuning of the layers’ weights to minimize a cost or loss function that serves as a differentiable surrogate of the prediction error. Deep learning utilizes such multi-layered artificial neural networks (hence the term “deep”) and, given the computational advances made in the last decade, can now be applied to massive data sets and in innumerable contexts. In many circumstances, deep learning can learn more complex relationships and make more accurate predictions than other methods. Therefore, deep learning has become its own subfield of machine learning. In the context of biological research, it has been increasingly used to derive novel insights from high-dimensional biological data [2]. For example, deep learning has been used to predict protein-drug binding kinetics [3], to identify the lab-of-origin of synthetic DNA [4], and to uncover the facial phenotypes of genetic disorders [5].
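To make these concepts concrete, the following minimal sketch (using the Keras API, with placeholder NumPy arrays `X` and `y` standing in for a real dataset) builds a network with a single hidden layer and “trains” it by tuning its weights to minimize a loss function. It is intended purely to illustrate the vocabulary above, not as a recommended model.

```python
# A minimal sketch of defining and training a small neural network with Keras.
# X and y are placeholder arrays standing in for a real dataset.
import numpy as np
from tensorflow import keras

X = np.random.rand(500, 20)            # 500 samples with 20 features each
y = np.random.randint(0, 2, size=500)  # binary labels

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),  # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),                   # output layer
])

# Training tunes the layer weights to minimize a differentiable loss function
# (here, binary cross-entropy) that serves as a surrogate of the prediction error.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
```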
General resources communicating best practices to the scientific community broadly and the biological community specifically are scarce, and any resources that do exist tend to become obsolete rapidly given how active and specialized the deep learning field is. In addition, the lack of established standards or concise recommendations for the application of deep learning to biological questions further hinders newcomers seeking to use state-of-the-art deep learning in their research.
To make deep learning more accessible to biological researchers, we solicited input from a community of researchers with varied biological and deep learning interests. These individuals collaboratively contributed to this manuscript’s writing using the GitHub version control platform [6] and the Manubot manuscript generation toolset [7]. The goal was to articulate a practical, accessible, and concise set of guidelines and suggestions for biologically oriented researchers to follow when using deep learning (Figure 1).
In the course of our discussions, several themes became clear: the importance of understanding and applying machine learning fundamentals [8] as a baseline for utilizing deep learning, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by deep learning, among others. The major similarities between deep learning and traditional computational methods also became apparent. Although deep learning is a distinct subfield of machine learning, it is still a subfield. It is subject to the many limitations inherent to machine learning, and many best practices for machine learning also apply to deep learning. In addition, as with all computational methods, deep learning should be applied in a systematic manner that is reproducible and rigorously tested. Ultimately, the tips we collate range from high-level guidance to best practices for implementation. It is our hope that they will provide actionable, deep learning-specific instruction for both new and experienced deep learning practitioners. By making deep learning more accessible for use in biological research, we aim to improve the overall usage and reporting quality of deep learning in the literature and enable increasing numbers of researchers to effectively and accurately utilize these state-of-the-art techniques.
In recent years, the number of projects and publications implementing deep learning in biology has risen tremendously [9,10,11]. Given deep learning’s usefulness across a range of scientific questions and data modalities, it may seem as though it is a panacea for nearly all modeling problems. Neural networks that underpin deep learning models are, in fact, universal function approximators and are therefore theoretically capable of learning the functions that relate almost any input and output variables [12,13]. However, deep learning is not suited to every modeling situation. Deep learning’s suitability for a given problem is primarily limited by the training demands of neural network models, which require significant amounts of data, computing power, and programming as well as modeling expertise.
In the areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of high-quality data may be available. However, areas of biology that rely on manual data collection may not possess enough data to train and apply deep learning models effectively. Though there are methods that try to increase the amount of training data, such as data augmentation (in which existing data is slightly manipulated in an attempt to yield “new” samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [14], these methods cannot overcome substantial data shortages.
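As an illustration of one such approach, the sketch below applies simple image data augmentation with Keras’ `ImageDataGenerator`; the arrays `x_train` and `y_train` are hypothetical stand-ins for a real image dataset, and the specific perturbations are illustrative only.

```python
# A sketch of basic image data augmentation using Keras.
# x_train (images) and y_train (labels) are placeholder arrays.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_train = np.random.rand(100, 64, 64, 1)  # placeholder grayscale images
y_train = np.random.randint(0, 2, 100)    # placeholder binary labels

augmenter = ImageDataGenerator(
    rotation_range=15,       # small random rotations
    width_shift_range=0.1,   # small horizontal shifts
    height_shift_range=0.1,  # small vertical shifts
    horizontal_flip=True,    # random mirroring
)

# Each batch drawn from the generator is a slightly manipulated version of the
# original images, effectively yielding "new" samples during training.
augmented_batches = augmenter.flow(x_train, y_train, batch_size=32)
# model.fit(augmented_batches, epochs=10)  # used in place of the raw arrays
```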
In the fields of computer vision and natural language processing, deep neural networks are routinely trained on sample sizes ranging from hundreds of thousands to millions of training examples. Datasets of this size are often not available in many biological contexts. Still, it has been found that under certain circumstances, deep learning can already be considered for datasets with at least one hundred samples per class [15]. However, it is best suited to datasets that contain orders of magnitude more samples.
Training deep learning models is also very demanding and often requires extensive computing infrastructure and patience to achieve state-of-the-art performance [16]. In some deep learning contexts, such as generating human-like text, state-of-the-art models have over one hundred billion parameters [17] and require very costly and time-consuming training procedures [18]. Though deep learning applications in biology rarely require this much training, they can require computational resources beyond those available on consumer-grade devices such as laptops or office desktops. Specialized hardware such as discrete graphics processing units (GPUs) and custom deep learning accelerators can dramatically reduce the time and cost required to train models [11]. Still, this hardware is not universally accessible, and cloud-based rentals add additional cost and complexity. Despite these limiting factors, such specialized hardware is likely to become more broadly available as deep learning grows in popularity (for example, recent-generation iPhones already include such hardware). In contrast to the large-scale computational demands of deep learning, traditional machine learning models can often be trained on laptops (or even on a $5 computer [19]) in seconds to minutes. Therefore, due to this enormous disparity in resource demand alone, traditional machine learning approaches may still prove desirable in various biological applications.
Beyond requiring more data and computational capacity, building and training deep learning models often requires more expertise than training traditional machine learning models. There are currently several competing programming frameworks for deep learning, such as TensorFlow [20] and PyTorch [21], that are widely used across academic research fields and industrial applications. While these frameworks allow users to create and deploy entirely novel model architectures, this flexibility combined with the rapid development of the deep learning field has resulted in large and complex frameworks that can be daunting to new users. For readers new to software development but experienced in biology, gaining computational skills while interfacing with such complex industrial-grade tools can be a prohibitive challenge. Conversely, traditional machine learning methods are generally more straightforward to implement. There are currently more tools for automating the model selection and training process for traditional machine learning models than for deep learning models. For example, automated machine learning (AutoML) tools, such as TPOT [22] and Turi Create [23], are able to test and optimize multiple machine learning models automatically, and can allow users to achieve competitive performance with only a few lines of code (see the sketch below). Thankfully, there are efforts underway to extend these and other automation frameworks to reduce the expertise required to build and use deep learning models. For example, TPOT, Turi Create, and AutoKeras [24] are already capable of abstracting away much of the programming required for “standard” deep learning tasks, and high-level interfaces such as Keras [25] and Fastai [26] make it increasingly straightforward to design and test custom deep learning architectures. In the future, projects such as these are likely to make deep learning increasingly accessible to a much wider swath of researchers.
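As a rough sketch of what such an AutoML workflow can look like, the example below uses TPOT to search over candidate pipelines automatically. The dataset is a synthetic placeholder and the search settings are illustrative rather than recommended.

```python
# A sketch of automated model selection and optimization with TPOT.
# The dataset below is a synthetic placeholder for a real problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = TPOTClassifier(generations=5, population_size=20, random_state=42)
automl.fit(X_train, y_train)            # searches over many candidate pipelines
print(automl.score(X_test, y_test))     # performance of the best pipeline found
automl.export("best_pipeline.py")       # export the winning pipeline as plain code
```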
Despite these limitations, deep learning is strongly indicated over traditional machine learning for specific research questions and problems. In general, these include problems that feature hidden patterns across the data, complex relationships, and interrelated variables. Problems in computer vision and natural language processing often exhibit these very features, which helps explain why these areas were some of the first to experience significant breakthroughs during the recent deep learning revolution [27]. As long as large amounts of accurate and labeled data are available, applications to areas of life sciences with related data characteristics, such as genetic medicine [28], radiology [29], microscopy [30], and pharmacovigilance [31], are similarly likely to benefit from deep learning techniques. For example, Ferreira et al. used deep learning to recognize individual birds from images [32] despite this problem being very difficult historically. By combining automatic data collection using RFID tags with data augmentation and transfer learning, the authors were able to use deep learning to achieve 90% accuracy across several species. Another research area where deep learning excels is generative modeling, where new samples are created based on the training data [33]. One other area of machine learning that has been revolutionized by deep learning is reinforcement learning, which is concerned with training agents to interact with an environment [34]. Overall, initial evaluation as to whether similar problems (including analogous ones in other domains) have been solved successfully using deep learning can inform researchers about the potential for deep learning to address their needs.
On the other hand, depending on the amount and type of data available and the nature of the problem set, deep learning may not always be able to outperform conventional methods. As an illustration, Rajkomar et al. [35] found that simpler baseline models achieved performance comparable with deep learning in several clinical prediction tasks using electronic health records. Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [36]. The researchers found that while well-tuned deep learning approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform deep learning as the dataset’s noise increases. Similarly, Chen et al. [37] tested deep learning and a variety of traditional machine learning methods such as logistic regression and random forests on five different clinical datasets. They found that traditional methods matched or exceeded the accuracy of the deep learning model in all cases despite requiring an order of magnitude less training time.
In conclusion, deep learning should only be used after a robust consideration of its strengths and weaknesses for the problem at hand. Once deep learning has been chosen as a potential solution, practitioners should still consider traditional methods as performance baselines and use the scientific method to compare the performance of deep learning to that of traditional methods, as outlined in the following tips.
Deep learning requires practitioners to consider a larger number and variety of tuning parameters (that is, algorithmic settings) than more traditional machine learning methods. These settings are often called hyperparameters, and their extensiveness can make it easy to fall into the trap of performing an unnecessarily convoluted analysis. Hence, before applying deep learning to a given problem, we highly recommend implementing a simpler model with fewer hyperparameters at the beginning of each study. Such models include logistic regression, random forests, k-nearest neighbors, naive Bayes, and support vector machines, and using them can help to establish baseline performance expectations. While performance baselines available from existing literature can also serve as helpful guides, an implementation of a simpler model that uses the same software framework as planned for deep learning can greatly help with assessing the correctness of data processing steps, performance evaluation pipelines, resource requirement estimates, and computational performance estimates. Furthermore, in some cases, it can even be useful to combine simpler baseline models with deep neural networks, as such hybrid models can improve generalization performance, model interpretability, and confidence estimation [38,39].
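As a minimal sketch of what such a baseline might look like in practice, the example below cross-validates a logistic regression model with scikit-learn; the synthetic dataset is a placeholder for the features and labels of the task at hand.

```python
# A sketch of establishing a simple baseline before turning to deep learning.
# The synthetic dataset is a placeholder for the task of interest.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

baseline = LogisticRegression(max_iter=1000)
scores = cross_val_score(baseline, X, y, cv=5)  # 5-fold cross-validated accuracy
print(f"Baseline accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```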
Another potential pitfall arises from comparing the performance of baseline conventional models trained with default settings with the performance of deep learning models that have undergone rigorous tuning and optimization. Since conventional off-the-shelf machine learning algorithms (for example, support vector machines and random forests) are also likely to benefit from hyperparameter tuning, such incongruity prevents the comparison of equally optimized models and can lead to false conclusions about model efficacy. Hu and Greene [40] discuss this under the umbrella of what they call the “Continental Breakfast Included” effect, and they describe how the unequal tuning of hyperparameters across different learning algorithms can especially skew evaluation when the performance of an algorithm varies substantially with modest changes to its hyperparameters. Therefore, practitioners should tune the settings of both traditional machine and deep learning-based methods before making claims about relative performance differences, as performance comparisons among machine learning and deep learning models are only informative when the models are equally well optimized.
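To make the point about equal optimization concrete, the sketch below tunes a conventional baseline (here a random forest) over a small hyperparameter grid before any comparison with a tuned deep learning model is made. The data and grid values are purely illustrative.

```python
# A sketch of tuning a conventional baseline rather than using its default settings.
# The synthetic dataset and the hyperparameter grid are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"n_estimators": [100, 500], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)                    # the tuned baseline for comparison
print(search.best_params_, search.best_score_)  # report the optimized settings
```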
To sum this tip up, practitioners are encouraged to create and fully tune several traditional models and standard pipelines before implementing a deep learning model.
Correctly training deep neural networks is a non-trivial process, as there are many different options and potential pitfalls at every stage. To get good results, one must often train networks across a wide range of different hyperparameter settings. Such training can be made more difficult by the demanding nature of these deep networks, which often require extensive time investments into tuning and computing infrastructure to achieve state-of-the-art performance [16]. Furthermore, this experimentation is often noisy, necessitating increased repetition and exacerbating the organizational challenges inherent to deep learning. On the whole, all code, random seeds, parameters, and results must be carefully corralled using general coding standards and best practices (for example, version control [41] and continuous integration [42]) to be reproducible and interpretable [43,44]. For application-based research, this organization is also fundamental to the efficient sharing of research work and the ability to keep models up to date as new data becomes available.
One specific reproducibility pitfall that is often missed in applying deep learning is the default use of non-deterministic algorithms by CUDA/cuDNN backends when using GPUs. That is, the CUDA/cuDNN architectures that facilitate the parallelized computing powering state-of-the-art deep learning often use algorithms by default that produce different outcomes from iteration to iteration. Therefore, achieving reproducibility in this context requires explicitly requesting deterministic algorithms (which are typically available within deep learning libraries); this is distinct from setting random seeds, which achieve reproducibility by controlling pseudorandom deterministic procedures such as shuffling and initialization [45].
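For example, in PyTorch this typically involves a handful of settings along the following lines; the exact flags differ between library versions, so this sketch should be checked against the documentation of the framework actually used.

```python
# A sketch of settings commonly needed for reproducible PyTorch runs on GPUs.
# Seeds control pseudorandom procedures such as shuffling and initialization;
# the remaining flags request deterministic CUDA/cuDNN kernels, which are not the default.
import os
import random
import numpy as np
import torch

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)                               # seeds CPU and CUDA generators

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some CUDA operations
torch.backends.cudnn.deterministic = True          # use deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False             # disable non-deterministic autotuning
torch.use_deterministic_algorithms(True)           # raise an error on non-deterministic ops
```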
Similar to the suggestions above about starting with simpler models, try to start with a relatively small network and then increase the size and complexity as needed. This can help prevent practitioners from wasting significant time and resources on running highly complex models that feature numerous unresolved problems. Again, beware of the choices made implicitly (that is, by default settings) by deep learning libraries (for example, selection of optimization algorithm), as these seemingly trivial specifics can have significant effects on model performance. For example, adaptive optimization methods often lead to faster convergence during training but may result in worse generalization performance on independent datasets [46]. These nuanced elements are easy to overlook, but it is critical to consider them carefully and to evaluate their potential impact.
In short, use smaller and simpler networks to enable faster prototyping and follow general software development best practices to maximize reproducibility.
Having a well-defined scientific question and a clear analysis plan is crucial for carrying out a successful deep learning project. Just as it would be inadvisable to set foot in a laboratory and begin experiments without having a defined endpoint, a deep learning project should not be undertaken without defined goals. Foremost, it is important to assess whether a dataset exists that can answer the biological question of interest using a deep learning-based approach. If so, obtaining this data (and associated metadata), and reviewing the study protocol, should be pursued as early on in the project as possible. This can help to ensure that the data are as expected and can prevent the wasted time and effort that occur when issues are discovered later on in the analytic process. For example, a publication or resource might purportedly offer an appropriate dataset that is found to be inadequate upon acquisition. The data may be unstructured when it is supposed to be structured, crucial metadata such as sample stratification might be missing, or the usable sample size may be different than expected. Any of these data issues might limit a researcher’s ability to use deep learning to address the biological question at hand or might otherwise require adjustment before it can be used. Data collection should also be carefully documented, or a data collection protocol should be created and specified in the project documentation.
Information about the resources used, download dates, and dataset versions is critical to preserve. Doing so will help to minimize operational confusion and will increase the reproducibility of the analysis. Best practices for reproducibility also include sharing the collected dataset and metadata upon publication of the study, ideally in a public dataset repository if there are no ethical or privacy concerns and no copyright restrictions. While recommended and recognized dataset repositories may differ across disciplines, a list of general dataset repositories includes the Dryad repository [47] (https://datadryad.org/), Figshare [48] (https://figshare.com), Zenodo [49] (https://zenodo.org), and the Open Science Framework [50] (https://osf.io). In addition, Gundersen et al. [51] provide useful checklists summarizing best data sharing practices for reproducible research and open science.
Once the dataset is obtained, it is important to learn why and how the data were collected before beginning analysis. The standardized metadata that exists in many fields can help with this (for example, see [52]). In addition, if at all possible, we recommend consulting with a subject matter expert who has experience with the type of data being used. Doing so will minimize guesswork and is likely to increase the success rate of a deep learning project. For example, one might presume that data collected to test the impact of an intervention derives from a randomized controlled trial. However, this is not always the case, as ethical or practical concerns often necessitate an observational study design that features prospectively or retrospectively collected data. In order to ensure similar distributions of important characteristics across study groups in the absence of randomization, such a study may have selected individuals in a fashion that best matches attributes such as age, gender, or weight. Passively collected datasets can have their own peculiarities, and other study designs can include samples that originate from the same study site, the oversampling of ethnic groups or zip codes, or sample processing differences. Such information is critical to accurate data analysis, and so it is imperative that practitioners learn about study design assumptions and data specificities prior to performing modeling. Other study design considerations that should not be overlooked include knowing whether a study involves biological or technical replicates or both. For example, the existence in a dataset of samples collected from the same individuals at different time points can have significant effects on analyses that make assumptions about sample size and independence (that is, non-independence can lower the effective sample size). Another potential issue is the existence of systematic biases, which can be induced by confounding variables and can lead to artifacts or so-called “batch effects.” Consequently, models may learn to rely on the correlations that these systematic biases underpin, even though they are irrelevant to the scientific context of the study. This can lead to misguided predictions and misleading conclusions [53]. Unsupervised learning and other exploratory analyses can help identify such biases in these datasets before applying a deep learning model.
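As one simple example of such an exploratory check, the sketch below projects samples onto their first two principal components and colors them by a suspected confounder (for example, study site); `X` and `batch` are hypothetical placeholders for a sample-by-feature matrix and per-sample batch labels.

```python
# A sketch of an exploratory check for batch effects before modeling: project the
# samples with PCA and color them by a suspected confounder such as study site.
# X (samples x features) and batch (per-sample site labels) are placeholders.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 1000)                 # placeholder sample-by-feature matrix
batch = np.repeat(["site_A", "site_B"], 100)  # placeholder per-sample site labels

coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
for site in np.unique(batch):
    mask = batch == site
    plt.scatter(coords[mask, 0], coords[mask, 1], label=str(site), alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()  # samples clustering by site rather than by biology suggests a batch effect
```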
Overall, practitioners should thoroughly study their data and understand its context and peculiarities before moving on to performing deep learning.
While certain best practices have been established by the research community [54], architecture design choices remain largely problem-specific and empirical, requiring extensive experimentation. Furthermore, as deep learning is a quickly evolving field, many recommendations are often short-lived and are frequently replaced by newer insights supported by recent empirical results. This is further complicated by the fact that many recommendations do not generalize well across different problems and datasets. Therefore, unfortunately, choosing how to represent data and design an architecture is closer to an art than a science. That said, there are some general principles that are useful to follow when experimenting.
First and foremost, use your knowledge of the available data and your question to inform your data representation and architectural design choices. For example, if the dataset is an array of measurements with no natural ordering of inputs (such as gene expression data), multilayer perceptrons (MLPs) may be effective. These are the most basic type of neural network, and they are able to learn complex non-linear relationships across the input data despite their relative simplicity. Similarly, if the dataset is composed of images, convolutional neural networks (CNNs) are a good choice because they emphasize local structures and adjacency within the data. CNNs may also be a good choice for learning on sequences, as recent empirical evidence suggests that they can outperform canonical sequence learning techniques such as recurrent neural networks (RNNs) and the closely related long short-term memory (LSTM) networks [55]. Accessible high-level overviews of these different neural network architectures are provided in [56] and [57].
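For instance, a small image classifier along the following lines illustrates how convolutional layers encode the assumption of local structure; the input shape and number of output classes are placeholders rather than recommendations.

```python
# A sketch of matching the architecture to the data type: a small convolutional
# network for image inputs, written with Keras. Shapes and sizes are placeholders.
from tensorflow import keras

cnn = keras.Sequential([
    keras.layers.Conv2D(16, kernel_size=3, activation="relu", input_shape=(64, 64, 1)),
    keras.layers.MaxPooling2D(),  # convolution and pooling exploit local structure
    keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),  # e.g. ten output classes
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```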
Deep learning models typically benefit from increasing the amount of labeled data on which to train. Large amounts of data help to avoid overfitting and increase the likelihood of achieving top performance on a given task. If there is not enough data available to train a well-performing model, consider using transfer learning. In transfer learning, a model whose weights were generated by training on another dataset is used as the starting point for training [58]. Transfer learning is most useful when the pre-training and target datasets are of similar nature [58]. For this reason, it is important to search for similar datasets that are already available. These can potentially be used to increase the size of the training set or for pre-training and subsequent fine-tuning on the target data. However, even when this assumption does not hold, transferring features can still improve model performance compared with random feature initialization. For example, Rajkomar et al. showed advantages of ImageNet-pretraining [59] for a model that is applied to grayscale medical image classification [60]. In addition to, or as an alternative to, pre-training models on larger datasets for transfer learning, one may be able to obtain pre-trained models from public repositories, such as Kipoi [61] for genomics models. Moreover, learned features could be helpful even when a pre-training task is different from a target task [62]. Recently, the concept of self-supervised learning, which is closely related to pre-training and transfer learning, has seen an increase in popularity [63]. Self-supervised learning leverages large amounts of unlabeled data and uses naturally available information as labels for supervised learning. Thus, self-supervised learning is sometimes also described as autonomous supervised learning. Using self-supervised learning, a model can be pre-trained on a related task before it is trained on the target task. Another related approach is multi-task learning, which simultaneously trains a network for multiple separate tasks that share features. In fact, multi-task learning can be used separately or even in combination with transfer learning [64].
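A minimal sketch of this idea, assuming an image-based target task, is shown below: an ImageNet-pretrained network is reused as a frozen feature extractor and only a small task-specific output layer is trained. The dataset, input size, and number of classes are placeholders.

```python
# A sketch of transfer learning with Keras: reuse an ImageNet-pretrained network as
# a frozen feature extractor and train only a new task-specific output layer.
from tensorflow import keras

base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained weights

model = keras.Sequential([
    base,
    keras.layers.Dense(2, activation="softmax"),  # new head for a two-class target task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_target, y_target, epochs=5)  # fine-tune on the (smaller) target dataset
```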
This tip can be distilled into two main action points: first, base the network’s architecture on knowledge of the problem and, second, take advantage of similar existing data or pre-trained deep learning models.
Given at least one hidden layer, a non-linear activation function, and a large number of hidden units, multi-layer neural networks can approximate arbitrary continuous functions that relate input and output variables [13,65]. Deeper architectures that feature additional hidden layers and an increasing number of overall hidden units and learnable weight parameters (the so-called increasing “capacity” of neural networks) allow for solving increasingly complex problems. However, this increased capacity results in many more parameters to fit and hyperparameters to tune, which can pose additional challenges during model training. In general, one should expect to systematically evaluate the impact of numerous hyperparameters when applying deep neural networks to new data or challenges. Hyperparameters typically manifest as choices or settings of optimization algorithms, loss function, learning rate, activation functions, number of hidden layers and hidden units, size of the training batches, weight initialization schemes, and seeds for pseudo-random number generators used for dataset shuffling and weight initialization. Moreover, additional hyperparameters are introduced by common techniques that facilitate the training of deeper architectures. These include parameter norm penalties (typically in the form of \(L^2\) regularization), dropout [66], and batch normalization [67], which can reduce the effect of the so-called vanishing or exploding gradient problem when working with deep neural networks.
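The sketch below illustrates, in Keras, where several of these hyperparameters and techniques appear in practice; the specific values (layer sizes, penalty strength, dropout rate, learning rate) are placeholders that would normally be tuned.

```python
# A sketch showing where common hyperparameters and regularization techniques
# appear in a Keras model definition. All values shown are placeholders to tune.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(100,),
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 parameter norm penalty
    layers.BatchNormalization(),  # batch normalization stabilizes deeper networks
    layers.Dropout(0.5),          # dropout randomly silences units during training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # learning rate is itself a hyperparameter
              loss="binary_crossentropy")
```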
This wide array of potential parameters can make it difficult to evaluate the extent to which neural network methods are well suited to solving a task, as it can be unclear to practitioners whether previous successful applications were the result of interactions between unique data attributes and specific hyperparameter settings. Similar to the Continental Breakfast Included effect discussed above, a lack of clarity about how extensively hyperparameters were tested and how they were chosen can affect method developers as they attempt to compare techniques. This effect also has implications for those seeking to use existing deep learning methods, as performance estimates from deep neural networks are often provided only after tuning. For users of deep neural networks, the implication is that attaining performance numbers matching those reported in publications is likely to require significant effort spent on time-consuming hyperparameter optimization.
Ultimately, to get the best performance from your model, be sure to systematically optimize your hyperparameters on your training dataset, as introduced in the next section.
Overfitting is a challenge inherent to machine learning in general, and is one of the most significant challenges you’ll face when applying deep learning specifically. Overfitting occurs when a model fits patterns in the training data so closely that it incorporates non-generalizable noise or scientifically irrelevant perturbations into the relationships it learns. In other words, the model fits patterns that are overly specific to the data it is training on rather than learning general relationships that hold across similar datasets. This subtle distinction is made clearer by seeing what happens when a model is tested on data to which it was not exposed during training: just as a student who memorizes exam materials struggles to correctly answer questions for which they have not studied, a machine learning model that has overfit to its training data will perform poorly on unseen test data. Deep learning models are particularly susceptible to overfitting due to their relatively large number of parameters and associated representational capacity. Just as some students may have greater potential for memorization, deep learning models seem more prone to overfitting than machine learning models with fewer parameters.
In general, one of the most effective ways to combat overfitting is to detect it in the first place. One way to do this is to split the main dataset being worked on into three independent parts: a training set, a tuning set (also commonly called a validation set in the machine learning literature), and a test set. These three partitions allow us to optimize models by iterating between model learning on the training set and hyperparameter evaluation on the tuning set without affecting the final model assessment on the test set. That is, the data used for testing should be “locked away” and used only once to evaluate the final model after all training and tuning steps are completed. A researcher can then use the model’s performance on the independent test data as a measure of how overfit (i.e. non-generalizable) the model is. This type of approach is necessary for evaluating the generalizability of models without the biases that can arise from learning and testing on the same data [68,69]. While a slight drop in performance from the training set to the test set is normal, a significant drop is a clear sign of overfitting (see Figure 2 for a visual demonstration of an overfit model that performs poorly on test data).
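A minimal sketch of such a three-way partition with scikit-learn is shown below; `X` and `y` are hypothetical placeholders, and the 60/20/20 split is illustrative rather than prescriptive.

```python
# A sketch of splitting a dataset into training, tuning, and single-use test sets.
# X and y are placeholder arrays; the split proportions are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)        # placeholder features
y = np.random.randint(0, 2, 1000)   # placeholder labels

X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_tune, y_train, y_tune = train_test_split(X_rest, y_rest, test_size=0.25,
                                                    random_state=0)
# Result: 60% training, 20% tuning (validation), and 20% test data.
# The test set is "locked away" and used only once, for the final evaluation.
```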
If overfitting is an issue, there are a variety of techniques to reduce it, including data augmentation and various regularization techniques [70,71]. A complementary check, described by Chuang and Keiser, is to estimate the baseline level of memorization that is occurring by training on data whose labels have been randomly shuffled. By comparing the performance achieved with the shuffled data to that achieved with the actual data [72], a practitioner can identify overfitting: a model that performs no better on the real data than on the shuffled data suggests that its apparent predictive capacity is not due to data-driven signal. One important caveat when working with partitioned data is the need to apply transformation and normalization procedures equally to all datasets. The parameters required for such procedures (for example, quantile normalization, a common standardization method when analyzing gene-expression data) should only be derived from the training data, and not from the tuning or test data. Additionally, many conventional metrics for classification (e.g. area under the receiver operating characteristic curve or AUROC) have limited utility in cases of extreme class imbalance [73]. Therefore, model performance should be evaluated with a carefully chosen panel of relevant metrics that make minimal assumptions about the composition of the testing data [74].
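Two of these checks are sketched below: estimating the memorization baseline via label shuffling, and deriving normalization parameters from the training data only. The data splits and the model object are hypothetical placeholders.

```python
# A sketch of two checks described above. The data splits and `model` are placeholders.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(600, 20)       # placeholder training features
X_tune = np.random.rand(200, 20)        # placeholder tuning features
X_test = np.random.rand(200, 20)        # placeholder test features
y_train = np.random.randint(0, 2, 600)  # placeholder training labels

# (1) Memorization baseline: shuffling labels destroys any real signal, so performance
#     achieved here reflects memorization rather than data-driven learning.
y_shuffled = np.random.permutation(y_train)
# model.fit(X_train, y_shuffled)  # compare this performance to the unshuffled run

# (2) Derive normalization parameters from the training data only, then apply the
#     identical transformation to the tuning and test sets without refitting.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_tune_std = scaler.transform(X_tune)
X_test_std = scaler.transform(X_test)
```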
When working with biological and medical data, one must also carefully consider potential sources of bias and/or non-independence when defining training and test sets. For example, a deep learning model for pneumonia detection in chest X-rays appeared to perform well within the hospitals providing the training data, but then failed to generalize to other hospitals [75]. This resulted from the deep learning model picking up on signal related to which hospital the images were from, and represents a type of artifact or “batch effect” against which practitioners must remain vigilant. When dealing with sequence data, holding out test data that are evolutionarily related or that share structural homology with the training data can result in overfitting that is hard to detect due to the inherent relatedness of the partitioned data [76]. In such situations, simply holding out test data selected from a random partition of the training data can be insufficient. Again, the best remedy for identifying confounding variables is to know your data and to test models on truly independent data.
In essence, practitioners should split data into training, tuning, and single-use testing sets to assess the performance of the model on data that can provide a reliable estimate of its generalization performance. Furthermore, be cognizant of the danger of skewed or biased data artificially inflating accuracy.
While model interpretability is a broad concept, in much of the machine learning literature (including in our guidelines), it refers to the ability to identify the discriminative features that influence or sway the predictions. In certain cases, the goal behind interpretation is to understand the underlying data generating processes and biological mechanisms [77]. In other cases, the goal is to understand why a model made the prediction that it did for a specific example or set of examples. Machine learning models vary widely in terms of interpretability: some are fully transparent while others are considered “black-boxes” that make predictions with little ability to examine why. Logistic regression and decision tree models are generally considered interpretable [78]. In contrast, deep neural networks are often considered among the most difficult to interpret because they can have many parameters and non-linear relationships.
Knowing which of the input variables influences a model’s predictions, and potentially in what ways, can help with the application or extrapolation of machine learning models. This is particularly important in biomedicine, where subsequent decision making often requires human input, and where models are employed with the hope of better understanding why relationships exist in the first place. Furthermore, while prediction rules can be derived from high-throughput molecular datasets, most affordable clinical tests still rely on lower-dimensional measurements of a limited number of biomarkers. Therefore, it is often still unclear how to translate the predictive capacity of deep learning models that encompass non-linear relationships between countless input variables into clinically digestible terms. As a result, selecting which biomarkers to use for decision making remains an important modeling and interpretation challenge. In fact, many authors attribute a lower uptake of deep learning tools in healthcare to interpretability challenges [79,80]. Nonetheless, strategies to interpret both machine learning and deep learning models are rapidly emerging, and the literature on the topic is growing exponentially [81]. Instead of recommending specific methods for either deep learning-specific or general-purpose model interpretation, we suggest consulting [82], which is freely available and continually updated.
While active research into model interpretability is enabling increased interpretation of models with many parameters and non-linear relationships, simpler traditional machine learning models often remain substantially easier to interpret. When deciding on a machine learning approach and model architecture, consider an interpretability versus accuracy tradeoff. A challenge in considering this tradeoff is that the extent to which one trades interpretability for accuracy depends on the problem itself. Recent research has also shown that the interpretability method whose local and global explanations are most reliable for a given model depends on the model’s predictive performance for a given problem and dataset [83]. As a rule of thumb, when the features provided to the model are already highly relevant to the task at hand, a simpler and more interpretable model that gives up only a little performance is often more useful. On the other hand, if features must be combined in complex ways to be meaningful for the task, the performance difference of a model capable of capturing that structure may outweigh the interpretability costs. An appropriate choice can only be made after careful consideration, which often includes estimating the performance of a simple linear model that serves as a baseline. In cases where models are learned from high-throughput datasets, a small subset of features in the dataset may be strongly correlated with the complex combination of the larger feature set learned by the deep learning model. This more limited set of features can then be used in a subsequent, simplified model to further enhance interpretability. This feature reduction can be essential when defining biomarker panels for use in clinical applications.
Once we have trained an accurate deep learning model, we often want to use it to deduce relationships and inform scientific findings. However, in doing this, we need to be careful to interpret the model’s predictions correctly. Given that deep learning models can be difficult to interpret intuitively, there is often a temptation to overinterpret the predictions in indulgent and/or inaccurate ways. In accordance with the classic statistical saying, “correlation doesn’t imply causation,” predictions by deep learning models don’t necessarily speak to certain causal relationships. While we generally know this and understand that accurately predicting an outcome doesn’t imply the learning of any causal mechanism, it can be easy to forget this lesson when the predictions are extremely accurate. A poignant example of this lesson is from work where authors evaluated the capacities of several models to predict the probability of death for patients with pneumonia admitted to an intensive care unit [84,85]. Unsurprisingly, the neural network model achieved the best predictive accuracy. However, after fitting a rule-based model to understand the relationships inherent to their data better, the authors discovered that the hospital data implied the rule “\(\text{HasAsthma}(x) \Rightarrow \text{LowerRisk}(x)\).” This rule contradicts medical understanding, as having asthma doesn’t make pneumonia better! Nonetheless, the data supported this rule, as pneumonia patients with a history of asthma tended to receive more aggressive care. The neural network had, therefore, also learned to make predictions according to this rule despite the fact that it has nothing to do with causality or mechanism. Therefore, it would have been disastrous to guide treatment decisions according to the predictions of the neural network, even though the neural network had high predictive accuracy.
To trust deep learning models, we must combine knowledge of the training data with inspection of the model. To move beyond fitting predictive models and towards the building of an understanding that can inform scientific deduction, we suggest working to disentangle a model’s internal logic by comparing data domains where models succeed to those in which they fail. By doing so, we can avoid overinterpreting models and view them for what they are: complex statistical models trained on high dimensional data.
While deep learning continues to be a powerful, transformative tool within life sciences research—spanning from basic biology and pre-clinical science to varied translational approaches and clinical studies—it is important to comment on some ethics-related considerations. For instance, despite the fact that deep learning methods are helping to increase medical efficiency through improved diagnostic capability and risk assessment, certain biases related to patient age, race, and gender may be inadvertently introduced into models [86]; as previously mentioned, deep learning practitioners may make use of datasets not representative of diverse populations and patient characteristics [87], thereby contributing to these problems (please refer to Tip 4).
Therefore, it is important to think thoroughly and cautiously about deep learning applications and their potential impact on persons and society—mindful of possible harms, injuries, injustices, and other types of wrongdoings. At a minimum, practitioners must ensure that, wherever relevant, their life sciences projects are fully compliant with local research governance/approval policies, legal requirements, institutional review board (IRB) policies, and any other relevant bodies and their standards. Moreover, we offer below three tangible, action-oriented recommendations to further empower and enrich deep learning researchers.
First, just as it is a best practice to keep a project-specific or programming-related issue tracker detailing known bugs and other technical issues, practitioners should get into the habit of keeping an active ethics register. In this register, ethical concerns can be raised, recorded, and resolved, exactly as software problems are triaged and fixed. Because projects using deep learning usually rely on writing code, an ethics register can be a part of the issue tracker in the version-control system for the software itself. By colocating the two, practitioners can operationalize the concept that ethical problems are “bugs” that must be resolved, not nice-to-haves that can be considered at some indefinite point in the future. For practitioners intending to distribute trained models, having an ethics register can also facilitate creating a model card [88], a short document specifying the domains in which the model’s performance was validated (for example, which model organism was used) and how the performance was benchmarked, along with known limitations and concerns. Second, to help foster a conscious ethics-oriented mindset, researchers should consider expanding journal clubs to include scholarly and popular articles detailing real-world ethics issues relevant to their scientific fields. This will help researchers to think more holistically and judiciously about their work and its implications. Third, we encourage individual- and team-level participation in professional societies [89] and other types of organizations [90] and events [91] related to the domains of AI and data ethics as well as bioethics. This will encourage a sense of community and intellectual engagement, keeping practitioners abreast of cutting-edge insights and emerging professional standards.
Furthermore, practitioners may encounter datasets that cannot be shared, such as ones for which there would be significant ethical or legal issues associated with their release [92]. Examples of such data include classified or confidential data, biological data related to trade secrets, medical records, or other personally identifiable information [93]. While deep learning models can capture information-rich abstractions of multiple features of the data during the training process (which represents one of their great strengths), these features may be more prone to leak the data on which they were trained if the model is shared or allowed to be queried with arbitrary inputs [94,95]. In other words, the complex relationships learned about the input data can potentially be used to infer characteristics about the original dataset. This means that the strengths that imbue deep learning with its great predictive capacity also raise the level of risk surrounding data privacy. Therefore, while there is tremendous promise for deep learning techniques to extract information that cannot readily be captured by traditional methods [96], it is imperative not to share models trained on sensitive data. This also holds true for certain traditional machine learning methods that learn by capturing specific details of the full training data (for example, k-nearest neighbors models).
Techniques to train deep neural networks without sharing unencrypted access to data are being advanced through implementations of homomorphic encryption, which serves to enable equivalent prediction on data that is encrypted end to end [97,98]. Privacy-preserving techniques [99], such as differential privacy [100,101,102], can help to mitigate risks as long as the assumptions underlying these techniques are met. These methods provide a path towards a future where trained models and their predictions can be shared, but more software development and theoretical advances will be required to make these techniques easy to apply correctly in many settings. Unless you use these techniques, don’t share the weights or arbitrary access to the predictions of models trained on sensitive data.
Collectively, our manuscript is focused on the promotion of practical tips distilled from cutting-edge insights and evolving professional standards to advance the efficient and optimal application of deep learning within research. It is evident that some of our points (see Tips 7, 8, 9, and 10) are intimately linked to safeguarding against key risks: for example, introduction/perpetuation of bias, overinterpretation/misinterpretation of models, poor generalizability, and potential for harm unto others—which can have a mix of ethical, legal, and social implications. If leveraged in ethical and responsible ways, deep learning techniques have the potential to add value within a diverse array of research and healthcare contexts, as these techniques have already shown remarkable capacity to meet or exceed the performance of human effort and/or older algorithms across fields and subdisciplines. Beyond merely achieving good predictive performance in certain tasks, deep learning has the potential to uncover high-impact biological and clinical insights, fundamentally driving research discoveries and delivery of new products to market. Yet, to realize its full potential, deep learning must be approached by all with genuine thoughtfulness, caution, and responsibility.
Through the tips and recommendations provided within this manuscript, we hope to encourage a prudent, vigilant community of computational practitioners, experimental biologists, and clinical scientists: colleagues who, before excitedly stitching together lines of code and datasets, first pause to think, dialogue, plan, and discern how their work might have far-reaching consequences with ethical dimensions. This holistic approach will help us to advance accountability, beneficence, and quality in science.
Thus, we aim not only to increase the accessibility of deep learning techniques within the life sciences, but also to improve upon the reproducibility and interpretability of high-quality deep learning research in the literature and scientific community—especially given that published findings, models, and datasets will be leveraged to yield innovative tools, services, and products in the marketplace. Indeed, we hope that these tips will serve as a powerful engine for promoting meaningful discussions, reflections, team learnings, and best practices to drive collaboration that fosters cutting-edge deep learning innovation, sensibly and responsibly.
The authors would like to thank Daniel Himmelstein and the developers of Manubot for creating the software that enabled the collaborative composition of this manuscript. We would also like to thank [TODO: insert the names of the contributors who don’t meet the standards for authorship] for their contributions to the discussions that comprised the initial stage of the drafting process.
1. Backpropagation and the brain
Timothy P. Lillicrap, Adam Santoro, Luke Marris, Colin J. Akerman, Geoffrey Hinton
Nature Reviews Neuroscience (2020-04-17) https://doi.org/ggsc7t
DOI: 10.1038/s41583-020-0277-3 · PMID: 32303713
2. Opportunities and obstacles for deep learning in biology and medicine
Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P. Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, … Casey S. Greene
Journal of The Royal Society Interface (2018-04-04) https://doi.org/gddkhn
DOI: 10.1098/rsif.2017.0387 · PMID: 29618526 · PMCID: PMC5938574
3. VAMPnets for deep learning of molecular kinetics
Andreas Mardt, Luca Pasquali, Hao Wu, Frank Noé
Nature Communications (2018-01-02) https://doi.org/gcvf62
DOI: 10.1038/s41467-017-02388-1 · PMID: 29295994 · PMCID: PMC5750224
4. Deep learning to predict the lab-of-origin of engineered DNA
Alec A. K. Nielsen, Christopher A. Voigt
Nature Communications (2018-08-07) https://doi.org/gd27sw
DOI: 10.1038/s41467-018-05378-z · PMID: 30087331 · PMCID: PMC6081423
5. Identifying facial phenotypes of genetic disorders using deep learning
Yaron Gurovich, Yair Hanani, Omri Bar, Guy Nadav, Nicole Fleischer, Dekel Gelbman, Lina Basel-Salmon, Peter M. Krawitz, Susanne B. Kamphausen, Martin Zenker, … Karen W. Gripp
Nature Medicine (2019-01-07) https://doi.org/czdm
DOI: 10.1038/s41591-018-0279-0 · PMID: 30617323
6. Benjamin-Lee/deep-rules GitHub repository
Benjamin Lee
GitHub (2018) https://github.com/Benjamin-Lee/deep-rules
7. Open collaborative writing with Manubot
Daniel S. Himmelstein, Vincent Rubinetti, David R. Slochower, Dongbo Hu, Venkat S. Malladi, Casey S. Greene, Anthony Gitter
PLOS Computational Biology (2019-06-24) https://doi.org/c7np
DOI: 10.1371/journal.pcbi.1007128 · PMID: 31233491 · PMCID: PMC6611653
8. Ten quick tips for machine learning in computational biology
Davide Chicco
BioData Mining (2017-12-08) https://doi.org/gdb9wr
DOI: 10.1186/s13040-017-0155-3 · PMID: 29234465 · PMCID: PMC5721660
9. Rise of Deep Learning for Genomic, Proteomic, and Metabolomic Data Integration in Precision Medicine
Dmitry Grapov, Johannes Fahrmann, Kwanjeera Wanichthanarak, Sakda Khoomrung
OMICS: A Journal of Integrative Biology (2018-10) https://doi.org/gfjjgn
DOI: 10.1089/omi.2018.0097 · PMID: 30124358 · PMCID: PMC6207407
10. Deep Learning Techniques: An Overview
Amitha Mathew, P. Amudha, S. Sivakumari
Advances in Intelligent Systems and Computing (2021) https://doi.org/ghjtg6
DOI: 10.1007/978-981-15-3383-9_54
11. Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence
Sebastian Raschka, Joshua Patterson, Corey Nolet
Information (2020-04-04) https://doi.org/ghjtg8
DOI: 10.3390/info11040193
12. Approximation by superpositions of a sigmoidal function
G. Cybenko
Mathematics of Control, Signals, and Systems (1989-12) https://doi.org/dp3968
DOI: 10.1007/bf02551274
13. Approximation capabilities of multilayer feedforward networks
Kurt Hornik
Neural Networks (1991) https://doi.org/dzwxkd
DOI: 10.1016/0893-6080(91)90009-t
14. Data Programming: Creating Large Training Sets, Quickly
Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré
arXiv (2016-05-25) https://arxiv.org/abs/1605.07723v3
15. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?
Junghwan Cho, Kyewook Lee, Ellie Shin, Garry Choy, Synho Do
arXiv (2016-01-11) https://arxiv.org/abs/1511.06348
16. Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel S. Emer
Proceedings of the IEEE (2017-12) https://doi.org/gcnp38
DOI: 10.1109/jproc.2017.2761740
17. Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, … Dario Amodei
arXiv (2020-07-24) https://arxiv.org/abs/2005.14165
18. Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell, Ananya Ganesh, Andrew McCallum
arXiv (2019-06-07) https://arxiv.org/abs/1906.02243
19. A Machine Learning Driven IoT Solution for Noise Classification in Smart Cities
Yasser Alsouda, Sabri Pllana, Arianit Kurti
arXiv (2018-09-05) https://arxiv.org/abs/1809.00238
20. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, … Xiaoqiang Zheng
arXiv (2016-03-17) https://arxiv.org/abs/1603.04467
21. PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, … Soumith Chintala
arXiv (2019-12-05) https://arxiv.org/abs/1912.01703
22. Automating Biomedical Data Science Through Tree-Based Pipeline Optimization
Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, Jason H. Moore
Lecture Notes in Computer Science (2016) https://doi.org/ggfptv
DOI: 10.1007/978-3-319-31204-0_9
23. apple/turicreate
Apple
GitHub (2021-01-14) https://github.com/apple/turicreate
24. Auto-Keras: An Efficient Neural Architecture Search System
Haifeng Jin, Qingquan Song, Xia Hu
arXiv (2019-03-27) https://arxiv.org/abs/1806.10282
25. Keras: the Python deep learning API https://keras.io/
26. Fastai: A Layered API for Deep Learning
Jeremy Howard, Sylvain Gugger
Information (2020-02-16) https://doi.org/ggmbms
DOI: 10.3390/info11020108
27. ImageNet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Communications of the ACM (2017-05-24) https://doi.org/gbhhxs
DOI: 10.1145/3065386
28. A Deep Learning Approach for Predicting Antidepressant Response in Major Depression Using Clinical and Genetic Biomarkers
Eugene Lin, Po-Hsiu Kuo, Yu-Li Liu, Younger W.-Y. Yu, Albert C. Yang, Shih-Jen Tsai
Frontiers in Psychiatry (2018-07-06) https://doi.org/gdv7r2
DOI: 10.3389/fpsyt.2018.00290 · PMID: 30034349 · PMCID: PMC6043864
29. Deep learning with convolutional neural network in radiology
Koichiro Yasaka, Hiroyuki Akai, Akira Kunimatsu, Shigeru Kiryu, Osamu Abe
Japanese Journal of Radiology (2018-03-01) https://doi.org/ggb3tf
DOI: 10.1007/s11604-018-0726-3 · PMID: 29498017
30. Deep learning microscopy
Yair Rivenson, Zoltán Göröcs, Harun Günaydin, Yibo Zhang, Hongda Wang, Aydogan Ozcan
Optica (2017-11-20) https://doi.org/gf8dhj
DOI: 10.1364/optica.4.001437
31. Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts
Anne Cocos, Alexander G Fiks, Aaron J Masino
Journal of the American Medical Informatics Association (2017-07) https://doi.org/gbp9nj
DOI: 10.1093/jamia/ocw180 · PMID: 28339747 · PMCID: PMC7651964
32. Deep learning‐based methods for individual recognition in small birds
André C. Ferreira, Liliana R. Silva, Francesco Renna, Hanja B. Brandl, Julien P. Renoult, Damien R. Farine, Rita Covas, Claire Doutrelant
Methods in Ecology and Evolution (2020-07-26) https://doi.org/d438
DOI: 10.1111/2041-210x.13436
33. Deep generative models: Survey
Achraf Oussidi, Azeddine Elhassouny
Institute of Electrical and Electronics Engineers (IEEE) (2018-04) https://doi.org/ghjtg7
DOI: 10.1109/isacv.2018.8354080
34. Deep Reinforcement Learning that Matters
Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
arXiv (2019-01-31) https://arxiv.org/abs/1709.06560
35. Scalable and accurate deep learning with electronic health records
Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J. Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, … Jeffrey Dean
npj Digital Medicine (2018-05-08) https://doi.org/gdqcc8
DOI: 10.1038/s41746-018-0029-1 · PMID: 31304302 · PMCID: PMC6550175
36. Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data
Alexios Koutsoukas, Keith J. Monaghan, Xiaoli Li, Jun Huan
Journal of Cheminformatics (2017-06-28) https://doi.org/gfwv4d
DOI: 10.1186/s13321-017-0226-y · PMID: 29086090 · PMCID: PMC5489441
37. Deep learning and alternative learning strategies for retrospective real-world clinical data
David Chen, Sijia Liu, Paul Kingsbury, Sunghwan Sohn, Curtis B. Storlie, Elizabeth B. Habermann, James M. Naessens, David W. Larson, Hongfang Liu
npj Digital Medicine (2019-05-30) https://doi.org/ghfwhh
DOI: 10.1038/s41746-019-0122-0 · PMID: 31304389 · PMCID: PMC6550223
38. Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning
Nicolas Papernot, Patrick McDaniel
arXiv (2018-03-14) https://arxiv.org/abs/1803.04765
39. To Trust Or Not To Trust A Classifier
Heinrich Jiang, Been Kim, Melody Y. Guan, Maya Gupta
arXiv (2018-10-30) https://arxiv.org/abs/1805.11783
40. Parameter tuning is a key part of dimensionality reduction via deep variational autoencoders for single cell RNA transcriptomics
Qiwen Hu, Casey S. Greene
World Scientific Pub Co Pte Lt (2018-11) https://doi.org/gf5pc7
DOI: 10.1142/9789813279827_0033
41. Ten Simple Rules for Taking Advantage of Git and GitHub
Yasset Perez-Riverol, Laurent Gatto, Rui Wang, Timo Sachsenberg, Julian Uszkoreit, Felipe da Veiga Leprevost, Christian Fufezan, Tobias Ternent, Stephen J. Eglen, Daniel S. Katz, … Juan Antonio Vizcaíno
PLOS Computational Biology (2016-07-14) https://doi.org/gbrb39
DOI: 10.1371/journal.pcbi.1004947 · PMID: 27415786 · PMCID: PMC4945047
42. Reproducibility of computational workflows is automated using continuous analysis
Brett K Beaulieu-Jones, Casey S Greene
Nature Biotechnology (2017-03-13) https://doi.org/f9ttx6
DOI: 10.1038/nbt.3780 · PMID: 28288103 · PMCID: PMC6103790
43. Ten Simple Rules for Reproducible Computational Research
Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, Eivind Hovig
PLoS Computational Biology (2013-10-24) https://doi.org/pjb
DOI: 10.1371/journal.pcbi.1003285 · PMID: 24204232 · PMCID: PMC3812051
44. Ten Simple Rules for Reproducible Research in Jupyter Notebooks
Adam Rule, Amanda Birmingham, Cristal Zuniga, Ilkay Altintas, Shih-Cheng Huang, Rob Knight, Niema Moshiri, Mai H. Nguyen, Sara Brin Rosenthal, Fernando Pérez, Peter W. Rose
arXiv (2018-10-19) https://arxiv.org/abs/1810.08055
45. Deep Learning SDK Documentation
NVIDIA
(2018-11-01) https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility
46. Phonetic Classification and Recognition Using the Multi-Layer Perceptron
Hong Leung, James Glass, Michael Phillips, Victor W. Zue
Advances in Neural Information Processing Systems (1991) https://proceedings.neurips.cc/paper/1990/file/3dd48ab31d016ffcbf3314df2b3cb9ce-Paper.pdf
47. The Dryad Digital Repository: Published evolutionary data as part of the greater data ecosystem
Todd Vision
Nature Precedings (2010-06-30) https://doi.org/ghk2km
DOI: 10.1038/npre.2010.4595.1
48. FigShare
Jatinder Singh
Journal of Pharmacology and Pharmacotherapeutics (2011) https://doi.org/cvqv67
DOI: 10.4103/0976-500x.81919 · PMID: 21772785 · PMCID: PMC3127351
49. Zenodo, an Archive and Publishing Repository: A tale of two herbarium specimen pilot projects
Mathias Dillen, Quentin Groom, Donat Agosti, Lars Nielsen
Biodiversity Information Science and Standards (2019-06-18) https://doi.org/ghk2kn
DOI: 10.3897/biss.3.37080
50. Open Science Framework (OSF)
Erin D. Foster, Ariel Deardorff
Journal of the Medical Library Association (2017-04-04) https://doi.org/gfxvhq
DOI: 10.5195/jmla.2017.88 · PMCID: PMC5370619
51. On Reproducible AI: Towards Reproducible Research, Open Science, and Digital Scholarship in AI Publications
Odd Erik Gundersen, Yolanda Gil, David W. Aha
AI Magazine (2018-09-28) https://doi.org/gfcqt6
DOI: 10.1609/aimag.v39i3.2816
52. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data
Alvis Brazma, Pascal Hingamp, John Quackenbush, Gavin Sherlock, Paul Spellman, Chris Stoeckert, John Aach, Wilhelm Ansorge, Catherine A. Ball, Helen C. Causton, … Martin Vingron
Nature Genetics (2001-12) https://doi.org/ck257n
DOI: 10.1038/ng1201-365 · PMID: 11726920
53. Tackling the widespread and critical impact of batch effects in high-throughput data
Jeffrey T. Leek, Robert B. Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W. Evan Johnson, Donald Geman, Keith Baggerly, Rafael A. Irizarry
Nature Reviews Genetics (2010-09-14) https://doi.org/cfr324
DOI: 10.1038/nrg2825 · PMID: 20838408 · PMCID: PMC3880143
54. Neural Networks: Tricks of the Trade
Lecture Notes in Computer Science (2012) https://doi.org/gfvtvt
DOI: 10.1007/978-3-642-35289-8
55. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Shaojie Bai, J. Zico Kolter, Vladlen Koltun
arXiv (2018-04-20) https://arxiv.org/abs/1803.01271
56. Machine learning and AI-based approaches for bioactive ligand discovery and GPCR-ligand recognition
Sebastian Raschka, Benjamin Kaufman
Methods (2020-08) https://doi.org/ghk2mf
DOI: 10.1016/j.ymeth.2020.06.016 · PMID: 32645448
57. Deep learning
Yann LeCun, Yoshua Bengio, Geoffrey Hinton
Nature (2015-05-27) https://doi.org/bmqp
DOI: 10.1038/nature14539 · PMID: 26017442
58. A Neural-Network Solution to the Concentrator Assignment Problem
Gene Tagliarini, Edward Page
Neural Information Processing Systems (1988) https://proceedings.neurips.cc/paper/1987/file/1679091c5a880faf6fb5e6087eb1b2dc-Paper.pdf
59. ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, … Li Fei-Fei
International Journal of Computer Vision (2015-04-11) https://doi.org/gcgk7w
DOI: 10.1007/s11263-015-0816-y
60. High-Throughput Classification of Radiographs Using Deep Convolutional Neural Networks
Alvin Rajkomar, Sneha Lingam, Andrew G. Taylor, Michael Blum, John Mongan
Journal of Digital Imaging (2016-10-11) https://doi.org/gcgk7v
DOI: 10.1007/s10278-016-9914-9 · PMID: 27730417 · PMCID: PMC5267603
61. Kipoi: accelerating the community exchange and reuse of predictive models for genomics
Žiga Avsec, Roman Kreuzhuber, Johnny Israeli, Nancy Xu, Jun Cheng, Avanti Shrikumar, Abhimanyu Banerjee, Daniel S. Kim, Lara Urban, Anshul Kundaje, … Julien Gagneur
Cold Spring Harbor Laboratory (2018-07-24) https://doi.org/gd24sx
DOI: 10.1101/375345
62. CNN Features Off-the-Shelf: An Astounding Baseline for Recognition
Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, Stefan Carlsson
Institute of Electrical and Electronics Engineers (IEEE) (2014-06) https://doi.org/f3np4s
DOI: 10.1109/cvprw.2014.131
63. Fast and robust segmentation of white blood cell images by self-supervised learning
Xin Zheng, Yong Wang, Guoyou Wang, Jianguo Liu
Micron (2018-04) https://doi.org/gdfh65
DOI: 10.1016/j.micron.2018.01.010 · PMID: 29425969
64. Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis
Wenlu Zhang, Rongjian Li, Tao Zeng, Qian Sun, Sudhir Kumar, Jieping Ye, Shuiwang Ji
IEEE Transactions on Big Data (2020-06-01) https://doi.org/gfvs28
DOI: 10.1109/tbdata.2016.2573280
65. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function
Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, Shimon Schocken
Neural Networks (1993-01) https://doi.org/bjjdg2
DOI: 10.1016/s0893-6080(05)80131-5
66. Dropout: a simple way to prevent neural networks from overfitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
The Journal of Machine Learning Research (2014-01-01) http://dl.acm.org/citation.cfm?id=2670313
67. Batch normalization: accelerating deep network training by reducing internal covariate shift
Sergey Ioffe, Christian Szegedy
Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (2015-07-06) https://dl.acm.org/citation.cfm?id=3045118.3045167
68. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
Sebastian Raschka
arXiv (2020-11-12) https://arxiv.org/abs/1811.12808
69. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms
Thomas G. Dietterich
Neural Computation (1998-10-01) https://doi.org/fqc9w5
DOI: 10.1162/089976698300017197 · PMID: 9744903
70. Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
Journal of Machine Learning Research (2014) http://jmlr.org/papers/v15/srivastava14a.html
71. A simple weight decay can improve generalization
Anders Krogh, John A. Hertz
Proceedings of the 4th International Conference on Neural Information Processing Systems (1991-12-02) http://dl.acm.org/citation.cfm?id=2986916.2987033
ISBN: 9781558602229
72. Adversarial Controls for Scientific Machine Learning
Kangway V. Chuang, Michael J. Keiser
ACS Chemical Biology (2018-10-19) https://doi.org/gfk9mh
DOI: 10.1021/acschembio.8b00881 · PMID: 30336670
73. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets
Takaya Saito, Marc Rehmsmeier
PLOS ONE (2015-03-04) https://www.ncbi.nlm.nih.gov/pubmed/25738806
DOI: 10.1371/journal.pone.0118432 · PMID: 25738806 · PMCID: PMC4349800
74. Comparison of Deep Learning With Multiple Machine Learning Methods and Metrics Using Diverse Drug Discovery Data Sets
Alexandru Korotcov, Valery Tkachenko, Daniel P. Russo, Sean Ekins
Molecular Pharmaceutics (2017-11-13) https://doi.org/gcj4p2
DOI: 10.1021/acs.molpharmaceut.7b00578 · PMID: 29096442 · PMCID: PMC5741413
75. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study
John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, Eric Karl Oermann
PLOS Medicine (2018-11-06) https://doi.org/gfj53h
DOI: 10.1371/journal.pmed.1002683 · PMID: 30399157 · PMCID: PMC6219764
76. Correct machine learning on protein sequences: a peer-reviewing perspective
Ian Walsh, Gianluca Pollastri, Silvio C. E. Tosatto
Briefings in Bioinformatics (2016-09) https://doi.org/f89ms7
DOI: 10.1093/bib/bbv082 · PMID: 26411473
77. Machine Learning to Identify Flexibility Signatures of Class A GPCR Inhibition
Joseph Bemister-Buffington, Alex J. Wolf, Sebastian Raschka, Leslie A. Kuhn
Biomolecules (2020-03-14) https://doi.org/ghm636
DOI: 10.3390/biom10030454 · PMID: 32183371 · PMCID: PMC7175283
78. Automated Inference of Chemical Discriminants of Biological Activity
Sebastian Raschka, Anne M. Scott, Mar Huertas, Weiming Li, Leslie A. Kuhn
Methods in Molecular Biology (2018) https://doi.org/ghk2pg
DOI: 10.1007/978-1-4939-7756-7_16 · PMID: 29594779
79. Deep Learning for Health Informatics
Daniele Ravi, Charence Wong, Fani Deligianni, Melissa Berthelot, Javier Andreu-Perez, Benny Lo, Guang-Zhong Yang
IEEE Journal of Biomedical and Health Informatics (2017-01) https://doi.org/gfgtzx
DOI: 10.1109/jbhi.2016.2636665 · PMID: 28055930
80. Towards trustable machine learning
Nature Biomedical Engineering (2018-10-10) https://doi.org/gfw9cn
DOI: 10.1038/s41551-018-0315-x · PMID: 31015650
81. On Interpretability of Artificial Neural Networks: A Survey
Fenglei Fan, Jinjun Xiong, Mengzhou Li, Ge Wang
arXiv (2020-12-02) https://arxiv.org/abs/2001.02522
82. Interpretable Machine Learning
Christoph Molnar
https://christophm.github.io/interpretable-ml-book/
83. Impact of Accuracy on Model Interpretations
Brian Liu, Madeleine Udell
arXiv (2020-11-20) https://arxiv.org/abs/2011.09903
84. An evaluation of machine-learning methods for predicting pneumonia mortality
Gregory F. Cooper, Constantin F. Aliferis, Richard Ambrosino, John Aronis, Bruce G. Buchanan, Richard Caruana, Michael J. Fine, Clark Glymour, Geoffrey Gordon, Barbara H. Hanusa, … Peter Spirtes
Artificial Intelligence in Medicine (1997-02) https://doi.org/b6vnmd
DOI: 10.1016/s0933-3657(96)00367-3
85. Intelligible Models for HealthCare
Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, Noemie Elhadad
Association for Computing Machinery (ACM) (2015-08-10) https://doi.org/gftgxk
DOI: 10.1145/2783258.2788613
86. Deep Ethical Learning: Taking the Interplay of Human and Artificial Intelligence Seriously
Anita Ho
Hastings Center Report (2019-01) https://doi.org/ggsqtt
DOI: 10.1002/hast.977 · PMID: 30790317
87. The Legal And Ethical Concerns That Arise From Using Complex Predictive Analytics In Health Care
I. Glenn Cohen, Ruben Amarasingham, Anand Shah, Bin Xie, Bernard Lo
Health Affairs (2014-07) https://doi.org/f6dggf
DOI: 10.1377/hlthaff.2014.0048 · PMID: 25006139
88. Model Cards for Model Reporting
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, Timnit Gebru
Association for Computing Machinery (ACM) (2019-01-29) https://doi.org/gftgjg
DOI: 10.1145/3287560.3287596
89. American Society for Bioethics and Humanities https://asbh.org/
90. 10 organizations leading the way in ethical AI
SAGE Ocean (2021-01-12) https://web.archive.org/web/20210112231619/https://ocean.sagepub.com/blog/10-organizations-leading-the-way-in-ethical-ai
91. Artificial Intelligence, Ethics, and Society https://www.aies-conference.com/2021/
92. Ten simple rules for responsible big data research
Matthew Zook, Solon Barocas, danah boyd, Kate Crawford, Emily Keller, Seeta Peña Gangadharan, Alyssa Goodman, Rachelle Hollander, Barbara A. Koenig, Jacob Metcalf, … Frank Pasquale
PLOS Computational Biology (2017-03-30) https://doi.org/gdqfcn
DOI: 10.1371/journal.pcbi.1005399 · PMID: 28358831 · PMCID: PMC5373508
93. Responsible, practical genomic data sharing that accelerates research
James Brian Byrd, Anna C. Greene, Deepashree Venkatesh Prasad, Xiaoqian Jiang, Casey S. Greene
Nature Reviews Genetics (2020-07-21) https://doi.org/gg7c57
DOI: 10.1038/s41576-020-0257-5 · PMID: 32694666
94. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures
Matt Fredrikson, Somesh Jha, Thomas Ristenpart
Association for Computing Machinery (ACM) (2015-10-12) https://doi.org/cwdm
DOI: 10.1145/2810103.2813677
95. Membership Inference Attacks against Machine Learning Models
Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov
arXiv (2017-04-04) https://arxiv.org/abs/1610.05820
96. Convolutional Networks on Graphs for Learning Molecular Fingerprints
David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan P. Adams
arXiv (2015-11-04) https://arxiv.org/abs/1509.09292
97. SIG-DB: Leveraging homomorphic encryption to securely interrogate privately held genomic databases
Alexander J. Titus, Audrey Flower, Patrick Hagerty, Paul Gamble, Charlie Lewis, Todd Stavish, Kevin P. O’Connell, Greg Shipley, Stephanie M. Rogers
PLOS Computational Biology (2018-09-04) https://doi.org/gd6xd5
DOI: 10.1371/journal.pcbi.1006454 · PMID: 30180163 · PMCID: PMC6138421
98. Towards the AlexNet Moment for Homomorphic Encryption: HCNN, the First Homomorphic CNN on Encrypted Data with GPUs
Ahmad Al Badawi, Jin Chao, Jie Lin, Chan Fook Mun, Jun Jie Sim, Benjamin Hong Meng Tan, Xiao Nan, Khin Mi Mi Aung, Vijay Ramaseshan Chandrasekhar
arXiv (2020-08-20) https://arxiv.org/abs/1811.00778
99. A generic framework for privacy preserving deep learning
Theo Ryffel, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Daniel Rueckert, Jonathan Passerat-Palmbach
arXiv (2018-11-14) https://arxiv.org/abs/1811.04017
100. Deep Learning with Differential Privacy
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, Li Zhang
Association for Computing Machinery (ACM) (2016-10-24) https://doi.org/gcrnp3
DOI: 10.1145/2976749.2978318
101. Privacy-preserving generative deep neural networks support clinical data sharing
Brett K. Beaulieu-Jones, Zhiwei Steven Wu, Chris Williams, Ran Lee, Sanjeev P. Bhavnani, James Brian Byrd, Casey S. Greene
Cold Spring Harbor Laboratory (2018-12-20) https://doi.org/gcnzrn
DOI: 10.1101/159756
102. Privacy-Preserving Distributed Deep Learning for Clinical Data
Brett K. Beaulieu-Jones, William Yuan, Samuel G. Finlayson, Zhiwei Steven Wu
arXiv (2018-12-05) https://arxiv.org/abs/1812.01484