Saturday, January 10, 2015

How Good is Your Crystal Ball: Utility, Methodology, and Validity of Clinical Prediction Models

"In God we trust; all others must bring data." 
-William Edwards Deming (1900-1993), 
American Engineer, Statistician, and Quality Guru

Clinical Prediction Models are increasingly used in clinical practice to provide diagnosis, prognosis, and anomaly detection. These models form a core component of Intelligent Systems in clinical care. In addition to traditional biomarkers, new predictors are emerging from research in genomics, proteomics, and imaging [1]. The predictive accuracy of Clinical Prediction Models also continues to improve thanks to novel techniques and tools from the science of Statistical Learning. In some cases, these methods provide better predictive performance than traditional regression methods, albeit at the cost of decreased interpretability. Open source statistical computing tools like R [2] have also made it relatively easy for Data Scientists to apply these novel techniques in practice.

Furthermore, the ubiquitous use of wearable sensors by an increasing number of patients (part of a larger trend called the "Internet of Things" or IoT) will generate an abundance of data that will also contribute to improvements in the predictive accuracy of these models. One interesting application of Clinical Prediction Models with wearable sensors is the use of Machine Learning algorithms for real-time anomaly detection in physiological time series.

Several Clinical Prediction Models have been published in the biomedical literature in recent years. Some have even been introduced into clinical practice. However, there are serious concerns about the credibility and validity of these models. A systematic review of the methodology and reporting of multivariable clinical prediction models reported the following:

The validation studies were characterized by poor design, inappropriate handling and acknowledgment of missing data and one of the most key performance measures of prediction models i.e. calibration often omitted from the publication. [3]
In this first post of the year, I discuss the usefulness of Clinical Prediction Models as well as state-of-the-art techniques and recommended methodologies for their development and validation. I also explore some traditional and new quantitative performance measures for clinical prediction models.

Why do we need Clinical Prediction Models?

Clinical Prediction Models provide absolute risk prediction for conditions such as diabetes, kidney disease, cancer, cardiovascular disease, and depression. Other examples include predicting patient treatment response in cancer care, 30-day mortality for patients with an acute myocardial infarction (AMI), 30-day emergency admission to hospital, and 30-day readmission. Well-known Clinical Prediction Model development efforts include several risk prediction algorithms created by the QResearch project [4] in the United Kingdom (UK) and the cardiovascular risk functions developed by the Framingham Heart Study [5] in the United States (US).

These risk predictions are clinically useful for a number of reasons. They provide risk stratification for effective population health management, a key component of the accountable care organization (ACO) delivery model. They have the potential to reduce healthcare costs through early screening and the delivery of preventive services such as those recommended by the US Preventive Services Task Force (USPSTF). In clinical practice, Clinical Prediction Models can support clinician decision making during diagnostic work-up and test ordering [6].

Clinical Prediction Models enable Personalized Medicine since the predictions are made based on the clinical data of individual patients. Clinical Prediction Models also support shared decision making between patients and their providers, which weighs the benefits and harms of various treatment options against patients' preferences and personal values.

Thanks to significant investments in biomedical research in recent years, the number of treatment options for any specific disease continues to increase. Furthermore, with the discovery of new biomarkers from imaging, genomics, and proteomics research, the number of data types that should be considered in clinical decision making will surpass the information processing capacities of the human brain [7]. An average human can only hold a maximum of 7 ± 2 objects in working memory [8]. Francois de La Rochefoucauld (1613-1680), a noted French author, once said: "Everyone complains of his memory, and no one complains of his judgement."

Research in noninvasive neuroimaging is improving our understanding of how the brain works and is leading to the discovery of neurological markers (neuromarkers) which are being used to create clinical prediction models for use in mental health and substance abuse treatment. The emerging field of neuroprognosis is leveraging these neuromarkers to predict patients' future relapse or treatment response to pharmacological and behavioral treatment [28]. The emerging Deep Learning techniques hold great promise in the field of neuroimaging [33].

A prospective study at the MAASTRO Clinic of Maastricht University Medical Center in The Netherlands compared treatment outcome predictions by experienced radiation oncologists (ROs) against those made by Clinical Prediction Models. The study found that the models "substantially outperformed ROs’ predictions and guideline-based recommendations currently used in clinical practice" [9]. According to Dr. Cary Oberije, a researcher at the MAASTRO Clinic who presented the findings at the 2nd Forum of the European Society for Radiotherapy and Oncology (ESTRO): 

If models based on patient, tumor and treatment characteristics already out-perform the doctors, then it is unethical to make treatment decisions based solely on the doctors’ opinions. We believe models should be implemented in clinical practice to guide decisions. [10]
Lastly, predictive models can be used for creating simulations. Simulations are used extensively in the aerospace industry during the design and training phases of aircraft systems. For example, predictive models can be used in Monte Carlo simulations to model the cost of treatment for a population of patients with diabetes [30].


The Importance of the "No Free Lunch" Theorem in Predictive Modeling

When it comes to the choice of statistical learning method for a given modeling task, I subscribe to the "No Free Lunch" theorem [11]. According to the "No Free Lunch" theorem, there is no one single model builder which will produce the model with the best performance for all modeling tasks. The modeler should be familiar with and try multiple model builders and select the model with the best performance for the prediction task and data set at hand. 

Beyond the traditional regression analyses (linear, logistic, and Cox regression) [23] typically used in the medical field, new sophisticated Statistical Learning methods are now available. These algorithms include Neural Networks, Support Vector Machines (SVMs), Bayesian Networks, Multivariate Adaptive Regression Splines (MARS), and Boosted Trees, to name a few. For example, Jayasurya et al. compared the performance of Bayesian Network (BN) and support vector machine (SVM) models for two-year survival prediction in lung cancer patients treated with radiotherapy. They concluded that BN models had better overall performance than SVM models when handling missing data [12], a situation that is common in medical data.

The R [2] packages Caret [13] and Rattle [14] provide functions for fitting models using several algorithms on a data set. Caret provides utilities for comparing their results.

Transparent and Reproducible Modeling

The credibility of a Clinical Prediction Model comes from a transparent, reproducible, and peer-reviewed modeling approach based on methodological rigor. Ideally, Clinical Prediction Models should be developed as open source software so that anyone can evaluate their underlying quality. Free and publicly available de-identified data sets should be available to predictive modelers and researchers. Commercial entities should have the right to keep their models proprietary as long as there is a third-party independent validation of the model as is done so well for reliable avionics software in the aviation industry.

The author(s) of the model should document the data pre-processing steps that have been applied to the original data set during the analysis and model building process. The R package pmmlTransformations [29] provides an interoperable and computable representation of the data pre-processing steps that are applied to the input data prior to modeling. Supported data transformation elements include: normalization, discretization, value mappings, and functions.

Open source tools like knitr [15] and R Markdown [16] simplify the task of creating well-documented and reproducible models by leveraging the typesetting capabilities of LaTeX in combination with R code for the dynamic generation of documents, presentations, and reports in multiple formats like HTML, PDF, and Word.

Recently, a minimum set of guidelines called the TRIPOD Statement for reporting clinical prediction models has been published and endorsed by leading scientific publications and clinical prediction modeling researchers. TRIPOD stands for Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis. The TRIPOD Statement includes a 22-item checklist to help practitioners report how the study was designed, conducted, analyzed, and interpreted [31].

Development Methodology and Validation

Steyerberg and Vergouwe suggest the following steps for the development of Clinical Prediction Models: "(i) consideration of the research question and initial data inspection; (ii) coding of predictors; (iii) model specification; (iv) model estimation; (v) evaluation of model performance; (vi) internal validation; and (vii) model presentation" [1]. An important distinction is made between internal validity and external validity. Internal validity refers to the reproducibility of the model to the patient population whose clinical data were used to train the model. External validity refers to the generalizability or extrapolation of the model to previously unseen clinical data from other patient populations (e.g., patient populations from a different country, time period, region, or clinical site). 

Handling of Missing Data

Missing data is a common issue in clinical data sets used for predictive modeling. Removing samples with missing values (also referred to as "complete case" analysis) can lead to biased results if the deleted samples are not a completely random subset of the original data set. Mean imputation and missing-indicator methods can also produce biased results and are not recommended. Multiple imputation has been shown to produce correctly estimated standard errors and confidence intervals and is the recommended approach [32]. 
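To make the idea of "correctly estimated standard errors" a little more concrete, here is a minimal sketch of how estimates from several imputed data sets are commonly pooled using Rubin's rules. The function name pool_rubin and the numbers are hypothetical, and the imputation step itself is assumed to have been carried out elsewhere.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputed data sets")
    pooled = mean(estimates)                 # pooled point estimate
    within = mean(variances)                 # average within-imputation variance
    between = variance(estimates)            # between-imputation (sample) variance
    total = within + (1 + 1 / m) * between   # total variance of the pooled estimate
    return pooled, sqrt(total)

# Hypothetical log-odds ratios and squared standard errors from 5 imputed data sets
estimates = [0.50, 0.55, 0.45, 0.52, 0.48]
variances = [0.010, 0.011, 0.009, 0.010, 0.010]

pooled, se = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, standard error = {se:.3f}")
```

The between-imputation term is what single (e.g., mean) imputation ignores: it reflects the extra uncertainty caused by not knowing the missing values.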

Validation using Resampling Techniques 

Over-fitting is always a concern in model building. Traditional data splitting techniques include: random sampling, splitting by time period (if the data set is large enough), stratified random sampling (to account for severe class imbalance), and maximum dissimilarity sampling [13]. However, more recent resampling techniques (as opposed to simple random training/test splits of the data) can provide more reliable estimates of model performance. Resampling techniques include:
  • K-fold cross-validation
  • Repeated k-fold cross-validation
  • Leave-one-out cross-validation (LOOCV)
  • Repeated training/test splits or "Monte Carlo cross-validation"
  • The Bootstrap and its variants such as the ".632 method" and the ".632+ method".
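As a rough illustration of two of the resampling schemes listed above, the following sketch generates index sets for k-fold cross-validation and for a single bootstrap resample. The helper names k_fold_indices and bootstrap_indices are my own for this example; in practice a package such as Caret handles this bookkeeping.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k folds of roughly equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def bootstrap_indices(n, seed=0):
    """Return (in_bag, out_of_bag) indices for one bootstrap resample."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]   # draw n rows with replacement
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print("test:", sorted(test), "train size:", len(train))

in_bag, oob = bootstrap_indices(10)
print("in-bag size:", len(in_bag), "out-of-bag:", sorted(oob))
```

Each sample is held out exactly once in k-fold cross-validation, whereas in the bootstrap the held-out (out-of-bag) set is whatever was never drawn.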

Model builders typically have one or more tuning parameters. For example, the number of neighbors to be used with a K-nearest neighbor (kNN) model builder is a tuning parameter that can affect model performance. For each candidate value of K, the training data is resampled several times and an aggregated performance profile is generated and evaluated to determine the optimal value [17, 18].
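The tuning loop described above can be sketched as follows. This is a toy illustration only: the data are synthetic, the classifier is a bare-bones one-feature kNN, and the names knn_predict and cv_accuracy are hypothetical.

```python
import random
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Predict the majority class among the k training points nearest to x (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cv_accuracy(xs, ys, k_neighbors, n_folds=5, seed=0):
    """Mean accuracy of kNN across n_folds cross-validation folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        hits = sum(knn_predict(tx, ty, xs[i], k_neighbors) == ys[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# Synthetic data (made up): class 1 tends to have larger feature values
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50

# Performance profile: one resampled accuracy estimate per candidate K (odd values avoid ties)
profile = {k: cv_accuracy(xs, ys, k) for k in (1, 3, 5, 7, 9)}
best_k = max(profile, key=profile.get)
print(profile, "best K:", best_k)
```

The same folds are reused for every candidate K (fixed seed), so differences in the profile reflect the tuning parameter rather than the luck of the split.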

An important issue to be aware of during model tuning is the so-called "Bias-Variance Trade-off". Models with high variance can over-fit the training data even though they may have low bias. The challenge is to arrive at a model with both low variance and low squared bias [18].

Quantitative Measures of Model Performance


The Root Mean Squared Error (RMSE)

In general, quantitative measures of quality depend on whether or not the outcome is continuous. When the outcome is continuous, the root mean squared error (RMSE) is typically used. The RMSE is a summary measure of the model residuals, which are the differences between the observed and the predicted values.
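A minimal sketch of the RMSE computation, using made-up observed and predicted values:

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    return sqrt(sum(r * r for r in residuals) / len(residuals))

observed = [3.0, 5.0, 7.0, 9.0]     # hypothetical outcomes
predicted = [2.0, 5.0, 8.0, 11.0]   # hypothetical model predictions

print(round(rmse(observed, predicted), 4))
```

The RMSE is expressed in the same units as the outcome, which makes it easy to interpret as a typical prediction error.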

The Coefficient of Determination or R-squared

Another measure of performance in regression models is the coefficient of determination or R-squared. The R-squared can be obtained by computing the correlation coefficient between the observed and predicted values and squaring it. The R-squared can take values between 0 and 1. A value of 1 indicates a perfect fit of the model to the data.
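Here is a small sketch of R-squared computed as the squared correlation between observed and predicted values, as described above. The data are made up and the helper names are my own.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_squared(observed, predicted):
    """R-squared as the squared correlation between observed and predicted values."""
    return pearson_r(observed, predicted) ** 2

observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.0, 9.5]
print(round(r_squared(observed, predicted), 4))

# Predictions that are perfectly correlated with the outcome but systematically off
biased = [2.0, 5.0, 8.0, 11.0]
print(round(r_squared(observed, biased), 4))
```

Note that with this definition R-squared measures correlation rather than agreement: the second set of predictions is never exactly right for three of the four samples, yet its R-squared is 1. This is one reason to report the RMSE alongside it.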


When the outcome is not numeric, such as in classification models, the goal is to obtain predicted class probabilities. Calibration is a measure of how well predicted class probabilities reflect the true probability of the outcome. For example, for a predicted 60% chance of a positive outcome for a patient, the observed proportion should be 60 patients with a positive outcome per 100 "similar patients". A calibration plot displays predicted class probabilities on the x-axis and the observed probabilities on the y-axis. Well-calibrated predictions lie on the 45-degree line. The observed probabilities can be plotted by deciles of predicted probabilities to compare their means.
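The data behind a calibration plot can be sketched as follows: sort patients by predicted probability, cut them into equally sized bins (deciles by default), and compare the mean prediction with the observed event rate in each bin. The function name calibration_table and the toy data are assumptions for illustration.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Return (mean predicted probability, observed event rate, count) per quantile bin."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer samples than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, obs_rate, len(chunk)))
    return rows

# Toy data: 10 patients predicted at 10% (1 event) and 10 predicted at 60% (6 events)
probs = [0.1] * 10 + [0.6] * 10
outcomes = [1] + [0] * 9 + [1] * 6 + [0] * 4

rows = calibration_table(probs, outcomes, n_bins=2)
for mean_pred, obs_rate, count in rows:
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}  n={count}")
```

In this toy example the two columns agree, so both points would fall on the 45-degree line of the calibration plot.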

Confusion Matrix 

The Confusion Matrix (also known as a Contingency Table or Error Matrix) is a simple cross-tabulation of the observed and predicted classes for the data. The Confusion Matrix can be represented as a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives.

The confusion matrix can also display the overall accuracy rate or error rate. However, the accuracy rate is not a reliable measure of performance because its value can be misleading in the case of severe class imbalance (very low or very high prevalence). 

Common Measures

The following are performance metrics commonly found in the biomedical literature and equations for computing their values:

  • Sensitivity or True Positive Rate (TPR): TPR = TP/(TP + FN)
  • Specificity (SPC) or True Negative Rate: SPC = TN/(FP + TN)
  • Precision or Positive Predictive Value: PPV = TP/(TP + FP)
  • Negative Predictive Value: NPV = TN/(FN + TN)
  • Fall-Out or False Positive Rate: FPR = FP/(FP + TN)
  • Accuracy = (TP + TN)/(TP + TN + FN + FP) = 1 – Error Rate
  • F-Measure or F-Score = 2TP/(2TP + FP + FN).
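The formulas above translate directly into code. The sketch below computes them from the four cells of a confusion matrix; the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common performance metrics from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical confusion matrix for 200 patients
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```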

There is usually a trade-off between the specificity and the sensitivity. This trade-off can be evaluated using the Receiver Operating Characteristic (ROC) curve (more on that later). A cautionary note is that the PPV and the NPV depend on the prevalence, which can vary across patient populations.
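To see how strongly the predictive values depend on prevalence, consider the following sketch. It holds sensitivity and specificity fixed at an illustrative 90% each and varies only the prevalence; the function name predictive_values is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    tp = sensitivity * prevalence               # expected fraction of true positives
    fn = (1 - sensitivity) * prevalence         # expected fraction of false negatives
    tn = specificity * (1 - prevalence)         # expected fraction of true negatives
    fp = (1 - specificity) * (1 - prevalence)   # expected fraction of false positives
    return tp / (tp + fp), tn / (tn + fn)

for prevalence in (0.50, 0.10, 0.01):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

With the same test, the PPV drops from 0.90 at 50% prevalence to 0.50 at 10% prevalence and to roughly 0.08 at 1% prevalence.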

Kappa Statistic

The Kappa statistic is a measure of the difference between the observed accuracy of a model and the expected accuracy. The latter is the accuracy that can be obtained by random chance alone. Compared with the overall accuracy, the Kappa statistic is more resilient to severe class imbalance. It can be computed using the following formula:

Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy)

Kappa ranges from -1 to 1. A value of 1 represents perfect agreement. A value of 0 indicates agreement no better than what would be obtained by random chance. Most values fall between 0 and 1.
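As a sketch, Kappa can be computed from a 2x2 confusion matrix as shown below. The expected accuracy is derived from the row and column totals, which is the usual definition for Cohen's Kappa; the counts are hypothetical.

```python
def kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: product of predicted and actual marginal totals for each class
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

# A reasonably balanced example
print(round(kappa(tp=40, fp=10, fn=20, tn=130), 3))

# Severe class imbalance: a model that always predicts "negative"
# is 95% accurate here, yet its Kappa is 0
print(round(kappa(tp=0, fp=0, fn=5, tn=95), 3))
```

The second case shows why Kappa is more informative than accuracy under class imbalance: the 95% accuracy is entirely explained by chance agreement.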

The Receiver Operating Characteristic Curve (ROC) and associated Area under the Curve (AUC)

The ROC curve plots the sensitivity (true-positive rate) against 1 – specificity (false-positive rate) for a range of cut-off values. The AUC is a key indicator of model performance in classification models. Larger values indicate better performance. An advantage of the ROC curve is that it is insensitive to class imbalance.
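The construction of the ROC curve and its AUC can be sketched in a few lines. The scores and labels below are made up, and the function names roc_points and auc are my own for this example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the cut-off over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predicted probabilities and true classes (1 = positive)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print("AUC =", auc(points))
```

The AUC of 0.8125 equals the proportion of (positive, negative) pairs in which the positive patient received the higher score (13 of 16 pairs here).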

Youden's J Index 

The Youden’s J Index can be calculated using the following formula:

J = Sensitivity + Specificity - 1

The Youden's Index (J) is essentially the difference between the true positive rate (TPR) and the false positive rate (FPR) [19]. For a model that performs at least as well as chance, J ranges from 0 to 1. The optimal classification cut-off point can be determined by the maximum value of the Youden's Index (the height above the chance line) on the ROC curve.
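A minimal sketch of selecting the cut-off that maximizes the Youden's Index, using made-up scores and labels (the function name youden_best_cutoff is hypothetical):

```python
def youden_best_cutoff(scores, labels):
    """Return (cut-off, J) maximizing J = sensitivity + specificity - 1."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = None
    for cut in sorted(set(scores)):
        # A sample is classified as positive when its score is at or above the cut-off
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        j = tp / pos + tn / neg - 1
        if best is None or j > best[1]:
            best = (cut, j)
    return best

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

cutoff, j = youden_best_cutoff(scores, labels)
print(f"best cut-off = {cutoff}, J = {j:.3f}")
```

Keep in mind that J weights sensitivity and specificity equally, which may not match the clinical costs of the two types of error.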

Equivocal Zones

In addition to an optimal cut-off value, an "equivocal" zone can be defined as well. For predictions that fall into this zone, the sample is classified as "equivocal" (meaning class membership is indeterminate) [20, 21].
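One simple way to express an equivocal zone in code is with a lower and an upper bound around the cut-off. The bounds of 0.4 and 0.6 below are arbitrary values chosen for illustration, not recommendations.

```python
def classify_with_equivocal_zone(prob, lower=0.4, upper=0.6):
    """Classify a predicted probability, abstaining inside the [lower, upper] zone."""
    if lower <= prob <= upper:
        return "equivocal"
    return "positive" if prob > upper else "negative"

for prob in (0.15, 0.45, 0.55, 0.90):
    print(prob, classify_with_equivocal_zone(prob))
```

Performance measures such as sensitivity and specificity would then be reported on the non-equivocal samples only, together with the proportion of samples that fell into the zone.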

Lift Charts

The lift is a measure of the relative performance of the model against a baseline like random guessing or a non-informative model. A Lift Chart plots the cumulative lift values on the y-axis against the percentage of samples evaluated on the x-axis. The lift function in the Caret package calculates the lift as the ratio of the percentage of samples (in each approximately equal split of the data) predicted as positive for a given class over the same percentage in the entire data set [13].
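The general idea of cumulative lift can be sketched as follows. This is a generic illustration with made-up data, not a reimplementation of the lift function in Caret: samples are ranked by predicted probability, and the event rate among the top fraction is divided by the event rate in the whole data set.

```python
def cumulative_lift(scores, labels, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return (fraction evaluated, cumulative lift) pairs."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)]
    base_rate = sum(ranked) / len(ranked)   # event rate of a non-informative model
    result = []
    for f in fractions:
        top = ranked[:max(1, round(f * len(ranked)))]
        result.append((f, (sum(top) / len(top)) / base_rate))
    return result

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

lift = cumulative_lift(scores, labels)
for fraction, value in lift:
    print(f"top {fraction:.0%}: lift {value:.2f}")
```

A lift of 2.5 in the top 20% means that positives are found there 2.5 times as often as in the data set as a whole; by construction the lift is 1 once all samples are evaluated.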


Performance Measures for Regression Analysis

Steyerberg and Vergouwe suggest the following performance measures for regression analysis: "calibration-in-the-large, or the model intercept (A); calibration slope (B); discrimination, with a concordance statistic (C); and clinical usefulness, with decision-curve analysis (D)" [1].

For generalized linear models, calibration-in-the-large is related to the intercept and compares the mean of predictions to the mean of outcomes. The mean of predictions and the mean of outcomes are equal during internal validation with resampling techniques, but could differ during external validation on previously unseen data. During internal validation, the calibration slope can be used as a shrinkage factor. During external validation on previously unseen data, the calibration slope could be less than one due to over-fitting [6]. This methodology for assessing the calibration of a logistic regression model was first proposed by Cox in 1958 [22]. When the intercept and the slope do not significantly differ from 0 and 1 respectively, the model is considered to have good calibration.

For binary outcomes, the concordance statistic (c-statistic) is equal to the area under the ROC curve (AUC).

The TRIPOD Statement's Explanation and Elaboration paper [31] also recommends calibration, discrimination, and Net Benefit as key performance indicators for regression models. Next, we discuss decision-curve analysis and Net Benefit. 

Decision Curve Analysis and Net Benefits (NB)

In clinical practice, a cut-off or threshold value of the predicted class probability is needed to assist clinicians in decision making. A default cut-off of 50% implies that benefits (e.g., remission and improved functional status and quality of life) and harms (e.g., severity of side-effects and costs) are weighted equally. Since this assumption is rarely correct in medicine, such a cut-off value would not be too useful to clinicians in practice [6]. For example, the value of true-positive classifications (e.g., patients with the disease correctly diagnosed as having the disease) and false-positive classifications (e.g., patients without the disease incorrectly diagnosed as having the disease) are typically not equal. Also the determination of benefits and harms of any decision could be driven by the specific context of the patient including the patient's preferences and values (shared decision making).

Vickers and Elkin introduced a decision-analytic approach called the "decision curve" which evaluates the Net Benefit (NB) of the model over a range of cut-off values [24]. The NB can be calculated using the formula:

NB = (TP - w * FP)/N 

where TP is the number of true-positive classifications, FP the number of false-positive classifications, N the total number of patients, and w (weight) the ratio of harm to benefit. The latter is calculated as the odds of the cut-off. For example, a cut-off value of 10% gives w = 0.10/0.90 = 1/9, which indicates that the benefit of a true-positive (TP) is valued 9 times higher than the harm of a false-positive (FP). The Decision Curve approach is an important tool for measuring value in the transition from a fee-for-service to a value-driven care delivery system.
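The Net Benefit formula above can be sketched in code as follows. The predicted probabilities and outcomes are made up, and treating a prediction at or above the cut-off as positive is an assumption of this example.

```python
def net_benefit(probs, outcomes, threshold):
    """Net Benefit NB = (TP - w * FP) / N, with w the odds of the threshold."""
    n = len(probs)
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    w = threshold / (1 - threshold)   # harm-to-benefit ratio implied by the cut-off
    return (tp - w * fp) / n

probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Evaluating NB over a range of cut-offs gives the points of a decision curve
for threshold in (0.1, 0.2, 0.3, 0.5):
    print(f"cut-off {threshold:.0%}: NB = {net_benefit(probs, outcomes, threshold):.3f}")
```

In a full decision-curve analysis, these values would be compared against the Net Benefit of the default strategies of treating all patients and treating none (NB = 0).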


Bending the Cost Curve

Controlling healthcare costs remains a challenge for many countries. Drummond and Holte proposed another method called the Cost Curve which can be used to visualize and compare the performance of classifiers based on the combination of misclassification costs and class distributions [25]. The Cost Curve plots the normalized expected misclassification cost (NEC) on the y-axis and the probability cost PC(+) on the x-axis. The probability cost PC(+) represents the combination of the two misclassification costs and the class distribution and can be calculated with the following formula:

PC(+) = (p(+)C(-|+))/(p(+)C(-|+) + (1 - p(+))C(+|-))

where p(+) is the class distribution (the probability that a given instance is positive), C(-|+) is the cost of a false negative, and C(+|-) is the cost of a false positive. The range of PC(+) values is 0 to 1.

The "Normalized Expected Cost" (NEC) can be calculated with the following formula:

NEC = FN * PC(+) + FP * (1 - PC(+))

where FN is the false negative rate and FP is the false positive rate. The range of NEC values is 0 to 1. There is bidirectional point/line duality [26] between ROC curves and Cost Curves. The point (FP, TP) in ROC space is a line in cost space which joins the points (0, FP) and (1, FN) [27].
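The two formulas above can be sketched as follows, using hypothetical costs and error rates. The example also checks the endpoints of the line that a single classifier traces in cost space.

```python
def probability_cost(p_pos, cost_fn, cost_fp):
    """PC(+): combines the class distribution with the two misclassification costs."""
    weighted_pos = p_pos * cost_fn
    return weighted_pos / (weighted_pos + (1 - p_pos) * cost_fp)

def normalized_expected_cost(fnr, fpr, pc):
    """NEC for a classifier with the given false negative and false positive rates."""
    return fnr * pc + fpr * (1 - pc)

# Hypothetical setting: 20% prevalence, a false negative costs 5 times a false positive
pc = probability_cost(p_pos=0.2, cost_fn=5.0, cost_fp=1.0)
nec = normalized_expected_cost(fnr=0.25, fpr=0.10, pc=pc)
print(f"PC(+) = {pc:.3f}, NEC = {nec:.3f}")

# The classifier's line in cost space runs from (0, FP) to (1, FN)
print(normalized_expected_cost(0.25, 0.10, 0.0), normalized_expected_cost(0.25, 0.10, 1.0))
```

Because NEC is linear in PC(+), each classifier (each point in ROC space) is fully described by these two endpoints.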


Model Presentation and Deployment

Model presentation techniques include traditional score charts, nomograms, and clinical rules [6]. However, Clinical Prediction Models are easier to use and maintain when deployed as scoring services (part of a service-oriented software architecture) and integrated into Clinical Decision Support (CDS) systems. The scoring service can be deployed in the cloud to allow integration with multiple client clinical systems. The Data Mining Group (DMG) Predictive Model Markup Language (PMML) specification supports the interoperable deployment of predictive models in heterogeneous software environments.

Visual Analytics or data visualization techniques can also play an important role in the effective presentation of Clinical Prediction Models to nonstatisticians, particularly in the context of shared decision making.



[1] Ewout W. Steyerberg, Yvonne Vergouwe. Towards better clinical prediction models: seven steps for development and an ABCD for validation. European Heart Journal 2014 Aug 1;35(29):1925-31 

[2] R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL

[3] Collins GS, de Groot JA, Dutton S et al (2014) External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol 14:40.

[4] Hippisley-Cox J, Coupland C, Brindle P. The performance of seven QPrediction risk scores in an independent external sample of patients from general practice: a validation study. BMJ Open. 2014 Aug 28;4(8)

[5] Dawber TR, Meadors GF, Moore FEJ: Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health 1951, 41:279-286.

[6] Ewout W. Steyerberg. Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating. New York: Springer, 2010.

[7] Stead WW, Searle JR, Fessler HE, Smith JW, Shortliffe EH. Biomedical informatics: changing what physicians need to know and how they learn. Acad Med. 2011 Apr;86(4):429-34.

[8] Miller GA. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol Rev. 1956;63:81–97.

[9] Oberije C, Nalbantov G, Dekker A, Boersma L, Borger J, Reymen B, van Baardwijk A, Wanders R, De Ruysscher D, Steyerberg E, Dingemans AM, Lambin P. A prospective study comparing the predictions of doctors versus models for treatment outcome of lung cancer patients: a step toward individualized care and shared decision making. Radiother Oncol. 2014 Jul;112(1):37-43.

[10] European Society for Radiotherapy and Oncology (ESTRO). "Mathematical models out-perform doctors in predicting cancer patients' responses to treatment." ScienceDaily. (accessed January 3, 2015).

[11] Wolpert D (1996). "The Lack of a priori Distinctions Between Learning Algorithms." Neural Computation, 8(7), 1341–1390.

[12] Jayasurya K, Fung G, Yu S, Dehing-Oberije C, De Ruysscher D, Hope A, De Neve W, Lievens Y, Lambin P, Dekker AL. Comparison of Bayesian network and support vector machine models for two-year survival prediction in lung cancer patients treated with radiotherapy. Med Phys. 2010 Apr;37(4):1401-7.

[13] Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer and Allan Engelhardt (2012). caret: Classification and Regression Training. R package version 5.15-044.

[14] Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.1.4, URL

[15] Yihui Xie (2014). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.8.

[16] R Studio (2013), Using R Markdown with Rstudio,

[17] Max Kuhn, Kjell Johnson. Applied Predictive Modeling. New York: Springer, 2013.

[18] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. New York: Springer, 2013.

[19] Youden W (1950). "Index for Rating Diagnostic Tests." Cancer, 3(1), 32–35.

[20] US Food and Drug Administration. Guidance for Industry and FDA Staff - Class II Special Controls Guidance Document: Cardiac Allograft Gene Expression Profiling Test Systems. Accessed January 10, 2015.

[21] Max Kuhn. Equivocal Zones. R Bloggers. Accessed January 10, 2015.

[22] Cox DR. Two further applications of a model for binary regression. Biometrika 1958; 45:562-565.

[23] Frank E. Harrell, Jr. Regression Modeling Strategies. With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2010.

[24] Vickers AJ, Elkin EB. Decision Curve Analysis: a novel method for evaluating prediction models. Med Decis Making. 2006; 26(6):565-75

[25] Robert C. Holte and Chris Drummond. Cost-sensitive Classifier Evaluation using Cost Curves. Advances in Knowledge Discovery and Data Mining Lecture Notes in Computer Science Volume 5012, 2008, pp 26-29.

[26] Preparata, F. P., & Shamos, M. I. (1988). Computational Geometry, An Introduction, Text and Monographs in Computer Science. New York: Springer-Verlag.

[27] Chris Drummond, Robert C. Holte. Cost curves: An improved method for visualizing classifier performance. Mach Learn (2006) 65:95–130.

[28]  Gabrieli, John D.E., Ghosh, Satrajit S., Whitfield-Gabrieli, Susan. Prediction as a Humanitarian and Pragmatic Contribution from Human Cognitive Neuroscience. Neuron, Volume 85, Issue 1, 11-26.

[29] Tridivesh Jena, Wen Ching Lin (2014). Package pmmlTransformations. R package version 1.2.2.

[30] Svetlana Levitan, Richard Cohen, Vladimir Shklover. PMML in Simulation. PMML'13, August 11 2013, Chicago, Illinois, USA. 

[31] Moons KG, Altman DG, Reitsma JB, Ioannidis JP, Macaskill P, Steyerberg EW, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration. Ann Intern Med. 2015;162:W1-W73.

[32] Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006; 59:1087-91.

[33] Plis SM, Hjelm DR, Salakhutdinov R, Allen EA, Bockholt HJ, Long JD, Johnson HJ, Paulsen JS, Turner JA and Calhoun VD (2014) Deep learning for neuroimaging: a validation study. Front. Neurosci. 8:229. 

Sunday, November 2, 2014

Toward a Reference Architecture for Intelligent Systems in Clinical Care

A Software Architecture for Precision Medicine

Intelligent systems in clinical care leverage the latest innovations in machine learning, real-time data stream mining, visual analytics, natural language processing, ontologies, production rule systems, and cloud computing to provide clinicians with the best knowledge and information at the point of care for effective clinical decision making. In this post, I propose a unified open reference architecture that combines all these technologies into a hybrid cognitive system for clinical decision support. Indeed, truly intelligent systems are capable of reasoning. The goal is not to replace clinicians, but instead to provide them with cognitive support during clinical decision making. Furthermore, Intelligent Personal Assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana have raised our expectations on how intelligent systems interact with users through voice and natural language.

In the strict sense of the term, a reference architecture should be abstracted away from concrete technology implementation. However, in order to enable a better understanding of the proposed approach, I take the liberty of explaining how available open source software can be used to realize the intent of the architecture. There is an urgent need for an open and interoperable architecture which can be deployed across devices and platforms. Unfortunately, this is not the case today with solutions like Apple's HealthKit and ResearchKit.

The specific open source software mentioned in this post can be substituted with other tools which provide similar capabilities. The following diagram is a depiction of the architecture.


Clinical Data Sources

Clinical data sources are represented on the left of the architecture diagram. Examples include electronic medical record systems (EMR) commonly used in routine clinical care, clinical genome databases, genome variant knowledge bases, medical imaging databases, data from medical devices and wearable sensors, and unstructured data sources such as biomedical literature databases. The approach implements the Lambda Architecture enabling both batch and real-time data stream processing and mining.

Predictive Modeling, Real-Time Data Stream Mining, and Big Data Genomics

The back-end provides various tools and frameworks for advanced analytics and decision management. The analytics workbench includes tools for creating predictive models and for data stream mining. The decision management workbench includes a production rule system (providing seamless integration with clinical events and processes) and an ontology editor.

The incoming clinical data likely meet the Big Data criteria of volume, velocity, and variety (this is particularly true for physiological time series from wearable sensors). Therefore, specialized frameworks for large scale cluster computing like Apache Spark are used to analyze and process the data. Statistical computing and Machine Learning tools like R are used here as well. The goal is knowledge and pattern discovery using Machine Learning model builders like Decision Trees, k-Means Clustering, Logistic Regression, Support Vector Machines (SVMs), Bayesian Networks, Neural Networks, and the more recent Deep Learning techniques. The latter hold great promise in applications such as Natural Language Processing (NLP), medical image analysis, and speech recognition.

These Machine Learning algorithms can support diagnosis, prognosis, simulation, anomaly detection, care alerting, and care planning. For example, anomaly detection can be performed at scale using the k-means clustering machine learning algorithm in Apache Spark. In addition, Apache Spark allows the implementation of the Lambda Architecture and can also be used for genome Big Data analysis at scale.
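
As a rough illustration of the k-means idea mentioned above, the following single-machine Python sketch clusters a few "normal" readings and scores new readings by their distance to the nearest cluster center. It is not Spark code; the sample values, the starting centroids, and the threshold are all assumptions made for the example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means; `centroids` are caller-supplied starting centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids

def anomaly_score(point, centroids):
    """Distance from a reading to its nearest cluster center."""
    return min(math.dist(point, c) for c in centroids)

# Hypothetical (heart rate in bpm, respiratory rate per minute) samples:
# three taken at rest and three during exercise.
normal = [(62, 14), (65, 15), (60, 13), (118, 24), (122, 26), (120, 25)]
centroids = kmeans(normal, centroids=[normal[0], normal[3]])

THRESHOLD = 10.0  # assumed cut-off; in practice it would be tuned on held-out data
for reading in [(64, 14), (121, 25), (185, 40)]:
    score = anomaly_score(reading, centroids)
    print(reading, round(score, 1), "ANOMALY" if score > THRESHOLD else "ok")
```

Only the last reading lies far from both cluster centers and is flagged. A cluster computing framework would distribute the same two steps (fit the centers, then score incoming points) across many machines.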

In another post titled How Good is Your Crystal Ball?: Utility, Methodology, and Validity of Clinical Prediction Models, I discuss quantitative measures of performance for clinical prediction models.

Visual Analytics

Visual Analytics tools like D3.js, rCharts, plotly, googleVis, ggplot2, and ggvis can help obtain deep insight for effective understanding, reasoning, and decision making through the visual exploration of massive, complex, and often ambiguous data. Of particular interest is Visual Analytics of real-time data streams like physiological time series. As a multidisciplinary field, Visual Analytics combines several disciplines such as human perception and cognition, interactive graphic design, statistical computing, data mining, spatio-temporal data analysis, and even Art. For example, similar to Minard's map of the Russian Campaign of 1812-1813 (see graphic below), Visual Analytics can help in comparing different interventions and care pathways and their respective clinical outcomes over a certain period of time by displaying causes, variables, comparisons, and explanations.

Production Rule System, Ontology Reasoning, and NLP

The architecture also includes a production rule engine and an ontology editor (Drools and Protégé respectively). This is done in order to leverage existing clinical domain knowledge available from clinical practice guidelines (CPGs) and biomedical ontologies like SNOMED CT.  This approach complements machine learning algorithms' probabilistic approach to clinical decision making under uncertainty. The production rule system can translate CPGs into executable rules which are fully integrated with clinical processes (workflows) and events. The ontologies can provide automated reasoning capabilities for decision support.

NLP includes capabilities such as:
  • Text classification, text clustering, document and passage retrieval, text summarization, and more advanced clinical question answering (CQA) capabilities which can be useful for satisfying clinicians' information needs at the point of care; and
  • Named entity recognition (NER) for extracting concepts from clinical notes.
The data tier supports the efficient storage of large amounts of time series data and is implemented with tools like Cassandra and HBase. The system can run in the cloud, for example using the Amazon Elastic Compute Cloud (EC2). For real-time processing of distributed data streams, cloud-based solutions like Amazon Kinesis and Lambda can be used.


Clinical Decision Services

The clinical decision services provide intelligence at the point of care, typically using deployed predictive models, clinical rules, text mining outputs, and ontology reasoners. For example, Machine Learning models can be exported in Predictive Model Markup Language (PMML) format for run-time scoring based on the clinical data of individual patients, enabling what is referred to as Personalized Medicine. Clinical decision services include:

  • Diagnosis and prognosis
  • Simulation
  • Anomaly detection 
  • Data visualization
  • Information retrieval (e.g., clinical question answering)
  • Alerts and reminders
  • Support for care planning processes.
The clinical decision services can be deployed in the cloud as well. Other clinical systems can consume these services through a SOAP or REST-based web service interface (using the HL7 vMR and DSS specifications for interoperability) and single sign-on (SSO) standards like SAML2 and OpenID Connect.
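
To make the idea of run-time scoring from an exported model more concrete, here is a minimal Python sketch that reads a heavily simplified PMML fragment and scores one patient. The fragment omits the header, data dictionary, and mining schema that a complete PMML document would carry, and the predictor names, coefficients, and patient values are invented for illustration.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML fragment: a logistic regression with two predictors.
# A complete PMML file would also contain Header, DataDictionary, and MiningSchema.
PMML = """
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <RegressionModel functionName="regression" normalizationMethod="logit">
    <RegressionTable intercept="-7.0">
      <NumericPredictor name="age" coefficient="0.05"/>
      <NumericPredictor name="systolic_bp" coefficient="0.03"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""
NS = {"p": "http://www.dmg.org/PMML-4_2"}

def score(pmml_text, patient):
    """Return the predicted probability for one patient (a dict of predictor values)."""
    table = ET.fromstring(pmml_text).find(".//p:RegressionTable", NS)
    linear = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        linear += float(predictor.get("coefficient")) * patient[predictor.get("name")]
    # The logit normalization maps the linear predictor to a probability.
    return 1.0 / (1.0 + math.exp(-linear))

risk = score(PMML, {"age": 60, "systolic_bp": 140})
print(round(risk, 3))
```

In practice a dedicated PMML scoring engine would be used rather than hand-written parsing; the point of the sketch is only that the model travels as data and is evaluated against each patient's values at run time.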

Intelligent Personal Assistants (IPAs)

Clinical decision services can also be delivered to patients and clinicians through IPAs. IPAs can accept inputs in the form of voice, images, and user's context and respond in natural language. IPAs are also expanding to wearable technologies such as smart watches and glasses. The precision of speech recognition, natural language processing, and computer vision is improving rapidly with the adoption of Deep Learning techniques and tools. Accelerated hardware technologies like GPUs and FPGAs are improving the performance and reducing the cost of deploying these systems at scale.

Hexagonal, Reactive, and Secure Architecture

Intelligent Health IT systems are not just capable of discovering knowledge and patterns in data. They are also scalable, resilient, responsive, and secure. To achieve these objectives, several architectural patterns have emerged during the last few years:

  • Domain Driven Design (DDD) puts the emphasis on the core domain and domain logic and recommends a layered architecture (typically user interface, application, domain, and infrastructure) with each layer having well defined responsibilities and interfaces for interacting with other layers. Models exist within "bounded contexts". These "bounded contexts" communicate with each other typically through messaging and web services using HL7 standards for interoperability.

  • The Hexagonal Architecture defines "ports and adapters" as a way to design, develop, and test an application in a way that is independent of the various clients, devices, transport protocols (HTTP, REST, SOAP, MQTT, etc.), and even databases that could be used to consume its services in the future. This is particularly important in the era of the Internet of Things in healthcare.

  • Microservices consist of decomposing large monolithic applications into smaller services, following the well-established principles of service-oriented design and single responsibility, to achieve modularity, maintainability, scalability, and ease of deployment (for example, using Docker).

  • CQRS/ES: Command Query Responsibility Segregation (CQRS) and Event Sourcing (ES) are two architectural patterns which consist in the use of event-driven messaging and an Event Store for separating commands (write-side) from queries (read-side) relying on the principle of Eventual Consistency. CQRS/ES can be implemented in combination with microservices to deliver new capabilities such as temporal queries, behavioral analysis, complex audit logs, and real-time notifications and alerts.

  • Functional Programming: Functional Programming languages like Scala have several benefits that are particularly important for applying Machine Learning algorithms on large data sets. Like functions in mathematics, pure functions in Scala have no side effects. This provides referential transparency. Machine Learning algorithms are in fact based on Linear Algebra and Calculus. Scala supports higher-order functions as well. Idiomatic Scala favors immutable values, which greatly simplifies concurrency. For all those reasons, Machine Learning libraries like Apache Mahout have embraced Scala, moving away from the Java MapReduce paradigm.

  • Reactive Architecture: The Reactive Manifesto makes the case for a new breed of applications called "Reactive Applications". According to the manifesto, the Reactive Application architecture allows developers to build "systems that are event-driven, scalable, resilient, and responsive."  Leading frameworks that support Reactive Programming include Akka and RxJava. The latter is a library for composing asynchronous and event-based programs using observable sequences. RxJava is a Java port (with a Scala adaptor) of the original Rx (Reactive Extensions) for .NET created by Erik Meijer.

    Based on the Actor Model and built in Scala, Akka is a framework for building highly concurrent, asynchronous, distributed, and fault tolerant event-driven applications on the JVM. Akka offers location transparency, fault tolerance, asynchronous message passing, and a non-deterministic share-nothing architecture. Akka Cluster provides a fault-tolerant decentralized peer-to-peer based cluster membership service with no single point of failure or single point of bottleneck.

    Also built with Scala, Apache Kafka is a scalable message broker which provides high-throughput, fault-tolerance, built-in partitioning, and replication  for processing real-time data streams. In the reference architecture, the ingestion layer is implemented with Akka and Apache Kafka.

  • Web Application Security: special attention is given to security across all layers, notably the proper implementation of authentication, authorization, encryption, and audit logging. The implementation of security is also driven by deep knowledge of application security patterns, threat modeling, and enforcing security best practices (e.g., OWASP Top Ten and CWE/SANS Top 25 Most Dangerous Software Errors) as part of the continuous delivery process.
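
The CQRS/ES pattern from the list above can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions: the event names, the patient identifier, and the medications are hypothetical, and events are delivered to the read model synchronously, whereas a production system would typically deliver them asynchronously through a message broker (hence eventual consistency).

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    patient_id: str
    kind: str   # e.g. "MedicationPrescribed" (hypothetical event name)
    data: str

class EventStore:
    """Append-only log: events are never updated or deleted."""
    def __init__(self):
        self._events = []
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def append(self, event):
        self._events.append(event)
        for handler in self._subscribers:
            handler(event)

    def replay(self, upto=None):
        """Return the first `upto` events (all of them when `upto` is None)."""
        return list(self._events[:upto])

class MedicationListView:
    """Read side: a query-optimized projection built only from events."""
    def __init__(self):
        self.active = defaultdict(set)

    def apply(self, event):
        if event.kind == "MedicationPrescribed":
            self.active[event.patient_id].add(event.data)
        elif event.kind == "MedicationDiscontinued":
            self.active[event.patient_id].discard(event.data)

# Write side: command handlers record what happened as events.
def prescribe(store, patient_id, drug):
    store.append(Event(patient_id, "MedicationPrescribed", drug))

def discontinue(store, patient_id, drug):
    store.append(Event(patient_id, "MedicationDiscontinued", drug))

store = EventStore()
view = MedicationListView()
store.subscribe(view.apply)

prescribe(store, "p1", "methadone")
prescribe(store, "p1", "naloxone")
discontinue(store, "p1", "methadone")
print(sorted(view.active["p1"]))   # ['naloxone']

# Temporal query: rebuild the state as it was after the first two events.
past = MedicationListView()
for event in store.replay(upto=2):
    past.apply(event)
print(sorted(past.active["p1"]))   # ['methadone', 'naloxone']
```

Because the log is the source of truth, the audit trail and the temporal query come almost for free: any past state can be rebuilt by replaying a prefix of the events.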

An Interface that Works across Devices and Platforms

The front-end uses a Mobile First approach and a Single Page Application (SPA) architecture with JavaScript-based frameworks like AngularJS to create very responsive user experiences. This approach also allows us to bring the following software engineering best practices to the front-end:

  • Dependency Injection
  • Test-Driven Development (Jasmine, Karma, PhantomJS)
  • Package Management (Bower or npm)
  • Build system and Continuous Integration (Grunt or Gulp.js)
  • Static Code Analysis (JSLint and JSHint), and 
  • End-to-End Testing (Protractor). 
For mobile devices, Apache Cordova can be used to access native functions when desired. The main goal is to provide a user interface that works across devices and platforms such as iOS, Android, and Windows Phone.


Interoperability will always be a key requirement in clinical systems. Interoperability is needed between all players in the healthcare ecosystem including providers, payers, labs, knowledge artifact developers, quality measure developers, and public health agencies like the CDC. The standards needed to achieve it exist today and are implementation-ready. However, only health IT buyers have the leverage to demand interoperability from their vendors.

Standards related to clinical decision support (CDS) include:

  • The HL7 Fast Healthcare Interoperability Resources (FHIR)
  • The HL7 virtual Medical Record (vMR)
  • The HL7 Decision Support Services (DSS) specification
  • The HL7 CDS Knowledge Artifact specification
  • The DMG Predictive Model Markup Language (PMML) specification.

Overcoming Barriers to Adoption

In a previous post, I discussed a practical approach to addressing challenges to the adoption of clinical decision support (CDS) systems.

Monday, September 15, 2014

Single Sign-On (SSO) for Cloud-based SaaS Applications

Single Sign-On (SSO) is a key capability for Software as a Service (SaaS) applications particularly when there is a need to integrate with existing enterprise applications. In the enterprise world dominated by SOAP-based web services, security has been traditionally achieved with standards like WS-Security, WS-SecurityPolicy, WS-SecureConversation, WS-Trust, XML Encryption, XML Signatures, the WS-Security SAML Token Profile, and XACML.

During the last few years, the popularity of Web APIs, mobile technology, and Cloud-based software services has led to the emergence of light-weight security standards in support of the new REST/JSON paradigm with specifications like OAuth2 and OpenID Connect.

In this post, I discuss the state of the art in standards for SSO.

SAML2 Web SSO Profile

SAML2 Web SSO Profile (not to be confused with the WS-Security SAML Token Profile mentioned earlier) is not a new standard. It was approved as an OASIS standard in 2005. SAML2 Web SSO Profile is still today a force to be reckoned with when it comes to enabling SSO within the enterprise. In a post titled SAML vs OAuth: Which One Should I Use?, Anil Saldhana, former Lead Identity Management Architect at Red Hat, offered the following suggestions:

  • If your usecase involves SSO (when at least one actor or participant is an enterprise), then use SAML.
  • If your usecase involves providing access (temporarily or permanent) to resources (such as accounts, pictures, files etc), then use OAuth.
  • If you need to provide access to a partner or customer application to your portal, then use SAML.
  • If your usecase requires a centralized identity source, then use SAML  (Identity provider).
  • If your usecase involves mobile devices, then OAuth2 with some form of Bearer Tokens is appropriate.

Salesforce, which is arguably the leader in cloud-based SaaS services, supports SAML2 Web SSO Profile as one of its main SSO mechanisms (see the Salesforce Single Sign-On Implementation Guide). The Google Apps platform supports SAML2 Web SSO Profile as well.

Federal Identity, Credential, and Access Management (FICAM), a US Federal Government initiative, has selected SAML2 Web SSO Profile for the purpose of Level of Assurance (LOA) 1 to 4 as defined by the NIST Special Publication 800-63-2 (see ICAM SAML 2.0 Web Browser SSO Profile). This is significant given the challenges associated with identity federation at the scale of a large organization like the US federal government.

SAML bindings specify underlying transport protocols including:

  • HTTP Redirect Binding
  • HTTP POST Binding
  • HTTP Artifact Binding
  • SAML SOAP Binding.
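
As an illustration of what a binding specifies, the HTTP Redirect Binding carries the SAML message in a URL query parameter after compressing it with DEFLATE and encoding it with base64. The following Python sketch shows that encoding and its inverse using only the standard library. The identity provider URL, the RelayState value, and the skeletal AuthnRequest are placeholders, and request signing is left out.

```python
import base64
import zlib
from urllib.parse import parse_qs, urlencode, urlparse

def encode_redirect(saml_xml: str) -> str:
    """Raw DEFLATE (wbits=-15: no zlib header or checksum), then base64."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    deflated = compressor.compress(saml_xml.encode("utf-8")) + compressor.flush()
    return base64.b64encode(deflated).decode("ascii")

def decode_redirect(value: str) -> str:
    """Inverse of encode_redirect, as an identity provider would apply it."""
    return zlib.decompress(base64.b64decode(value), -15).decode("utf-8")

# Skeletal, hypothetical AuthnRequest (a real one carries an Issuer and more).
authn_request = (
    '<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" '
    'ID="_example123" Version="2.0" IssueInstant="2014-09-15T00:00:00Z"/>'
)

# urlencode applies the final URL-encoding step to the base64 text.
url = "https://idp.example.org/sso?" + urlencode(
    {"SAMLRequest": encode_redirect(authn_request), "RelayState": "/dashboard"}
)
print(url)

received = parse_qs(urlparse(url).query)["SAMLRequest"][0]
print(decode_redirect(received) == authn_request)
```

The POST binding, by contrast, places the base64-encoded message in a hidden form field without compression, which is why it is preferred for larger messages such as signed responses.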

SAML profiles define how the SAML assertions, protocols, and bindings are combined to support particular usage scenarios. The Web Browser SSO Profile and the Single Logout Profile are the most commonly used profiles.

Identity Provider (IdP) initiated SSO with POST binding is one of the most popular implementations (see diagram below from the OASIS SAML Technical Overview for a typical authentication flow).

The SAML2 Web SSO ecosystem is very mature, cross-platform, and scalable. There are a number of open source implementations available as well. However, things are constantly changing in technology and identity federation is no exception. At the Cloud Identity Summit in 2012, Craig Burton, a well-known analyst in the identity space, declared:

"SAML is the Windows XP of Identity. No funding. No innovation. People still use it. But it has no future. There is no future for SAML. No one is putting money into SAML development. No one is writing new SAML code. SAML is dead."

Craig Burton further clarified his remarks by saying:

"SAML is dead does not mean SAML is bad. SAML is dead does not mean SAML isn’t useful. SAML is dead means SAML is not the future."

At the time, this provoked a storm in the Twitterverse because of the significant investments that have been made by enterprise customers to implement SAML2 for SSO.


There is an alternative to SAML2 Web SSO Profile called WS-Federation which is supported in Microsoft products like Active Directory Federation Services (ADFS), Windows Identity Foundation (WIF), and Azure Active Directory. Microsoft has been a strong promoter of WS-Federation and has implemented WS-Federation in several products. There is also a popular open source identity server on the .NET platform called Thinktecture IdentityServer v2 which also supports WS-Federation.

For enterprise SSO scenarios between business partners exclusively using Microsoft products and development environments, WS-Federation could be a serious contender. However, SAML2 is more widely supported and implemented outside of the Microsoft world. For example, Salesforce and Google Apps do not support WS-Federation for SSO. Note that Microsoft ADFS implements the SAML2 Web SSO Profile in addition to WS-Federation.

OpenID Connect

OpenID Connect is a simple identity layer on top of OAuth2. It was ratified by the OpenID Foundation in February 2014 but had been in development for several years. Nat Sakimura's Dummy’s guide for the Difference between OAuth Authentication and OpenID is a good resource for understanding the difference between OpenID, OAuth2, and OpenID Connect. In particular, it explains why OAuth2 alone is not strictly an authentication standard. The following diagram from the OpenID Connect specification represents the components of the OpenID Connect stack.

Also note that OAuth2 tokens can be JSON Web Tokens (JWTs) or SAML assertions.

The following is the basic flow as defined in the OpenID Connect specification:

  1. The RP (Client) sends a request to the OpenID Provider (OP).
  2. The OP authenticates the End-User and obtains authorization.
  3. The OP responds with an ID Token and usually an Access Token.
  4. The RP can send a request with the Access Token to the UserInfo Endpoint.
  5. The UserInfo Endpoint returns Claims about the End-User.
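
Step 1 of the flow above can be sketched as follows for the authorization code flow. The parameter names come from the OpenID Connect and OAuth2 specifications; the endpoint URL, client identifier, and redirect URI are hypothetical values that a real Relying Party would obtain from its registration with the OpenID Provider.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

def build_authentication_request(authorization_endpoint, client_id, redirect_uri):
    """Return the URL the RP redirects the browser to, plus the state and nonce to remember."""
    state = secrets.token_urlsafe(16)   # echoed back by the OP; protects against CSRF
    nonce = secrets.token_urlsafe(16)   # returned inside the ID Token; protects against replay
    params = {
        "response_type": "code",        # authorization code flow
        "scope": "openid profile",      # "openid" marks this as an OpenID Connect request
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "nonce": nonce,
    }
    return authorization_endpoint + "?" + urlencode(params), state, nonce

# Hypothetical OP endpoint and client registration values.
url, state, nonce = build_authentication_request(
    "https://op.example.org/authorize", "my-client-id", "https://rp.example.org/callback"
)
print(url)
```

After the End-User authenticates, the OP redirects back to the redirect URI with a code and the same state value. The RP then exchanges the code at the token endpoint for the ID Token and Access Token of step 3.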

There are two subsets of the Core functionality with corresponding implementer’s guides:

  • Basic Client Implementer’s Guide – for a web-based Relying Party (RP) using the OAuth code flow
  • Implicit Client Implementer’s Guide – for a web-based Relying Party using the OAuth implicit flow

OpenID Connect is particularly well-suited for modern applications which offer RESTful Web APIs, support JSON payloads, run on mobile devices, and are deployed to the Cloud. Despite being a relatively new standard, OpenID Connect also boasts an impressive list of implementations across platforms. It is already supported by big players like Google, Microsoft, PayPal, and Salesforce. In particular, Google is consolidating all federated sign-in support onto the OpenID Connect standard. Open Source OpenID Connect Identity Providers include the Java-based OpenAM and the .NET-based Thinktecture Identity Server v3.

From WS* to JW* and JOSE

As can be seen from the diagram above, a complete identity federation ecosystem based on OpenID Connect will also require standards for representing security assertions, digital signatures, encryption, and cryptographic keys. These standards include:

  • JSON Web Token (JWT)
  • JSON Web Signature (JWS)
  • JSON Web Encryption (JWE)
  • JSON Web Key (JWK)
  • JSON Web Algorithms (JWA).

There is a new acronym for these emerging JSON-based identity and security protocols: JOSE, which stands for Javascript Object Signing and Encryption. It is also the name of the IETF Working Group developing JWS, JWE, and JWK. A Java-based open source implementation called jose4j is available.
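
To show how these pieces fit together, here is a minimal Python sketch of a JWT protected with a JWS signature using the HS256 (HMAC with SHA-256) algorithm in the compact serialization: three base64url-encoded parts separated by dots. It uses only the standard library rather than a JOSE library such as jose4j. The shared secret and the claims are made up, and the verifier assumes HS256 and skips the claim checks (expiry, audience, issuer) that a real implementation needs.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """base64url encoding without padding, as used by JWS."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def sign(claims: dict, key: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)

def verify(token: str, key: bytes) -> dict:
    """Check the HS256 signature and return the claims; raises ValueError if it fails."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature)):
        raise ValueError("invalid signature")
    return json.loads(b64url_decode(signing_input.split(".")[1]))

SECRET = b"hypothetical-shared-secret"   # assumption: symmetric key shared by both parties
token = sign({"iss": "https://op.example.org", "sub": "alice"}, SECRET)
print(token)
print(verify(token, SECRET))
```

OpenID Providers more commonly sign ID Tokens with an asymmetric algorithm such as RS256 and publish the public keys as a JWK Set, so that Relying Parties can verify tokens without sharing a secret.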

Access Control with the User Managed Access (UMA)

According to the UMA Core specification,

User-Managed Access (UMA) is a profile of OAuth 2.0. UMA defines how resource owners can control protected-resource access by clients operated by arbitrary requesting parties, where the resources reside on any number of resource servers, and where a centralized authorization server governs access based on resource owner policy.
In the UMA protocol, OpenID Connect provides federated SSO and is also used to convey user claims to the authorization server. In a previous post titled Patient Privacy at Web Scale, I discussed the application of UMA to the challenges of patient privacy.

Monday, August 25, 2014

Ontologies for Addiction and Mental Disease: Enabling Translational Research and Clinical Decision Support

In a previous post titled Why do we need ontologies in healthcare applications, I elaborated on what ontologies are and why they are different from information models of data structures like relational database schemas and XML schemas commonly used in healthcare informatics applications. In this post, I discuss two interesting applications of ontology engineering related to addiction and mental disease treatment. The first is the use of ontologies for achieving semantic interoperability in  translational research. The second is the use of ontologies for modeling complex medical knowledge in clinical practice guidelines (CPGs) for the purpose of automated reasoning during execution in clinical decision support systems (CDS) at the point of care.

Why is Semantic Interoperability needed in biomedical translational research?

In order to accelerate the discovery of new effective therapeutics for mental health and addiction treatment, there is a need to integrate data across disciplines spanning biomedical research and clinical care delivery [1]. For example, linking data across disciplines can facilitate a better understanding of treatment response variability among patients in addiction treatment. These disciplines include:

  • Genetics, the study of genes.
  • Chemistry, the study of chemical compounds including substances of abuse like heroin.
  • Neuroscience, the study of the nervous system and the brain (addiction is a chronic disease of the brain).
  • Psychiatry, which is focused on the diagnosis, treatment, and prevention of addiction and mental disorders.

Each of these disciplines has its own terminology or controlled vocabularies. In the clinical domain for example, DSM-5 and RxNorm are used for documenting clinical care. In biomedical research, several ontologies have been developed over the last few years including:
  • The Gene Ontology (GO)
  • The Chemical Entities of Biological Interest Ontology (CHEBI)
  • NeuroLex, an OWL ontology covering major domains of neuroscience: anatomy, cell, subcellular, molecule, function, and dysfunction.

To facilitate semantic interoperability between these ontologies, there are best practices established by the Open Biomedical Ontologies (OBO) community. An example of a best practice is the use of an upper-level ontology called the Basic Formal Ontology (BFO) which acts as a common foundational ontology upon which new ontologies can be created. OBO ontologies and principles are available on the OBO Foundry web site.

Among the ontologies available on the OBO Foundry is the Mental Functioning Ontology (MF) [2, 3]. The MF is being developed as a collaboration between the University of Geneva in Switzerland and the University at Buffalo in the United States. The project also includes a Mental Disease Ontology (MD) which extends the MF and the Ontology for General Medical Science (OGMS). The Basic Formal Ontology (BFO) is an upper-level ontology for both the MF and the OGMS. The picture below is a view of the class hierarchy of the MD showing details of the class "Paranoid Schizophrenia" in the right pane of the window of the beta release of Protégé 5, an open source ontology development environment.

The following is a tree view of the "Mental Disease Course" class:

Ontology constructs defined by the OWL2 language can help establish common semantics (meaning) and relationships between entities across domains. These constructs provide automated inferencing capabilities such as equivalence (e.g., owl:sameAs and owl:equivalentClass) and subsumption (e.g., rdfs:subClassOf) relationships between entities.

In addition, publishing data sources following Linked Open Data (LOD) principles and semantic search using federated SPARQL queries can help answer new research questions. Another application is semantic annotation for natural language processing (NLP) applications.


Ontologies as a knowledge representation formalism for clinical decision support (CDS)

As a knowledge representation formalism, ontologies are well suited for modeling complex medical knowledge and can facilitate reasoning during the automated execution of clinical practice guidelines (CPGs) and Care Pathways (CPs) based on patient data at the point of care. Several approaches to modeling CPGs and CPs have been proposed in the past including PROforma, HELEN, EON, GLIF, PRODIGY, and SAGE. However, the lack of free and open source tooling has been a major impediment to a wide adoption of these knowledge representation formalisms. OWL has the advantage of being a widely implemented W3C Recommendation with available mature open source tools.

In practice, the medical knowledge contained in CPGs can be manually translated into IF-THEN statements in most programming languages. Executable CDS rules (like other complex types of business rules) can be implemented with a production rule engine using forward chaining. This is the approach taken by OpenCDS and some large scale CDS implementations in real world healthcare delivery settings. This allows CDS software developers to externalize the medical knowledge contained in clinical guidelines in the form of declarative rules as opposed to embedding that knowledge in procedural code. Many viable open source business rule management systems (BRMS) are available today and provide capabilities such as a rule authoring user interface, a rules repository, and a testing environment.
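
The forward chaining strategy mentioned above can be illustrated with a deliberately naive Python loop: rules fire whenever all of their conditions are present among the known facts, and each conclusion becomes a new fact that may trigger further rules. The rules and fact names below are invented for the example and are not taken from any clinical guideline; production engines use far more efficient matching algorithms than this repeated scan.

```python
def forward_chain(facts, rules):
    """Fire rules until no rule can add a new fact (naive forward chaining)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Hypothetical declarative rules: (set of required facts, derived fact).
RULES = [
    (frozenset({"opioid_use_disorder", "pregnant"}), "consider_specialist_referral"),
    (frozenset({"positive_opioid_screen", "meets_diagnostic_criteria"}), "opioid_use_disorder"),
    (frozenset({"consider_specialist_referral"}), "alert_care_team"),
]

patient_facts = {"positive_opioid_screen", "meets_diagnostic_criteria", "pregnant"}
derived = forward_chain(patient_facts, RULES) - patient_facts
print(sorted(derived))
```

Note that the rules are plain data kept apart from the engine, which is the externalization of knowledge described in the paragraph above: changing a guideline means editing the rule set, not the procedural code.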

However, production rule systems have a limitation. They do not scale because they require writing a rule for each clinical concept code (there are more than 311,000 active concepts in SNOMED CT alone). An alternative is to exploit the class hierarchy in an ontology so that subclasses of a given superclass can inherit the clinical rules that are applicable to the superclass (this is called subsumption). In addition to subsumption, an OWL ontology also supports reasoning with description logic (DL) axioms [4].
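
The following Python sketch illustrates rule inheritance through subsumption with a toy hierarchy. The class names loosely echo the mental disease example used earlier in this post, but the hierarchy, the rule names, and the single-parent assumption are all simplifications; an OWL reasoner would compute the superclasses from the ontology's axioms, and OWL classes can have several parents.

```python
# Hypothetical mini-hierarchy (child -> parent); labels are illustrative, not SNOMED CT codes.
PARENT = {
    "paranoid_schizophrenia": "schizophrenia",
    "schizophrenia": "psychotic_disorder",
    "psychotic_disorder": "mental_disease",
}

# Rules are attached once, at the most general class to which they apply.
RULES = {
    "psychotic_disorder": ["assess_suicide_risk"],
    "schizophrenia": ["consider_antipsychotic_review"],
}

def ancestors(concept):
    """Yield the concept itself, then each superclass up to the root."""
    while concept is not None:
        yield concept
        concept = PARENT.get(concept)

def applicable_rules(concept):
    """Collect the rules of the concept and of every class that subsumes it."""
    return [rule for c in ancestors(concept) for rule in RULES.get(c, [])]

print(applicable_rules("paranoid_schizophrenia"))
# ['consider_antipsychotic_review', 'assess_suicide_risk']
```

Two rules cover every present and future subclass of the two superclasses, instead of one rule per concept code.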

An ontology designed for a clinical decision support (CDS) system can integrate the clinical rules from a CPG, a domain ontology like the Mental Disease Ontology (MD), and the patient medical record from an EHR database in order to provide inferences in the form of treatment recommendations at the point of care. The OWL API [5] facilitates the integration of ontologies into software applications. It supports inferencing using reasoners like Pellet and HermiT. OWL2 reasoning capabilities can be enhanced with rules represented in SWRL (Semantic Web Rule Language) which is implemented by reasoners like Pellet as well as the Protégé OWL development environment. In addition to inferencing, another benefit of an OWL-based approach is transparency: the CDS system can provide an explanation or justification of how it arrives at the treatment recommendations.

Nonetheless, these approaches are not mutually exclusive: a production rule system can be integrated with business processes, ontologies, and predictive analytics models. Predictive analytics models provide a probabilistic approach to treatment recommendations to assist in the clinical decision making process.


[1]  Janna Hastings, Werner Ceusters, Mark Jensen, Kevin Mulligan and Barry Smith. Representing mental functioning: Ontologies for mental health and disease. Proceedings of the Mental Functioning Ontologies workshop of ICBO 2012, Graz, Austria.

[2]  Ceusters, W. and Smith, B. (2010a). Foundations for a realist ontology of mental disease. Journal of Biomedical Semantics, 1(1), 10.

[3] Hastings, J., Smith, B., Ceusters, W., and Mulligan, K. (2012). The mental functioning ontology., last accessed August 24, 2014

[4] Sesen MB, Peake MD, Banares-Alcantara R, Tse D, Kadir T, Stanley R, Gleeson F, Brady M. 2014 Lung Cancer Assistant: a hybrid clinical decision support application for lung cancer care. J. R. Soc. Interface 11: 20140534.

[5] Matthew Horridge, Sean Bechhofer. The OWL API: A Java API for OWL Ontologies Semantic Web Journal 2(1), Special Issue on Semantic Web Tools and Systems, pp. 11-21, 2011.

Sunday, August 17, 2014

Natural Language Processing (NLP) for Clinical Decision Support: A Practical Approach

A significant portion of the electronic documentation of clinical care is captured in the form of unstructured narrative text like psychotherapy and progress notes. Despite the big push to adopt structured data entry (as required by the Meaningful Use incentive program for example), many clinicians still like to document care using free narrative text. The advantage of using narrative text as opposed to coded entries is that narrative text can tell the story of the patient and the care provided particularly in complex cases. My opinion is that free narrative text should be used to complement coded entries when necessary to capture relevant information.

Furthermore, medical knowledge is expanding very rapidly. For example, PubMed has more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. It is impossible for the human brain to keep up with that amount of knowledge. These unstructured sources of knowledge contain the scientific evidence that is required for effective clinical decision making in what is referred to as Evidence-Based Medicine (EBM).

In this post, I discuss two practical applications of Natural Language Processing (NLP). The first is the use of NLP tools and techniques to automatically extract clinical concepts and other insight from clinical notes for the purpose of providing treatment recommendations in Clinical Decision Support (CDS) systems. The second is the use of text analytics techniques like clustering and summarization for Clinical Question Answering (CQA).

The emphasis of this post is on a practical approach using freely available and mature open source tools as opposed to an academic or theoretical approach. For a theoretical treatment of the subject, please refer to the book Speech and Language Processing by Daniel Jurafsky and James Martin.

Clinical NLP with Apache cTAKES

Based on the Apache Unstructured Information Management Architecture (UIMA) framework and the Apache OpenNLP natural language processing toolkit, Apache cTAKES provides a modular architecture utilizing both rule-based and machine learning techniques for information extraction from clinical notes. cTAKES can extract named entities (clinical concepts) from clinical notes in plain text or HL7 CDA format and map these entities to various dictionaries including the following Unified Medical Language System (UMLS) semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, and medications.

cTAKES includes the following key components which can be assembled to create processing pipelines:

  • Sentence boundary detector based on the OpenNLP Maximum Entropy (ME) sentence detector.
  • Tokenizer
  • Normalizer using the National Library of Medicine's Lexical Variant Generation (LVG) tool
  • Part-of-speech (POS) tagger
  • Shallow parser
  • Named Entity Recognition (NER) annotator using dictionary look-up to UMLS concepts and semantic types. The Drug NER can extract drug entities and their attributes such as dosage, strength, route, etc.
  • Assertion module which determines the subject of the statement (e.g., is the subject of the statement the patient or a parent of the patient) and whether a named entity or event is negated (e.g., does the presence of the word "depression" in the text imply that the patient has depression).
Apache cTAKES 3.2 has added YTEX, a set of extensions developed at Yale University which provide integration with MetaMap, semantic similarity, export to Machine Learning packages like Weka and R, and feature engineering.
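
The dictionary look-up and negation steps listed above can be illustrated with a toy Python sketch. It does not use the cTAKES API and is far simpler than the actual annotators: the dictionary, the concept identifiers, the negation cues, and the window size are all assumptions made for the example.

```python
import re

# Hypothetical mini-dictionary; the identifiers are placeholders, not real UMLS concept IDs.
DICTIONARY = {"depression": "CONCEPT_001", "insomnia": "CONCEPT_002", "anxiety": "CONCEPT_003"}
NEGATION_CUES = {"no", "not", "denies", "without"}
WINDOW = 3  # assumed number of preceding tokens scanned for a negation cue

def annotate(sentence):
    """Return (term, concept id, negated?) for each dictionary term found in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    annotations = []
    for i, token in enumerate(tokens):
        if token in DICTIONARY:
            preceding = tokens[max(0, i - WINDOW):i]
            negated = any(t in NEGATION_CUES for t in preceding)
            annotations.append((token, DICTIONARY[token], negated))
    return annotations

for annotation in annotate("Patient denies depression but reports insomnia."):
    print(annotation)
```

Here "depression" is found but marked as negated, while "insomnia" is asserted. A fixed token window is a crude heuristic; it ignores sentence structure, the subject of the statement, and multi-word terms, which is why a full pipeline combines the components listed above.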

The following diagram from the Apache cTAKES Wiki provides an overview of these components and their dependencies:

Massively Parallel Clinical Text Analytics in the Cloud with GATECloud

The General Architecture for Text Engineering (GATE) is a mature, comprehensive, and open source text analytics platform. GATE is a family of tools which includes:

  • GATE Developer: an integrated development environment (IDE) for language processing components with a comprehensive set of available plugins called CREOLE (Collection of REusable Objects for Language Engineering). 
  • GATE Embedded: an object library for embedding services developed with GATE Developer into third-party applications.
  • GATE Teamware: a collaborative semantic annotation environment based on a workflow engine for creating manually annotated corpora for applying machine learning algorithms. 
  • GATE Mímir: the "Multi-paradigm Information Management Index and Repository" which supports a multi-paradigm approach to index and search over text, ontologies, and semantic metadata.
  • GATE Cloud: a massively parallel clinical text analytics platform (Platform as a Service or PaaS) built on the Amazon AWS Cloud.
What makes GATE particularly attractive is the recent addition of PaaS which can boost the productivity of people involved in large scale text analytics tasks.


Clustering, Classification, Text Summarization, and Clinical Question Answering (CQA)


An unsupervised machine learning approach called Clustering can be used to organize large volumes of medical literature into groups (clusters) based on some similarity measure (such as the Euclidean distance). Clustering can be applied at the document, search result, and word/topic levels. Carrot2 and Apache Mahout are open source projects that provide several methods for document clustering. For example, the Latent Dirichlet Allocation learning algorithm in Apache Mahout automatically clusters words into topics and documents into mixtures of topics. Other clustering algorithms in Apache Mahout include: Canopy, Mean-Shift, Spectral, K-Means and Fuzzy K-Means. Apache Mahout is part of the Hadoop ecosystem and can therefore scale to very large volumes of unstructured text.
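
As a rough illustration of document clustering with the Euclidean distance, the sketch below runs K-Means (Lloyd's algorithm) over simple term-count vectors in plain Python. It is not Mahout or Carrot2 code; the four toy documents, the helper names, and the choice to seed the centroids with two of the documents are assumptions made to keep the example small and deterministic.

```python
import math
from collections import Counter


def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return [[Counter(d.lower().split())[w] for w in vocab] for d in docs]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, init_centroids, iterations=10):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "insulin glucose diabetes",
    "glucose diabetes insulin therapy",
    "stent artery cardiac",
    "cardiac artery stent surgery",
]
vectors = vectorize(docs)
# Seeding with the first and third documents keeps the run deterministic.
labels = kmeans(vectors, init_centroids=[vectors[0], vectors[2]])
print(labels)  # prints [0, 0, 1, 1]
```

The two diabetes documents end up in one cluster and the two cardiology documents in the other. Real document collections would normally call for TF-IDF weighting and a more careful initialization.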

Document classification essentially consists of assigning labels from a predefined set to documents. This can be achieved through supervised machine learning algorithms. Apache Mahout implements the Naive Bayes classifier.
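
The idea behind a Naive Bayes document classifier can be sketched in a few lines of plain Python. This is a minimal multinomial variant with Laplace (add-one) smoothing, not Mahout's implementation; the training documents, the two labels, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-label document counts and per-label word counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Pick the label with the highest log posterior (Laplace smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: fraction of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set; labels and texts are made up for the example.
training = [
    ("insulin glucose diabetes", "endocrinology"),
    ("thyroid hormone glucose", "endocrinology"),
    ("stent artery cardiac", "cardiology"),
    ("cardiac arrhythmia ecg", "cardiology"),
]
model = train(training)
print(classify("glucose insulin levels", model))
print(classify("ecg cardiac", model))
```

With this toy training set, the first query is assigned to "endocrinology" and the second to "cardiology".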

Text summarization techniques can be used to present succinct and clinically relevant evidence to clinicians at the point of care. MEAD is an open source project that implements multiple summarization algorithms. In the biomedical domain, SemRep is a program that extracts semantic predications (subject-relation-object triples) from biomedical free text. Subject and object arguments of each predication are concepts from the UMLS Metathesaurus and the relation is from the UMLS Semantic Network (e.g., TREATS, CO-OCCURS_WITH). The SemRep summarization provides a short summary of these concepts and their semantic relations.

AskHermes (Help clinicians to Extract and aRticulate Multimedia information for answering clinical quEstionS) is a project that attempts to implement these techniques in the clinical domain. It allows clinicians to enter questions in natural language and uses the following unstructured information sources: MEDLINE abstracts, PubMed Central full-text articles, eMedicine documents, clinical guidelines, and Wikipedia articles.

The processing pipeline in AskHermes includes the following: Question Analysis, Related Questions Extraction, Information Retrieval, Summarization, and Answer Presentation. AskHermes performs question classification using MMTx (MetaMap Technology Transfer) to map keywords to UMLS concepts and semantic types. Classification is achieved through supervised machine learning algorithms such as Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). Summarization and answer presentation are based on clustering techniques. AskHermes is powered by open source components including: JBoss Seam, Weka, Mallet, Carrot2, Lucene/Solr, and WordNet (a lexical database for the English language).

Saturday, August 9, 2014

Enabling Scalable Realtime Healthcare Analytics with Apache Spark

Modern and massively parallel computing platforms can process humongous amounts of data in real time to obtain actionable insights for effective clinical decision making. In this post, I discuss an emerging Big Data platform called Apache Spark and its application to remote real-time healthcare monitoring using data from medical devices and wearable sensors. The goal is to provide effective remote care for an increasingly aging population as well as public health surveillance.

The Apache Spark Framework

Apache Spark has emerged during the last couple of years as an innovative platform for Big Data and in-memory cluster computing capable of running programs up to 100x faster than traditional Hadoop MapReduce. Apache Spark is written in Scala, a functional programming language (see my previous post titled Navigating in Scala land). Spark also offers Java and Python APIs. The Scala API allows developers to interact with Spark by using very concise and expressive Scala code.

The Spark stack also includes the following integrated tools:

  • Spark SQL which allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark through a data abstraction called SchemaRDD. Supported data sources include Parquet files (a columnar storage format for Hadoop), JSON datasets, or data stored in Apache Hive.

  • Spark Streaming which enables fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets. The ingested data can be directly processed with Spark built-in Machine Learning algorithms.

  • MLlib (Machine Learning Library) provides a library of practical Machine Learning algorithms including support vector machines (SVM), logistic regression, decision trees, naive Bayes, and k-means clustering.

  • GraphX which provides graph-parallel computation for graph analytics applications such as social network analysis.

Apache Spark can also play nicely with other frameworks within the Hadoop ecosystem. For example, it can run standalone, on Hadoop 2's YARN cluster manager, on Amazon EC2, or on a Mesos cluster manager. Spark can also read data from HDFS, HBase, Cassandra, or any other Hadoop data source. Other noteworthy integrations include:

  • SparkR, an R package allowing the use of Spark from R, a very popular open source software environment for statistical computing with more than 5800 packages including Machine Learning packages; and

  • H2O-Sparkling which provides an integration with the H2O platform through in-memory sharing with Tachyon, a memory-centric distributed file system for data sharing across cluster frameworks. This allows Spark applications to leverage advanced distributed Machine Learning algorithms supported by the H2O platform like emerging Deep Learning algorithms.


Wearable Sensors for Remote Healthcare Monitoring 

Three factors are contributing to the availability of massive amounts of clinical data: the rising adoption of EHRs by providers thanks in part to the Meaningful Use incentive program; the increasing use of medical devices including wearable sensors used by patients outside of healthcare facilities; and medical knowledge (for example in the form of medical research literature).

One promising area in Healthcare Informatics where Big Data architectures like the one provided by Apache Spark can make a difference is in applications using data from wearable health monitoring sensors for anomaly detection, care alerting, diagnosis, care planning, and prediction. For example, anomaly detection can be performed at scale using the k-means clustering machine learning algorithm in Spark.
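
One common way to use k-means for anomaly detection is to flag readings that lie far from every cluster center. The sketch below shows that scoring step in plain Python rather than Spark. The centroid values, the distance threshold, and the sample readings are all assumptions for illustration; in a real deployment the centers would come from a k-means model trained on historical data, and the features would usually be scaled first.

```python
import math

# Assumed cluster centers as (heart rate in bpm, SpO2 in %), such as might
# result from a k-means run on historical readings. Values are illustrative.
CENTROIDS = [(65.0, 98.0), (95.0, 97.0)]  # e.g. resting, light activity
THRESHOLD = 15.0  # hypothetical cut-off; in practice tuned on held-out data


def distance_to_nearest(point, centroids):
    """Euclidean distance from a reading to its closest cluster center."""
    return min(math.dist(point, c) for c in centroids)


def find_anomalies(readings, centroids=CENTROIDS, threshold=THRESHOLD):
    """Return the readings that are farther than threshold from every center."""
    return [r for r in readings if distance_to_nearest(r, centroids) > threshold]


readings = [(66.0, 98.0), (92.0, 96.0), (150.0, 85.0), (70.0, 97.0)]
print(find_anomalies(readings))  # prints [(150.0, 85.0)]
```

Only the reading with a heart rate of 150 bpm and an SpO2 of 85% is flagged, since it is far from both assumed centers.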

These sensors and devices are part of a larger trend called the "Internet of Things". They enable new capabilities such as remote health monitoring for personalized medicine and chronic care management for an increasingly aging population as well as public health surveillance for outbreaks and epidemics.

Wearable sensors can collect vital signs data like weight, temperature, blood pressure (BP), heart rate (HR), blood glucose (BG), respiratory rate (RR), electrocardiogram (ECG), oxygen saturation (SpO2), and photoplethysmography (PPG). Spark Streaming can be used to perform real-time stream processing on sensor data. The data can then be analyzed using the Machine Learning algorithms available in MLlib and in other integrated frameworks like R and H2O. What makes Spark particularly suitable for this type of application is that sensor data meet the Big Data criteria of volume, velocity, and variety.
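
Spark Streaming processes a live stream as a sequence of small batches. The plain-Python sketch below imitates that idea without Spark: it groups heart rate readings into fixed time windows and computes a per-patient mean for each window. The tuple layout (patient id, timestamp in seconds, heart rate), the 10-second window, and the sample values are assumptions made for the example.

```python
from collections import defaultdict


def window_means(readings, window_seconds):
    """Group (patient_id, timestamp, heart_rate) tuples into fixed windows
    and return the mean heart rate per (patient_id, window_start)."""
    buckets = defaultdict(list)
    for patient_id, ts, hr in readings:
        # Integer division maps each timestamp to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        buckets[(patient_id, window_start)].append(hr)
    return {key: sum(values) / len(values) for key, values in buckets.items()}


# Toy stream of readings from two hypothetical patients.
stream = [
    ("p1", 0, 70), ("p1", 4, 74), ("p2", 3, 88),
    ("p1", 11, 120), ("p2", 12, 90), ("p2", 17, 94),
]
means = window_means(stream, window_seconds=10)
for (patient_id, window_start), mean_hr in sorted(means.items()):
    print(patient_id, window_start, mean_hr)
```

Per-window aggregates like these could then be passed to a model for scoring, for example to raise an alert when a patient's mean heart rate in a window departs sharply from earlier windows.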

Researchers predict that internet use on mobile phones will increase 20-fold in Africa in the next five years. The number of mobile subscriptions in sub-Saharan Africa is expected to reach 635 million by the end of this year. This unprecedented level of connectivity (fueled in part by the historical lack of land line infrastructure) provides opportunities for effective public health surveillance and disease management in the developing world.

Apache Spark is the type of open source computing infrastructure that is needed for distributed, scalable, and real-time healthcare analytics for reducing healthcare costs and improving outcomes.