Text Analytics in the Healthcare Industry: Data Warehousing and Applications

Abstract— Text analytics is the process of extracting information from text. It involves structuring the text to evaluate it, discover patterns and interpret the output. It adds meaning to data and finds nuggets of information in both transaction-based and decision-support systems by removing the barrier between structured and unstructured data. Analysis of text data helps discover new relationships in clinical databases to better understand observations and outcomes in care and other fields of medicine. Despite the challenges, these new insights aid in preventing and predicting unfavorable health outcomes.

The purpose of this article is to provide a comprehensive overview of how text data is collected, analyzed and used in the healthcare industry. First, a brief introduction to the need for text analytics in healthcare is given. Then, summaries of a few papers are presented to understand the motivation and the need to build a data warehouse. To illustrate the process of building a data warehouse, the paper discusses how the text data was sourced and the ETL tools and technologies used to build the warehouse and analyze the data. Subsequently, typical text analytics tasks, applications and challenges are presented. Finally, trends not discussed in the papers are reported.


Introduction

The healthcare industry collects a vast amount of data in the form of text. Electronic medical records with images and annotations resulted in between 26.5 and 44 petabytes of data collected in a California-based health network in 2013 alone [1]. Eighty percent of this information is stored as text [2], in the form of electronic health records, handwritten physicians' notes about patient visits, prescriptions, letters and social media. Further, "the average doctor spends 40 percent of their time processing thousands of administrative documents and forms and chasing down hundreds of missing lab and imaging orders" [3].

Analysis of this massive, diverse data provides actionable insights. The industry harnesses this data to reduce costs and to personalize and streamline services while improving the quality of patient care. The information gathered from data analysis also helps promote clinical and research initiatives with fewer medical errors and lower costs, while fulfilling mandatory compliance and regulatory requirements. For example, it is estimated that the use of analytics could save $300 billion per year in U.S. healthcare alone [1].

However, according to Gartner, only 5% of the collected text is being analyzed by the industry [4].

Additionally, text brings several challenges, such as the identification of key clinical variables amid intense terminological variability and ambiguity. The data also vary in complexity, length and use of technical vocabulary. This makes knowledge discovery complex.

This article discusses the literature with real world examples of text analytics to understand how data is collected, analyzed and used in the industry.

Related Literature

Following is a summary of the reasons why a few of the papers built their data warehouses (DWH).

One of the papers used clinical information from patients at Ajou University Hospital (Suwon, Korea) to compare drug-eluting stents (DES) for percutaneous coronary intervention (PCI) to treat coronary heart disease. However, the electronic health records presented a challenge: files written as free text could not be analyzed using existing tools. Furthermore, the huge quantity of coronary angiography (CAG) reports made manual analysis impossible. The researchers developed their own regular expressions to identify strings in the texts, which helped them extract patient features such as sex, age and history of hypertension from the reports. These features populated the DWH used for the statistical analysis. The clinical performance of DES and analysis of the hazard ratio of target vessel revascularization (TVR) helped validate its value and safety [5].

Another study analyzed text stored in a DWH, established in 2010, of elderly and dependent residents in a private group specializing in medical accommodation. The objective was to build a physiotherapy corpus, generate knowledge about the residents' lives and create a linking tool for the medical staff. It focused on collecting physiotherapy data to record motor functions and using the data for personalized interventions, leading to improved targeting of care [6].

Another paper describes a scalable, open-source SQL Server Integration Services package, called Regextractor, which transforms electronic health records into easy-to-analyze formats using regular-expression parsers. The package was used in an extract, transform and load (ETL) workflow. A data mart containing pulmonary function data was validated against manual chart review [7].

Text-mining technologies extracted and analyzed data from unstructured and semi-structured sources such as XML and MML (Mining Markup Language). This involved dealing with texts in documents and discovering hidden patterns. Text analytics is used to search and to enable clustering of results [8].

The authors discuss how text analytics deals with unstructured data for decision making. Although tools are available to handle structured data, it is difficult to organize, process and find meaningful patterns in unstructured data. They developed a framework to transform unstructured textual sources, and they also discuss different analytics techniques, their drawbacks, and data modelling for unstructured sources [9].

Text analytics is different because it looks for structure in the textual source material and applies linguistic and/or statistical techniques to extract concepts and patterns, which can then be used to categorize and classify documents, video and audio. The analysis helps find patterns, relationships and information that enable businesses to make decisions [2].

The authors discussed techniques used to search for relationships in data gathered on 3,902 obstetrical patients. They describe transferring the database from a computer-based patient record system (CPRS) into a data warehouse server, then extracting and cleaning the data to perform exploratory factor analysis for determining potential factors contributing to preterm birth. They tested the findings by comparing the results with other studies [10].

Tools and Technology for Data Warehouse and Analysis

Clinical information on patients at Ajou University Hospital was extracted from a total of 13,567 free-text records, such as coronary angiography (CAG) reports, between February 2010 and October 2014. This is the only study with an ethics statement: it was approved by the Institutional Review Board because the data was anonymized retrospectively. Regular expressions were written in R, and the 'stringr' package was used for mining patterns. Regular expressions are a cost-effective and reliable method for pattern matching and sentiment analysis on these semi-structured documents. Three programmers wrote scripts to extract three different tables, and only the matching parts of the tables were used for the data warehouse. The data was organized into a table of predefined vessel and stent terms using recurring delimiters such as newlines, colons and slashes. The completed PCI data warehouse was manually validated. Hypertension was identified either by an assigned ICD-10 diagnostic code or by the prescription of at least one type of antihypertensive drug [5].
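The study's actual R patterns and report format are not reproduced in the paper, but the regular-expression feature-extraction step can be sketched in Python. The report layout below is a made-up illustration, not the real CAG report format:

```python
import re

# Hypothetical free-text report snippets; the study's real Korean reports
# and 'stringr' patterns are not shown in the paper.
reports = [
    "Name: Kim / Sex: M / Age: 63\nHistory: hypertension, diabetes",
    "Name: Lee / Sex: F / Age: 58\nHistory: none",
]

# One regular expression per structured feature to extract.
sex_re = re.compile(r"Sex:\s*([MF])")
age_re = re.compile(r"Age:\s*(\d+)")
htn_re = re.compile(r"hypertension", re.IGNORECASE)

rows = []
for text in reports:
    rows.append({
        "sex": sex_re.search(text).group(1),
        "age": int(age_re.search(text).group(1)),
        "hypertension": bool(htn_re.search(text)),
    })

print(rows)  # structured rows ready to load into the warehouse
```

Each matched group becomes a column in the warehouse table, which is the essence of the study's text-to-table pipeline.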

Data sources included both the residents' health paths and existing EHR data. The data was de-identified and anonymized. Variables such as age, sex, medical history and pathology were also included from a socio-demographic table to create a complete picture. ORACLE® queries were used to extract data from DWH tables to generate a physiotherapy corpus, and RStudio® was used for text analysis. There is no mention of the package used for the analysis [6].

The database included eleven variables from the pulmonary function tests of 100 random participants with scleroderma. A research assistant manually appraised and entered the relevant data. The data was integrated and tested against the automated pulmonary function data within the Northwestern Medical Enterprise Data Warehouse, a 10 TB Microsoft SQL Server 2008 R2 database developed to collect and integrate patient information gathered from over 30 medical and clinical research database systems. Data marts are created by joining data from copies of many of these database systems on a campus-wide patient identifier, helping programmers integrate, aggregate or process the data [7].

Microsoft SQL Server Integration Services is the ETL tool used to create both the operational data stores and the data marts. Typecasting and applying logic are the most common transformations, performed together with deserialization to enable nightly synchronization with the Cerner Millennium Oracle database. An open-source SQL Server Integration Services text-extraction package called Regextractor was also programmed to extract discrete numeric data from text reports [7].
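Regextractor itself is an SSIS package, but the kind of parsing it performs, turning a narrative report into discrete numeric columns, can be sketched in Python. The report text and field names below are assumptions, not the actual Regextractor report format:

```python
import re

# Illustrative pulmonary function report text (not a real report format).
report = "FVC: 3.21 L (78% predicted)  FEV1: 2.45 L (72% predicted)"

# One pattern per discrete measurement, mirroring the idea of
# regular-expression parsers that turn narrative text into numeric columns.
patterns = {
    "fvc_l": r"FVC:\s*([\d.]+)\s*L",
    "fev1_l": r"FEV1:\s*([\d.]+)\s*L",
    "fev1_pct_pred": r"FEV1:.*?\((\d+)% predicted\)",
}

# Missing measurements become None instead of failing the load.
row = {col: (float(m.group(1)) if (m := re.search(pat, report)) else None)
       for col, pat in patterns.items()}
print(row)  # one data-mart row of discrete numeric values
```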

The authors use text tagging to process documents and extract information. Text tagging, or wrapping text in XML tags, consists of examining text to identify nouns (names, products, organizations, locations, e-mail addresses) and numerical expressions (measurements, percentages and monetary values) related to the domain. Applying tagging techniques helps convert plain text into a model that can be loaded into databases or data warehouses. Statistical techniques are then used to investigate the text [8].
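As a rough illustration of the tagging idea, a few pattern-based taggers can wrap matched spans in XML-style tags. Real systems use trained entity recognizers; the patterns, labels and sentence here are simplified assumptions:

```python
import re

text = ("Dr. Alvarez prescribed 20 mg of lisinopril; "
        "contact clinic@example.org for 15% copay details.")

# Simple pattern-based taggers; each match is wrapped in an XML-style tag.
taggers = [
    ("EMAIL", r"[\w.]+@[\w.]+\w"),
    ("PERCENT", r"\d+%"),
    ("DOSE", r"\d+\s*mg"),
]

tagged = text
for label, pattern in taggers:
    tagged = re.sub(pattern, lambda m, l=label: f"<{l}>{m.group(0)}</{l}>", tagged)

print(tagged)
```

The tagged output is semi-structured, so downstream loaders can pull each entity type into its own column or dimension.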

The output of the text analytics is a star schema consisting of a fact table named DOCUMENT_ANALYSIS and four dimension tables. The fact table contains one row for each occurrence of a keyword in a document. The dimension tables describe each analysis entry in the fact table: the date, the keyword and its categories, and the corresponding document title and content [8].
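A minimal sketch of this star schema can be built in SQLite. The fact table name follows the paper; the dimension column names and the occurrence-count measure are illustrative assumptions:

```python
import sqlite3

# In-memory database holding a toy version of the star schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DIM_DATE     (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE DIM_KEYWORD  (keyword_id INTEGER PRIMARY KEY, keyword TEXT);
CREATE TABLE DIM_CATEGORY (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE DIM_DOCUMENT (document_id INTEGER PRIMARY KEY, title TEXT, content TEXT);
-- Fact table: keyword occurrences in documents, keyed to the dimensions.
CREATE TABLE DOCUMENT_ANALYSIS (
    date_id     INTEGER REFERENCES DIM_DATE,
    keyword_id  INTEGER REFERENCES DIM_KEYWORD,
    category_id INTEGER REFERENCES DIM_CATEGORY,
    document_id INTEGER REFERENCES DIM_DOCUMENT,
    occurrences INTEGER
);
""")
con.execute("INSERT INTO DIM_KEYWORD VALUES (1, 'hypertension')")
con.execute("INSERT INTO DIM_DOCUMENT VALUES (1, 'Visit note', '...')")
con.execute("INSERT INTO DOCUMENT_ANALYSIS VALUES (NULL, 1, NULL, 1, 3)")

total = con.execute("SELECT SUM(occurrences) FROM DOCUMENT_ANALYSIS").fetchone()[0]
print(total)  # keyword occurrences aggregated from the fact table
```

Analytical queries then join the fact table to whichever dimensions a report needs, the standard star-schema access pattern.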

The data included statistics on inpatient and emergency department utilization, with a focus on uninsured events, categorized by chronic disease and care management technique. The article details a methodology to justify the cost of developing a DWH based on a business case. The authors created several scenarios to estimate the cost over five years using assumptions based on their experience, and graphed the break-even points for the project. The staffing of the DWH is also discussed, with the emphasis that IT departments cannot be the sole driver: a DWH is more successful when commissioned and led by the business. The project manager was a practicing physician with a background in information systems; the supporting team included a nurse, a data analyst and a systems analyst with healthcare experience; and the consultants were experienced in healthcare systems development [9].

The database identified is the computer-based patient record system known as The Medical Record (TMR), developed at Duke University. TMR's data structure uses a proprietary class-oriented approach which stores all of a patient's information, such as demographics, study results, problems, therapies and encounter summaries, in a single record. The data warehouse was created on a centralized server dedicated to text analytics. TMR data was extracted into relational tables on Microsoft SQL Server 4.2, running on a PC server with a 60 MHz Pentium CPU, 1,700 megabytes of hard disk and 16 megabytes of RAM under the Windows NT Server 3.5 operating system and file system. A sample two-year dataset (1993-1994) was created from the data warehouse, via multiple SQL queries, to be mined for knowledge discovery. Data cleaning was done using Paradox Application Language scripts to identify problems such as missing values and to correct errors. Alphanumeric fields were converted into numerical fields to enable statistical analysis [10].
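The cleaning and recoding steps, which the study performed with Paradox Application Language scripts, can be sketched in Python. The field names and coding scheme below are hypothetical, not taken from TMR:

```python
# Raw extracted records with alphanumeric fields and a missing value.
raw_records = [
    {"patient_id": "A1", "smoking": "never",   "weight_kg": "61"},
    {"patient_id": "A2", "smoking": "current", "weight_kg": ""},   # missing
    {"patient_id": "A3", "smoking": "former",  "weight_kg": "78"},
]

# Assumed numeric coding for an alphanumeric field, to enable statistics.
smoking_codes = {"never": 0, "former": 1, "current": 2}

clean = []
for rec in raw_records:
    clean.append({
        "patient_id": rec["patient_id"],
        "smoking": smoking_codes[rec["smoking"]],
        # Convert to float; represent missing values explicitly as None.
        "weight_kg": float(rec["weight_kg"]) if rec["weight_kg"] else None,
    })

# Flag records with missing values for review, as the cleaning scripts did.
missing = [r["patient_id"] for r in clean if r["weight_kg"] is None]
print(clean, missing)
```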

The construction of a PCI data warehouse opens possibilities for further research. The data warehouse is large enough to provide statistical validity. The stored EHRs can be analyzed with technologies such as regular expressions and natural language processing (NLP). NLP tools that extract information through semantic analysis of medical records, such as the Linguistic String Project and the Medical Language Extraction and Encoding system (MedLEE), can also be used on the data [5].

The residents' health data was analyzed using three unsupervised dimensionality-reduction techniques from the R® package FactoMineR®: principal component analysis (PCA), multiple correspondence analysis (MCA) and hierarchical clustering (HC) on principal components, via the HCPC function. The algorithms formed clusters from the textual data which can classify subgroups, such as repeat or at-risk patients, and can predict future health problems such as hospitalization [6].
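FactoMineR is an R package, but the PCA step at the core of these techniques can be sketched with NumPy. The data below is a random stand-in, not resident health data:

```python
import numpy as np

# Toy matrix: 20 "residents" by 5 numeric health variables (random stand-in).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

Xc = X - X.mean(axis=0)               # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                   # first two principal axes
scores = Xc @ components.T            # resident coordinates on those axes

# Share of total variance captured by the first two components;
# clustering (e.g. HCPC in FactoMineR) would run on these scores.
ratio = float((s[:2] ** 2).sum() / (s ** 2).sum())
print(scores.shape, round(ratio, 3))
```

Hierarchical clustering on the `scores` matrix then groups similar residents, which is what HCPC automates in R.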

The article demonstrated how collaboration between DWH architects and healthcare professionals can facilitate access to health data. The methodology can also be used to create multi-use, scalable and accurate data[7].

Text tagging and annotation have made it easy to integrate unstructured data with structured data [8].

In summary, healthcare data comes from internal and external sources in both structured and unstructured formats, such as flat files, .csv, ASCII/text and relational tables. It also comes from multiple geographic locations and from different healthcare providers' sites, in numerous applications such as transaction-processing applications and databases.

Furthermore, a DWH allows centralized storage of data from multiple locations and sources, providing an environment for enhanced decision support and analytics. Numerous tools and technologies exist to convert this data into actionable information. For example, Intermountain Health Care built a DWH from five different sources: a clinical data repository, an acute care case-mix system, a laboratory information system, an ambulatory case-mix system and a health plans database. It uses the DWH to find and implement better evidence-based clinical solutions. A distributed network topology has also been suggested instead of a data warehouse for more efficient data mining [11].

Typical Business Intelligence Tasks

Text analysis is important for increasing efficiency and effectiveness in the industry, with potential savings of $165 billion in clinical operations and $108 billion in R&D [1]. Some of the tasks achieved through text analysis are as follows:

  • Information Extraction is the analysis of unstructured text to identify key phrases and relationships within the text. It involves pattern matching to look for predefined sequences in text. It is also used to identify concepts shared between documents, helping users find information they perhaps would not have found using traditional search methods. It is commonly used when analyzing large volumes of text and promotes browsing for information rather than searching for it. An example is using the analysis to make associations between different research studies.
  • Topic Tracking matches documents to users based on their past behavior. In industry, it is used to track new products, competitor activity and changes to the market.
  • Text Summarization reduces the length and detail of a document while keeping its main meaning. It helps users decide whether a lengthy document meets their needs and is worth reading further.
  • Categorization is a supervised machine-learning technique that identifies the main themes of a document by placing it into a predefined set of topics. The process counts words and infers, from a predefined relationship, the topic to which the document belongs.
  • Clustering is an unsupervised machine-learning technique used to group similar documents. It does not require the documents to be pre-labelled. It is useful for management information systems and helps organize thousands of documents. A document can also appear under multiple sub-topics, ensuring that it is not left out of search results.
  • Information Visualization is the visual representation of text data to communicate meaningful insights. It helps users analyze and reason about data and evidence.
  • Question Answering finds the best answers to a given question, allowing users to pose questions to the computer in natural language [12].
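The categorization task above, counting words and inferring the topic from a predefined relationship, can be sketched as follows. The topic keyword lists are illustrative, not a real clinical taxonomy:

```python
from collections import Counter

# Predefined topic -> keyword relationships (illustrative only).
topics = {
    "cardiology": {"stent", "coronary", "hypertension"},
    "pulmonology": {"fev1", "lung", "asthma"},
}

def categorize(document: str) -> str:
    """Count words, then pick the topic whose keywords occur most often."""
    words = Counter(document.lower().split())
    scores = {topic: sum(words[w] for w in kws) for topic, kws in topics.items()}
    return max(scores, key=scores.get)

doc = "Patient with hypertension received a coronary stent after angiography."
print(categorize(doc))
```

A real categorizer would learn the word-topic relationship from labelled training documents rather than hand-coding it, but the count-and-score mechanics are the same.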

Applications Of Text Analytics

There is vast potential for text analytics in healthcare. Generally, these can be grouped into the following: 

Treatment effectiveness: Text analytics is used to evaluate the effectiveness of medical treatments by comparing causes, symptoms and courses of treatment. It reports which courses of action prove effective and associates the various side-effects of treatment. It helps collate common symptoms to aid diagnosis, determine the most effective drug compounds for treating sub-populations, and identify proactive steps that can reduce the risk of affliction. For example, United Health Care analyzed its treatment record data to explore the outcomes of patient groups treated with different drug regimens for the same disease and determined ways to cut costs and deliver better medicine. It has also developed clinical profiles to give physicians information about their practice patterns and to compare these with those of other physicians and peer-reviewed industry standards [11].

In another example, the number of patients identified with PAD increased from fewer than 10,000 using a traditional approach to over 41,000 using text analysis [4].

Healthcare management: Text analytics is used to better identify and track chronic disease states and high-risk patients, design appropriate interventions and reduce the number of hospital admissions and claims. For example, the Arkansas Data Network looks at readmission and resource utilization and compares its data with current scientific literature to develop better diagnosis and treatment protocols. 

Patient profile analytics is used to identify individuals who would benefit from proactive care or lifestyle changes. For example, Group Health Cooperative groups its patient populations by demographic characteristics and medical conditions to determine which groups use the most resources, enabling it to develop programs to help educate these populations and prevent or manage their conditions. Blue Cross likewise uses analytics to improve outcomes and reduce expenditures through better disease management; for instance, it uses emergency department and hospitalization claims data, pharmaceutical records, and physician interviews to identify unknown asthmatics and develop appropriate interventions. Clinical call centers use text-based decision support to assist agents with complex patient scenarios in a scalable way [13].

Research & development: Analytics helps automate the discovery of the data elements essential to natural language processing models, which can be used to discover new treatments. It incorporates and evaluates the association of all available features to lower attrition and produce a leaner, faster, more targeted R&D pipeline in drugs and devices. This improves clinical trial design and patient recruitment, better matching treatments to individual patients and thereby reducing trial failures and speeding new treatments to market. The analysis also identifies follow-on indications and discovers adverse effects before products reach the market.

Customer relationship management: The text data is analyzed to determine the preferences, usage patterns, and current and future needs of individuals to improve their level of satisfaction through call centers, physicians’ offices, billing departments, inpatient settings, and ambulatory care settings. The data is also analyzed to predict if a consumer is likely to purchase, whether a patient is likely to comply with prescribed treatment or whether preventive care is likely to produce a significant reduction in future utilization. It can also help set reasonable expectations about waiting times, reveal possible ways to improve service, and provide knowledge about what patients want from their healthcare providers while promoting disease education, prevention, and wellness services.  For example, Customer Potential Management Corp. has developed a Consumer Healthcare Utilization Index that provides an indication of an individual’s propensity to use specific healthcare services, defined by 25 major diagnostic categories, selected diagnostic related groups or specific medical service areas[11].

Pharmaceutical companies also benefit by tracking which physicians prescribe which drugs and for what purposes. The companies also decide whom to target, show the least expensive or most effective treatment plan for an ailment, and help identify physicians whose practices are suited to specific clinical trials. They apply analytics to huge masses of genomic data to predict how a patient’s genetic makeup determines his or her response to a drug therapy[12]. 

Fraud and abuse are detected by analyzing text data to establish norms and then identify unusual or abnormal patterns of claims by physicians, laboratories, clinics or others. Among other things, the analysis highlights inappropriate prescriptions or referrals and fraudulent insurance and medical claims. For example, the Utah Bureau of Medicaid Fraud has analyzed data generated by millions of prescriptions, operations and treatment courses to identify unusual patterns and uncover fraud. This also yields cost savings; ReliaStar Financial Corp., for example, has reported a 20 percent increase in annual savings [11].

Sentiment analysis is done to create a picture of a patient's health status, medical conditions and treatment. Retrieving and evaluating real-time social data from online social networks such as Twitter, blogs and other social networking sites is used to gauge dynamic trends and public opinion. When applying existing sentiment lexicons, between 12% and 15% of terms in medical social media datasets were identified as sentiment terms, compared with only 5% to 11% opinionated terms in clinical narratives. This demonstrates the less subjective use of language in clinical narratives, requiring adaptations to existing methods of sentiment analysis [14]. A survey in the United States suggested that of the 85% of adult consumers who use the Internet, 25% have read another person's opinion online and 11% have consulted online user reviews of hospitals [15].
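A minimal lexicon-based sentiment scorer of the kind applied in such studies can be sketched as follows. The lexicon below is a tiny stand-in, not a real resource such as SentiWordNet, and the posts are invented:

```python
# Tiny stand-in sentiment lexicon: word -> polarity score.
lexicon = {"improved": 1, "relief": 1, "worse": -1, "pain": -1, "anxious": -1}

def sentiment(post: str) -> int:
    """Sum the polarity of every lexicon word found in the post."""
    tokens = post.lower().replace(",", " ").replace(".", " ").split()
    return sum(lexicon.get(t, 0) for t in tokens)

posts = [
    "Symptoms improved, great relief after the new treatment.",
    "Feeling worse, the pain keeps me anxious at night.",
]
scores = [sentiment(p) for p in posts]
print(scores)
```

The coverage gap the study reports (few lexicon terms matching clinical narratives) shows up directly here: a note written in neutral clinical language would simply score 0.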


Challenges

Text analytics can greatly benefit the healthcare industry. Nevertheless, there are a few challenges:

  • Unstructured data: Structured data is stored in predefined models, which makes it easy to query transactions, create relations between models, analyze, integrate with other sources and create reports. Conversely, unstructured data is more difficult to search, query, extract patterns from and integrate with other data sources; examples include emails, conversations and texts [9]. Unstructured data comprises about 80 percent of healthcare data, in forms such as handwritten physicians' notes, and cannot be accessed easily due to the lack of advanced technical skill sets and infrastructure.

Further, some 95% of this valuable clinical data goes unused by health systems, which hampers decision support systems and quality improvement applications. For example, in Indiana, a team of analysts failed to identify over 75 percent of patients with peripheral arterial disease (PAD) when searching the EMR and claims data [4].

  • Methodology and tools: The industry lacks the programming skills needed to manage unstructured data and extract textual data from various sources [8], and insufficient time, effort and money are invested by the industry [11].

Furthermore, one of the studies was unable to analyze terms and abbreviations used only in Korea due to the unavailability of tools for Korean; tools to analyze specific languages are expensive and labor-intensive to develop. Certain reports also contained incomplete sentences, so NLP tools could not be used [5]. The article notes further limitations of the study: the data warehouse does not record patients' visits to other hospitals for further treatment, resulting in under-reported outcomes and insufficient information on baseline characteristics, and the regular expressions may have introduced unknown errors, causing an imperfect performance score [5].


Trends

The power of text analytics in healthcare to identify innovative, timely and relevant use cases is undeniable. However, the papers did not discuss trends such as:

  • Big data technologies: for instance, NoSQL databases, distributed file stores, data virtualization, and search and knowledge-discovery tools that support self-service extraction of information and insights from unstructured data.
  • Latest techniques: for example, deep learning and word embeddings. Deep learning is a subfield of machine learning in which algorithms learn data representations. Word embedding maps words or sentences to vectors of real numbers that can be used in statistical methods.
  • Open source Natural Language Processing libraries: such as NLTK, Gensim and spaCy which are written in Python.
  • Standardization of clinical vocabulary and the sharing of data across organizations to enhance the benefits of healthcare data mining applications. A siloed approach to text analytics does not work. It is helpful to integrate validated text analytics with all other analytics within the organization.
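The word-embedding idea above, mapping words to vectors of real numbers, can be sketched with a count-based co-occurrence model. Real systems learn dense, low-dimensional vectors with neural methods such as word2vec or GloVe; the corpus here is a toy:

```python
import numpy as np

# Toy corpus: words sharing sentence contexts should end up similar.
corpus = [
    "stent placed in coronary artery",
    "coronary artery disease treated with stent",
    "asthma treated with inhaler",
]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within each sentence serve as raw word vectors;
# methods like SVD or word2vec compress these into dense embeddings.
M = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    ws = s.split()
    for a in ws:
        for b in ws:
            if a != b:
                M[idx[a], idx[b]] += 1

def cos(a: str, b: str) -> float:
    va, vb = M[idx[a]], M[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Words appearing in similar contexts get similar vectors.
print(round(cos("stent", "coronary"), 2), round(cos("stent", "inhaler"), 2))
```

Even this crude model places "stent" closer to "coronary" than to "inhaler", which is the property that makes embeddings useful for clinical text.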


Conclusion

Text analytics is helping organizations make data-driven decisions and adapt to answer new research questions. A broad, all-encompassing approach to text analytics is helping health systems achieve clinical and financial success, as it creates personalized recommendations, enhances understanding of high-risk patient populations, and improves outcomes. It is helping the healthcare industry invest in solutions that not only solve today's problems but can also be expanded to address future use cases.


References

[1]        W. Raghupathi and V. Raghupathi, “Big data analytics in healthcare: promise and potential,” Health Inf. Sci. Syst., vol. 2, 2014.

[2]        V. Gupta and G. S. Lehal, “A Survey of Text Mining Techniques and Applications,” Int. J. Comput. Sci. Eng., vol. 2, no. 6, 2010.

[3]        Laura Kelso, “DC/OS helps athenahealth build flexible and powerful cloud-based services,” Mesosphere, 2018. [Online]. Available: https://mesosphere.com/blog/dcos-athenahealth/. [Accessed: 24-Feb-2018].

[4]        Eric Just, “How to Use Text Analytics in Healthcare to Improve Outcomes—Why You Need More than NLP,” Health Catalyst, 2017. [Online]. Available: https://www.healthcatalyst.com/how-to-use-text-analytics-in-healthcare-to-improve-outcomes. [Accessed: 18-Feb-2018].

[5]        Y. S. Kim et al., “Extracting information from free-text electronic patient records to identify practice-based evidence of the performance of coronary stents,” PLoS One, vol. 12, no. 8, p. e0182889, Aug. 2017.

[6]        T. Delespierre, P. Denormandie, A. Bar-Hen, and L. Josseran, “Empirical advances with text mining of electronic health records,” BMC Med. Inform. Decis. Mak., vol. 17, no. 1, p. 127, Dec. 2017.

[7]        M. Hinchcliff, E. Just, S. Podlusky, J. Varga, R. W. Chang, and W. A. Kibbe, “Text data extraction for a prospective, research-focused data mart: implementation and validation,” BMC Med. Inform. Decis. Mak., vol. 12, no. 1, p. 106, Dec. 2012.

[8]        K. Srinivasa, N. Prasad, and S. Ramakrishna, “Text Analytics to Data Warehousing,” Int. J. Comput. Sci. Eng., vol. 2, no. 6, pp. 2201–2207, 2010.

[9]        E. F. Ewen, C. E. Medsker, L. E. Dusterhoft, K. Levan-Shultz, J. L. Smith, and M. A. Gottschall, “Data warehousing in an integrated health system,” in Proceedings of the 1st ACM international workshop on Data warehousing and OLAP  – DOLAP ’98, 1998, vol. Part F1292, pp. 47–53.

[10]      J. C. Prather, D. F. Lobach, L. K. Goodwin, J. W. Hales, M. L. Hage, and W. E. Hammond, “Medical Data Mining: Knowledge Discovery in a Clinical Data Warehouse,” vol. 4, no. SUPPL., pp. 101–105, 1997.

[11]      H. C. Koh and G. Tan, “Data Mining Applications in Healthcare,” J. Healthc. Inf. Manag., vol. 19, no. 2, 2005.

[12]      G. S. L. Vishal Gupta, “A Survey of Text Mining Techniques and Applications,” J. Emerg. Technol. WEB Intell., vol. 1, no. 1, p. 17, 2009.

[13]      Sandya Mannarswamy, “Healthcare Text Analytics | Xerox Research Centre India,” .xrci.xerox.com, 2014. [Online]. Available: http://www.xrci.xerox.com/healthcare-text-analytics. [Accessed: 18-Feb-2018].

[14]      K. Denecke and Y. Deng, “Sentiment analysis in medical settings: New opportunities and challenges,” Artif. Intell. Med., vol. 64, no. 1, pp. 17–27, May 2015.

[15]      D. M. Mathews and S. Abraham, “Analytic thinking of patients’ viewpoints pertain to spa treatment,” in 2017 International Conference on Networks & Advances in Computational Technologies (NetACT), 2017, pp. 235–241.
