Using OpenEHR as the Basis for a Data Warehouse?!?

Anyone au fait with openEHR knows the benefits of using its dual-modeling architecture to underpin an electronic health record. It is, after all, what it was designed for. But I recently came across a paper espousing the virtues of using openEHR as the basis for a data warehouse, which made me think about the wider attributes of the underlying reference model.

The shock and happiness of finding out openEHR can be used as a data warehouse

It's SNOW joke

The paper in question, Archetype-based data warehouse environment to enable the reuse of electronic health record data by Marco-Ruiz et al. (2015)[1], reflects on a real-world implementation of an openEHR-based clinical decision support system in Norway. Their primary goal was to reuse data held in multiple systems that, unfortunately, had been stored in legacy data formats. The challenge was to take this data, transform it and then make it available for the successor to a disease surveillance system called SNOW.

The authors give a great overview of the principal problems with semantic interoperability, and the need for common syntax and clinical definitions. The paper also introduced me to the concept of the "impedance mismatch"[2] that exists between the information model (i.e. the electronic health record) and the inference model. The latter is needed for decision support and data warehousing to understand clinical guidelines and protocols. An inference model is defined by Rector et al.[3] as:

models that encapsulate knowledge needed to derive the conclusions, decisions, and actions that follow from what is stored.

SNOW facilitated some clever use cases to support microbiology teams. Based on certain factors, staff were able to order additional tests for infectious diseases such as Norovirus if they believed that there was the beginning of an epidemic or some form of outbreak.

But the underlying architecture was problematic and a standards-based approach was sought. Traditional clinical information systems are not usually structured in a way that supports ad-hoc queries (with accompanying services), but openEHR is different thanks to the Archetype Query Language (AQL). This is part of the reason why the new data warehouse architecture was based on the openEHR standard.
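To give a flavour of what an ad-hoc AQL query looks like, here is a minimal sketch in Python that builds one and wraps it in a JSON request body. None of this comes from the paper: the archetype ID, the selected paths and the helper names are my own illustrative assumptions, and the `q` / `query_parameters` keys reflect my understanding of the openEHR REST query endpoint, so check them against your own CDR before relying on them.

```python
import json

# Illustrative archetype ID (an assumption, not taken from the paper).
LAB_RESULT_ARCHETYPE = "openEHR-EHR-OBSERVATION.laboratory_test_result.v1"


def build_lab_result_query(archetype_id: str = LAB_RESULT_ARCHETYPE) -> str:
    """Return AQL text that finds compositions containing a lab result."""
    return (
        "SELECT c/uid/value AS composition_id, "
        "c/context/start_time/value AS start_time "
        "FROM EHR e "
        "CONTAINS COMPOSITION c "
        f"CONTAINS OBSERVATION o[{archetype_id}] "
        "WHERE e/ehr_id/value = $ehr_id"
    )


def build_request_body(ehr_id: str) -> str:
    """Wrap the query and its parameter in a JSON body (assumed layout)."""
    body = {
        "q": build_lab_result_query(),
        "query_parameters": {"ehr_id": ehr_id},
    }
    return json.dumps(body)


if __name__ == "__main__":
    # "hypothetical-ehr-id" is a placeholder, not a real identifier.
    print(build_request_body("hypothetical-ehr-id"))
```

The point to notice is that the query is written against archetype paths rather than against tables, which is what lets the same query text be shared between openEHR systems.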

Getting Composed


The authors used LinkEHR as their "normalization platform" to transform data from the legacy source to the target archetype structures. Marand's Think!EHR was used for the persistent clinical data repository (CDR).

One of the interesting points they make is the importance of the clinical modeling. In most cases when looking to migrate legacy data, an Extract, Transform and Load (ETL) process is carried out. These steps refer to firstly getting the legacy data out, then converting it to a format or structure that fits the new system, and finally loading it into the new repository. But Marco-Ruiz and friends added modeling as a first stage of proceedings.

Of course this makes perfect sense, as the archetype structure needed to be clearly defined up front. But the authors go further and recommend that, where health data is concerned, we should consider adopting a new acronym, Model/Extract/Transform/Load (METL), as standard. Part of the reason for this, and for their support of the openEHR architecture, is that the reference model takes away a lot of the effort of defining a complex target schema in which the transformed data needs to reside.
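The ordering of the four METL steps can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the authors' LinkEHR pipeline: the target field names, the legacy column names and the in-memory list standing in for the repository are all hypothetical.

```python
# Model: the target fields are agreed before any data is touched.
# (A stand-in for an archetype or template; the names are hypothetical.)
TARGET_FIELDS = ("patient_id", "test_name", "result")


def extract(legacy_rows):
    """Extract: copy the rows out of the legacy source."""
    return [dict(row) for row in legacy_rows]


def transform(row, field_map):
    """Transform: rename legacy columns to the modelled target fields."""
    missing = [target for target in TARGET_FIELDS if target not in field_map]
    if missing:
        raise ValueError(f"mapping does not cover target fields: {missing}")
    return {target: str(row[field_map[target]]) for target in TARGET_FIELDS}


def load(records, repository):
    """Load: append the transformed records to the target repository."""
    repository.extend(records)
    return len(records)


def run_metl(legacy_rows, field_map, repository):
    """Run extract, transform and load against the pre-agreed model."""
    extracted = extract(legacy_rows)
    transformed = [transform(row, field_map) for row in extracted]
    return load(transformed, repository)


if __name__ == "__main__":
    legacy = [{"pid": 7, "analysis": "norovirus PCR", "res": "positive"}]
    mapping = {"patient_id": "pid", "test_name": "analysis", "result": "res"}
    repo = []
    print(run_metl(legacy, mapping, repo), repo)
```

Because the model exists first, a mapping that fails to cover it is rejected before anything is loaded, which is the practical benefit of putting the M in front of ETL.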

The paper describes a model based on one composition per patient and per microbiology request. So a single patient may have more than one request, but importantly the batches (or result profiles) are maintained. This translates roughly to:

                <BATTERY TESTS>
                    <TEST 1>
                    <RESULT DATA>
                    <TEST 2>
                    <RESULT DATA>
                    <TEST n>
                    <TEST n DATA>
                <PATIENT DETAILS>
                <TEST REQUESTER DETAILS>

Note: I've greatly simplified the above so please refer to the original article if you are interested in the actual template structure.
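For readers who think in code, the simplified structure above could be expressed as a couple of Python dataclasses. This is only a sketch of my simplification, not the template from the paper; the class names, the field names and the idea of nesting each result under its test are my own assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BatteryItem:
    """One test in the battery together with its result data."""
    name: str
    result: str


@dataclass
class MicrobiologyRequest:
    """One composition: a single request and its battery of tests."""
    patient_id: str
    requester: str
    battery: list[BatteryItem] = field(default_factory=list)


def requests_by_patient(requests):
    """Group compositions by patient, keeping each battery intact."""
    grouped = defaultdict(list)
    for request in requests:
        grouped[request.patient_id].append(request)
    return dict(grouped)


if __name__ == "__main__":
    first = MicrobiologyRequest("p1", "dr-a", [BatteryItem("norovirus PCR", "positive")])
    second = MicrobiologyRequest("p1", "dr-b", [BatteryItem("culture", "negative")])
    print(requests_by_patient([first, second]))
```

One patient ends up with a list of requests, and each request still carries its own batch of tests, which is the property the authors wanted to preserve.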


The paper has a lot of positive points to make on the newly developed architecture:

  • OpenEHR has innate flexibility which simplified the data warehouse implementation. The authors describe this as "agile aggregation of data sets".
  • AQL queries are universally accepted within openEHR systems; being able to share them is a benefit, as is being able to open the system up to other queries thanks to the standards-based approach.
  • The resulting models could be easily understood.
  • Other architectures, such as a purer ontological approach, were seen less favourably in comparison, especially considering most are not as standards compliant as openEHR.


It was not all plain sailing, however. The authors point out some issues that were encountered:

  • Data warehouse transactional control was difficult. If a load failed, it would need to be fully rolled back to ensure consistency with the inference model. This is less a concern for openEHR and more a reflection of the complexity of the transformation which the data were undergoing.
  • All but one of the rules from the SNOW system could be implemented in the new platform; the exception arose because historical checks for data could not be brought over. While I am not 100% sure, I think they refer to operational aspects of the previous SNOW system.
  • They found a lack of a common information model. As each legacy system had a slightly different mapping, there was duplication of processing. That said, this is common to ETL processes at large, and a symptom of legacy data and systems.
  • AQL was not found to be complete in terms of their needs in a query language. There are no GROUP or SUBQUERY operators available at the time of writing, and for some database queries they required a more powerful language such as SPARQL (although this was indicated as potentially requiring complex mappings).
  • Query times varied from acceptable to slow (although they hint that maybe they could have specified a more powerful server!). Improving this was indicated as future work.
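The transactional point in the list above is easier to picture with a small example. Below is a minimal sketch of an all-or-nothing batch load, using Python's built-in sqlite3 module as a stand-in for the repository. That substitution is entirely my own assumption: the paper used an openEHR CDR, not SQLite, and the table and column names are hypothetical.

```python
import sqlite3


def make_store():
    """Create an in-memory table standing in for the target repository."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results ("
        "patient_id TEXT NOT NULL, "
        "test_name TEXT NOT NULL, "
        "result TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def load_batch(conn, rows):
    """Load every row or none: a failure rolls the whole batch back."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.executemany(
                "INSERT INTO results (patient_id, test_name, result) "
                "VALUES (?, ?, ?)",
                rows,
            )
        return True
    except sqlite3.Error:
        return False


if __name__ == "__main__":
    store = make_store()
    print(load_batch(store, [("p1", "norovirus PCR", "positive")]))
    # The second row violates NOT NULL, so neither row of this batch is kept.
    print(load_batch(store, [("p2", "culture", "negative"), ("p3", None, "negative")]))
    print(store.execute("SELECT COUNT(*) FROM results").fetchone()[0])
```

The difficulty the authors describe is that their real loads passed through several transformation stages, so the equivalent of that single rollback had to span much more than one insert statement.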

Was It Worth It?

The authors state that the techniques used were mature enough to be easily integrated with new systems. In the round, they found a way to ensure semantic coherence for legacy data and support a new, operational platform in the process. It sounds pretty impressive.

Although openEHR as a basis for a data warehouse seems like an interesting proposition, I am not totally convinced. Without a doubt, a jewel in the crown of openEHR is the ability to use AQL to query across the longitudinal patient record as well as vertically through a composition. This is thanks to the reference model that binds these facets together. However, as Marco-Ruiz et al. state, the language is not necessarily robust enough for detailed data warehouse tasks.

For example, AQL will find it difficult to compete against the R language if you need to undertake some serious data science. There will undoubtedly be better tools available for these tasks. But regardless, being able to use AQL even just to prepare data for manipulation is a significant benefit over competing EHR technologies. This research just makes me think that there may be even more flexibility under the bonnet of openEHR than I already thought.
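As a small illustration of using AQL for the prep and another tool for the analysis, here is a sketch that does the grouping on the client side once the rows have come back. The row layout and field names are hypothetical; I am simply assuming the query results have already been turned into a list of dictionaries.

```python
from collections import Counter


def count_positives_by_test(rows):
    """Count positive results per test name from already-fetched rows."""
    return Counter(
        row["test_name"] for row in rows if row["result"] == "positive"
    )


if __name__ == "__main__":
    # Hypothetical rows, standing in for the result set of a query.
    fetched = [
        {"test_name": "norovirus", "result": "positive"},
        {"test_name": "norovirus", "result": "negative"},
        {"test_name": "norovirus", "result": "positive"},
        {"test_name": "influenza", "result": "positive"},
    ]
    print(dict(count_positives_by_test(fetched)))
```

This is the kind of aggregation the authors found they could not express in the query itself, so it has to happen downstream in whichever language you prefer.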

  1. Marco-Ruiz, L., Moner, D., Maldonado, J. A., Kolstrup, N., & Bellika, J. G. (2015). Archetype-based data warehouse environment to enable the reuse of electronic health record data. International Journal of Medical Informatics.

  2. Schadow, G., Russler, D. C., Mead, C. N., & McDonald, C. J. (2000). Integrating medical information and knowledge in the HL7 RIM. Proc. AMIA Annu. Symp., 764–768.

  3. Rector, A. L., Johnson, P. D., Tu, S., Wroe, C., & Rogers, J. (n.d.). Interface of Inference Models with Concept and Medical Record Models. Retrieved from