CC BY-NC-ND 4.0 · Appl Clin Inform 2024; 15(02): 234-249
DOI: 10.1055/a-2259-0008
State of the Art/Best Practice Paper

Simplifying Multimodal Clinical Research Data Management: Introducing an Integrated and User-friendly Database Concept

Anna Schweinar (1, 2), Franziska Wagner (1, 3), Carsten Klingner (1, 3), Sven Festag (4), Cord Spreckelsen (4), Stefan Brodoehl (1, 3)

1 Biomagnetic Center, University Hospital Jena, Friedrich Schiller University, Jena, Germany
2 Else Kröner Graduate School for Medical Students “JSAM,” Jena University Hospital, Jena, Germany
3 Department of Neurology, Jena University Hospital, Jena, Germany
4 Institute of Medical Statistics, Computer and Data Sciences, Jena University Hospital, Jena, Thüringen, Germany
Funding This work was supported by funding from the Foundation “Else Kröner-Fresenius-Stiftung” within the Else Kröner Graduate School for Medical Students “Jena School for Ageing Medicine (JSAM)” and Else Kröner Anti Age.
 

Abstract

Background Clinical research struggles to manage multimodal and longitudinal clinical data efficiently. In neuroscience especially, the volume of heterogeneous longitudinal data challenges researchers. Current research data management systems offer rich functionality, but their architectural complexity makes them difficult to install and maintain, and they require extensive user training.

Objectives The focus is the development and presentation of a data management approach specifically tailored for clinical researchers involved in active patient care, especially in the neuroscientific environment of German university hospitals. Our design considers the implementation of FAIR (Findable, Accessible, Interoperable, and Reusable) principles and the secure handling of sensitive data in compliance with the General Data Protection Regulation.

Methods We introduce a streamlined database concept, featuring an intuitive graphical interface built on Hypertext Markup Language revision 5 (HTML5)/Cascading Style Sheets (CSS) technology. The system can be deployed within local networks, for example, in Microsoft Windows 10 environments. Our design incorporates FAIR principles for effective data management. Moreover, we have streamlined data interchange through established standards like HL7 Clinical Document Architecture (CDA). To ensure data integrity, we have integrated real-time validation mechanisms that cover data type, plausibility, and Clinical Quality Language logic during data import and entry.

Results We have developed and evaluated our concept with clinicians using a sample dataset of subjects who visited our memory clinic over a 3-year period and collected several multimodal clinical parameters. A notable advantage is the unified data matrix, which simplifies data aggregation, anonymization, and export. This streamlines data exchange and enhances database integration with platforms like Konstanz Information Miner (KNIME).

Conclusion Our approach offers a significant advancement for capturing and managing clinical research data, specifically tailored for small-scale initiatives operating within limited information technology (IT) infrastructures. It is designed for immediate, hassle-free deployment by clinicians and researchers.

The database template and precompiled versions of the user interface are available at: https://github.com/stebro01/research_database_sqlite_i2b2.git.



Background and Significance

The volume of routinely collected data generated in clinical practice and trials presents multiple challenges for data management in terms of storage, processing, sharing, and protection.[1] [2] [3] [4] Thus, efficient research data management (RDM) is essential in modern medicine and research,[5] and provides researchers with access to routinely collected data. It provides the basis for scientific research, promotes scientific data publication, and increases reproducibility.[1] [5] [6] In general, RDM describes the organization, storage, preservation, and sharing of scientific data.

Especially in neuroscience, RDM is indispensable. The complexity and volume of heterogeneous (multimodal) neuroscience data (e.g., laboratory, microscopy, imaging, clinical examination results and scores, electrophysiological data, etc.) require good documentation, processing, and standardization. Many neurological diseases, especially neurodegenerative diseases, need follow-up observations. In the most common neurodegenerative disease, dementia, the diagnosis is made from multimodal data including neuropsychological testing, imaging, functional imaging, cerebrospinal fluid levels, etc.[7] Patients are evaluated annually, and documenting the large amount of longitudinal and multimodal data generated is challenging. With optimal RDM, the collected data are prepared for further relevant analysis methods and can be efficiently used for retrospective studies.[2]

Unfortunately, there is a large gap in RDM in practice, especially in the neurosciences. A recent online survey by the National Research Data Infrastructure (NFDI-Neuro) on the state of RDM clearly shows the existing problems.[1] Routinely collected clinical and scientific data are usually incomplete or not retrievable.[2] Knowledge of and adherence to data and metadata standards are often limited.[6] Many researchers and clinicians lack the time to invest in standards-compliant data processing and management. In addition, there is a lack of secure data management in terms of privacy and secure data sharing.[1]

How can the Gaps in Research Data Management be Closed to Ensure Optimal Data Storage, Processing, and Standardization?

Recognizing the urgent need for efficient and user-friendly data management systems tailored for clinicians working with multimodal and longitudinal data, our initial research aimed to assess the suitability of existing platforms. However, a gap immediately became apparent: the needs of clinical researchers were not being adequately met by current solutions. For example, while platforms such as Longitudinal Online Research and Imaging System (LORIS) offer rich feature sets, their complexity and steep learning curve make them unsuitable for smaller projects or for clinicians with limited technical skills. This discrepancy is exacerbated by the unstructured and haphazard data management methods commonly used in clinical settings, which often rely on rudimentary folder structures and Excel spreadsheets. Furthermore, existing platforms rarely address the specific challenges posed by the General Data Protection Regulation (GDPR) around sensitive data, or the ethical imperatives that require localized data storage and restricted access. Finally, the constraints of local IT infrastructures designed with data security in mind can make server-based solutions impractical for routine operations. These multiple, intertwined challenges served as the primary catalyst for our project, highlighting the urgent need for a system that reconciles usability with the complex ethical and regulatory landscape of clinical RDM.[8]



Objectives

Initially, we researched the most appropriate RDM solution for clinical researchers, particularly in neuroscience, where the need to capture a variety of data types and manage their longitudinal nature is paramount. However, we pinpointed a gap in existing solutions: they frequently fall short in catering to small-scale projects and offline capabilities, while also failing to balance ease of implementation with adherence to the Findable, Accessible, Interoperable, and Reusable (FAIR) principles[5] and the General Data Protection Regulation (GDPR).[9]

This prompted us to develop a specialized solution aimed at

  • storing diverse, longitudinal clinical data along with metadata;

  • prioritizing user-friendly design and intuitive data capture;

  • ensuring offline functionality and seamless IT integration;

  • focusing on small-scale projects while maintaining data interchangeability with other applications.



Methods

In the following section, we describe the steps for concept development, as shown in [Fig. 1].

Fig. 1 Flow chart illustrating the steps from the assessment of user and technical requirements, the definition of a concrete use case, the literature research, the development, and implementation of a solution concept to the validation of the result. RDM, research data management.

Step 1: Analysis of User Demands and Definition of Minimal Technical Requirements for Research Data Management

As a first step, we defined the user demands of our target group (clinical researchers who work in patient care on a daily basis) together with a set of minimum technical requirements, as follows.

User Demands

To define the requirements for a clinical research data storage system from a user perspective, we conducted a standardized expert telephone interview with a total of 41 (open and closed) questions (duration: approximately 35–45 min). We conducted a total of 22 expert interviews, 20 of which were finalized and analyzed for our study. Of the final 20 participants, 11 were aged between 25 and 34 years, 8 were between 35 and 44 years, and 1 participant was over 45 years. In terms of their professions, 14 were physicians, 2 were psychologists, and 4 were researchers; the interview is available in [Supplementary Table S1] (available in the online version only).

Table 1 User demands and technical requirements for the data management solution

| Description | Details on how to implement it |
| --- | --- |
| User demands | |
| Clear storage structure and easy access | Intuitive interface, structured storage, and efficient search functionality |
| Defined standards for data storage and export | Adoption of community standards, configurable export options; “no more data munging” |
| Rights management and data security | Role-based access control, encryption, pseudonymization techniques |
| Backup strategies, lifecycle management, and data duplication avoidance | Scheduled backups, data retention policies, and duplicate prevention algorithms |
| Accessibility from clinical workplace and home office | Responsive design, secure remote access, and offline functionality |
| Technical requirements | |
| FAIR principles: Findable, Accessible, Interoperable, Reusable data principles | Comprehensive metadata, persistent identifiers, standardized data formats |
| Multiple data types and extensibility: Support for diverse data types and ease of adding new data types | Modular architecture, customizable data type handling, and seamless integration of new types |
| Data security: Protect sensitive data from unauthorized access, modification, or disclosure | Role-based access controls, encryption, pseudonymization techniques |
| Data validation: Ensure data integrity and accuracy by validating input data | Input validation techniques at database schema and user interface levels |
| Duplicate prevention: Maintain data consistency by preventing duplicate records | Duplicate prevention algorithms and database constraints |
| HL7 integration: Facilitate seamless data exchange with other health care systems | Support HL7 standards for importing and exporting data |
| Offline functionality: Allow users to access and manage research data from within a local network/file system | Incorporate offline functionality without the need for a server configuration, etc. |

To avoid confusion, some items have been left duplicated between the user demands and the basic technical requirements (e.g., offline functionality, rights management).




Minimal Technical Requirements

Following the German Research Foundation guidelines, we established minimum requirements for an optimal RDM system.[10] [11]

In establishing specific criteria, we focused on elements such as FAIR principles, data security, validation, HL7 integration, and offline functionality. To bolster data interoperability and sustainability, we mandated the incorporation of extensive metadata, persistent identifiers, and standardized data formats, such as HL7.[12] Additionally, we aligned with clinical classification systems like Logical Observation Identifiers Names and Codes (LOINC) and Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT). For data security, we added role-based access control and potential pseudonymization techniques to the criteria. The specific criteria are listed in [Table 1] under the category of “Technical requirements.”
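As an illustration of the pseudonymization criterion named above, the following minimal sketch derives a stable pseudonym from a patient identifier with a keyed hash. It is not taken from the published tool: the function name `pseudonymize`, the key handling, and the example identifiers are assumptions made for illustration only.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from a patient identifier with HMAC-SHA256.

    The same identifier and key always yield the same pseudonym, so the
    records of one subject stay linkable across visits, while the original
    identifier cannot be read off the pseudonym.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key; in practice it would be stored separately from the data.
key = b"replace-with-a-locally-stored-secret"

p1 = pseudonymize("MRN-000123", key)  # hypothetical identifier
p2 = pseudonymize("MRN-000123", key)  # same subject, same pseudonym
p3 = pseudonymize("MRN-000124", key)  # different subject
```

A keyed hash is only one option; a lookup table kept by a trusted party would serve the same purpose and allows re-identification on request.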



Step 2: Use Case: Clinical Research Question with Multimodal Neuroscientific Data

Based on the defined user needs and technical requirements for RDM, we aimed to investigate how these criteria apply to a research question involving a large amount of multimodal data. We decided to use the retrospective clinical course data of patients who visited our memory clinic at the Department of Neurology during the period 2014 to 2022 with a diagnosis of mild cognitive impairment (MCI). For illustration, please refer to [Supplementary Fig. S1] (available in the online version only).



Step 3: Literature Research and Search for Suitable Existing Research Data Management Solutions

In our study, we conducted a literature search following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement,[13] focusing on “database solutions for RDM.” We included 25 tools, categorized into Electronic Data Capture (EDC), clinical trial data management systems (CTMS), specialized data management systems, and Electronic Laboratory Notebooks (ELN). Categorization is challenging because there is no universally accepted nomenclature, leading to the use of different terminologies in the field. We also distinguished between well-known and lesser-known EDC and CTMS. Our research included an extensive literature review to ensure that we covered prominent systems and presented a holistic view of these essential tools. [Table 2] lists the available tools with their technical requirements and a brief functional description.

Table 2

Overview of existing data management solutions based on literature research

Category

Name

Technical requirements

Functions

Renowned EDC/CTMS

REDCap (Harris et al 2009[15])

• Operating system: Windows, macOS, Linux, Unix

• Web server: Windows/IIS and Linux/Apache with PHP 7.2.5 and above

• MySQL installation

• SMTP email server: configuring PHP with an institutional SMTP server

• Secure web application

• Clinical and translational research support

• Create and manage online surveys and databases

• Support for multiple field types

• Built-in reporting and data export tools

OpenClinica (Cavelaars et al 2015[20])

• Operating system: Linux, Unix, or Windows

• Application server: Apache Tomcat

• PostgreSQL installation

• Language: JDK 7

• Open source software

• Data collection and management for research studies

• Built-in reporting and data export tools

• Integrates with visualization and analysis tools

• Allows its functionality to be extended

ClinCapture[34]

• Operating system: Windows, macOS, and Linux

• Internet connection

• Developed from OpenClinica (version 3.1.3) source code

• Modern web browsers

• Open source and cloud-based EDC system

• Data collection and management for research studies

• Drag and drop form design

• Allows its functionality to be extended

• Risk-based monitoring

• Integrates with visualization and analytics tools

MACRO EDC[35]

• Cloud-based or on-site installation

• Clinical data management software

• Data collection and management for research studies

• Drag and drop form design

• Built-in reporting and data export tools

• Integration with visualization and analysis tools

EZ-Entry (Gao et al 2008[36])

• Operating system: Microsoft Windows

• Microsoft SQL Server 7.0 installation

• Programming language: Microsoft Visual Basic 6.0

• Clinical data management software platform

• Data entry and capture

• eCRF design and management

• Built-in reporting and data export tools

Oracle Clinical[37]

Database:

• Oracle database

• Operating system: Oracle Linux 7.6, 7.7 and 8.1, Oracle Solaris SPARC 11.4, HP-UX Itanium 11.31, Microsoft Windows Server 2019

Application:

• Operating system: Microsoft Windows Server 2016 and 2019 Standard

• Other Oracle-specific applications

Client:

• Operating system: Microsoft Windows 10, iOS 15.0 (iPad), macOS Big Sur version 11.6

• Clinical data management system

• Data collection and management for research studies

• eCRF generation

• Integration with visualization and analysis tools

TrialDB (Brandt et al[38])

• Web-based database system for clinical trials

• Open source

• Support for web-based interfaces

• Supports multiple field types

• Electronic case report form (eCRF) generation

Lesser-known EDC/CTMS

LORIS (Longitudinal Online Research and Imaging System; Das et al 2012[8])

• Linux-/Ubuntu-based operating system, PHP support

• Web server running NGINX or Apache

• MySQL installation

• Network access and firewall configurations

• Web-based platform for easy access from anywhere

• Open source

• Built-in data entry and validation tools

• Longitudinal data tracking and robust user management

• Integrates with popular neuroimaging processing tools

FIMED (Hurtado et al 2021[39])

• Operating system: Windows, Linux, and macOS

• Web server: Apache or Nginx

• Database: developed in MongoDB (JSON Code)

• Open source clinical data collection software

• Collect and manage multidimensional clinical research data

• eCRF generation

• Integrated analysis and visualization tools

• Ability to extend functionality

DADOS-Prospective (Nguyen et al 2006[40])

• Operating systems: Linux and Windows

• Programming language: Java

• Other requirements: Tomcat 5.x

• Open source web-based prospective data collection application

• Prospective data collection and management

• eCRF creation

• Secure patient information

PhOsCo (Venizeleas et al[41])

• Operating system: Platform-independent (Linux, Windows, AS400)

• Implementation: client–server operation

• Language: Java

• Open source software

• Data collection and management for research studies

• Allows you to extend its functionality

• Hybrid communication with local database synchronization when online

Database management system (Lee et al 2015[42])

• MySQL and phpMyAdmin Installation

• Central machine server

• Client GUI: Microsoft Access

• Database management system for interventional oncology clinical research

• Data collection, organization and reporting

• Built-in reporting and data export tools

• Costing tools integration

Phynx: an open-source software solution (Egbring et al 2010[43])

• Linux Virtual Amazon Machine Image (AMI; Setup: Amazon Elastic Compute Cloud)

• MySQL Installation

• Server running Apache Tomcat

• Open source software to support data management and web-based patient-level data

• Creation and verification of electronic patient profiles from medical databases

• Verification and validation process

Specialized data management tools

SEEK (Wolstencroft et al 2015[44])

• Operating systems: Linux, Mac OS X, VirtualBox-based virtual machine

• Other requirements: Ruby on Rails environment - version 3.2.x

• Programming language: Ruby 2.1.x, Java 6 or 7 (including OpenJDK)

• Open source web platform

• Supports management, sharing and exploration of data and models in systems biology

• Integration of model simulations, representation of experimental data

• Supports data annotation and standardization

openBIS (Bauch et al 2011[45])

• Operating system: Unix-like operating system such as Linux

• Java Runtime Environment 1.6, PostgreSQL 9.0 or 9.1 installed

• License: Apache Software License 2.0

• Programming language: Java, Jython

• Open source framework for managing and analyzing complex biological research data

• Integrates with public databases

• For collecting, integrating, sharing, and publishing data

• Integrates with visualization and analysis tools

ISA (Sansone et al 2012[46])

• Various tools

• Open source frameworks and tools

• Manage life science, environmental, and biomedical experiments

• Provides rich description of experiment metadata

• Store and manage experiment metadata

• Integrates with analysis tools

• Sharing and publishing tools

BioBankWarden (Ferretti et al 2017[47])

• Server Apache HTTP in Ubuntu Server version 14.04

• PostgreSQL (version 9.0) installation

• Web-based system for translational cancer research

• Manage, store, and integrate clinical, biomolecular, and biomaterial cancer data

• Manages disease research groups and research projects

• Integrates with analysis tools

WebBioBank (Rossi et al 2014[48])

• Server: Microsoft SQL Server

• Web-based systems for research data collection and biosignal analysis (Parkinson's disease)

• Used in multicenter studies to integrate clinical and physiological data

• Integration of analysis tools

• Data export tools

DataLad[14]

• Operating Systems: Windows, Linux, and macOS

• Python, git-annex, Git

• Open source and free data management system

• Data tracking and structuring

• Collaboration support

ELN

LabArchives[49]

• Cloud-based software

• Operating system: iOS Mobile App version 3.0.2: iOS 11 or higher, Android Mobile App version 3.0.0b104: Android OS 5 or higher

• Browser: Chrome, Firefox, Safari, Microsoft Edge

• Create and save electronic lab notes

• Categorize and search research data

• Free plan for small teams

• Supports document versioning and revision control

• Integrates with visualization tools

• Integrates with laboratory instruments and devices

Labguru[50]

• Cloud-based software

• Operating System: macOS 10 and upper versions, and Windows 10

• Browsers: Apple® Safari®, Mozilla® Firefox®, Google Chrome™, and Microsoft Edge®

• Web-based ELN for experiment data and observations

• Inventory management of lab supplies and reagents

• Experiment and protocol management

• Collaboration tools for team work

• Integration with scientific software and instruments

• Compliance and security features

SciNote[51]

• Cloud-based software

• Operating System: Android 5.0 (API level 21) and later, iOS version 11 and later

• Browser: Chrome, Firefox, Safari

• Leverages Docker technology

• Runs on PostgreSQL and MS SQL

• ELN for managing laboratory data and processes

• Integration with scientific software and instrumentation

• Collaboration tools for teamwork

• FDA 21 CFR Part 11 compliance

• Integration with visualization tools

Elabjournal[52]

• Cloud-based software

• Operating system: iOS and Android

Local installation:

• Application server: Microsoft Windows Server 2019

• Database server: Ubuntu Server 18.04 LTS

• Microsoft Office Online Server (OOS): Microsoft Windows Server 2019

• Reverse proxy/load balancer and firewall

• ELN for managing laboratory data and processes

• Integrates with data export tools

• Integrates with visualization tools

• Integrates with other laboratory tools and software

RSpace[53]

RSpace Components

• Java Web Application

• Apache Tomcat or similar web application

• MariaDB database server

• Database initialization script

Web Server

• Ubuntu 20.04 LTS or 22.04 LTS or Debian 12

• Java SDK version 17

• Apache Web Server v2.4.

• Apache Tomcat 9.x

• MariaDB 10.3.

Hardware

• Linux-based server with at least 8 GB RAM

End user browser requirements

• Safari, Firefox and Chrome

• ELN for researchers to create and manage lab notes and research data

• Integration with other laboratory tools and software

• Collaboration tools for working with teams

• Built-in reporting and data export tools

eLabFTW[54]

• Cloud-based available

• Operating System: Windows, Mac OS X, Linux, BSD, Solaris, etc

• Must be installed on a server

• Containerization technology like Docker or Podman

• MySQL database

• Open-source, web-based lab notebook for recording and managing experiment data

• Built-in reporting and data export tools

Abbreviations: CTMS, clinical trial data management systems; eCRF, electronic case report form; EDC, electronic data capture; ELN, electronic laboratory notebooks; GUI, graphical user interface.




Step 4: Applying Research Data Management Criteria and Evaluation: Towards a Custom Solution

We conducted an internal review of existing solutions to assess their suitability for our specific needs. While many of these platforms offer a broad range of features and applications, they frequently fell short in several key areas. These included data storage options (either cloud-based or requiring a local server) and ease of implementation and administration. Even seemingly straightforward solutions like DataLad[14] or research electronic data capture (REDCap)[15] come with their own challenges, such as the need for administrative rights and complex setup processes. Additionally, cost considerations were a factor. We also noted that many of these tools lack support for clinical classification systems, which would enable the use of international standards for data capture, despite the feasibility of implementing such features. A detailed rationale for not selecting the tools under consideration is given in the “Results” section.

This outcome prompted the development of our custom solution, tailored to meet our defined requirements. SQLite was chosen as a serverless and self-contained database management system. It is well-suited for small applications and research projects due to its lightweight and portable design. The entire database is stored in a single file, making it easy to manage, back up, and share. SQLite, compatible across platforms and adhering to the atomicity, consistency, isolation, and durability (ACID) standard,[16] is a continuously evolving open-source project supported by the community.[17] Leveraging HTML5/CSS for front-end development, we utilized Electron to create a cross-platform desktop application, ensuring consistent user experience and simplifying development and maintenance due to the extensive use and documentation of these web technologies.
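The single-file and ACID properties mentioned above can be illustrated with Python's standard `sqlite3` module. This is a minimal sketch rather than code from the published tool (whose front-end is built with Electron); the file names and the `subject` table are hypothetical.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "research.db")             # hypothetical file name
backup_path = os.path.join(workdir, "research_backup.db")  # hypothetical file name

# Opening the file is all the "installation" that is needed: no server process.
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE subject (id INTEGER PRIMARY KEY, code TEXT UNIQUE NOT NULL)")
con.execute("INSERT INTO subject (code) VALUES ('S001')")
con.commit()

# Atomicity: the second insert violates the UNIQUE constraint, so the whole
# transaction, including the otherwise valid 'S002' row, is rolled back.
try:
    with con:  # commits on success, rolls back if an exception is raised
        con.execute("INSERT INTO subject (code) VALUES ('S002')")
        con.execute("INSERT INTO subject (code) VALUES ('S001')")
except sqlite3.IntegrityError:
    pass

row_count = con.execute("SELECT COUNT(*) FROM subject").fetchone()[0]

# Backup: copy the live database into a second single file.
dest = sqlite3.connect(backup_path)
con.backup(dest)
backup_count = dest.execute("SELECT COUNT(*) FROM subject").fetchone()[0]
dest.close()
con.close()
```

Because the whole database is one file, sharing or archiving it amounts to copying that file, subject to the access restrictions of the drive it is stored on.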



Step 5: Implementation of the Database Design

We have based our design on the i2b2 Common Data Model (CDM)[18] star schema, prevalent for its efficient querying and analysis of clinical research data. This model enhances our database by offering scalability to accommodate growing clinical research datasets. Despite relying on join operations, its normalized structure ensures data consistency, efficient storage, and easier updates, thereby preventing data redundancy. Crucially, its global, unified representation accelerates data access and expands analytical capabilities across multiple subject areas by eliminating the need for separate tables.
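To make the star schema more concrete, the sketch below builds a heavily reduced version of it in an in-memory SQLite database and runs one longitudinal query. The table and column names follow the public i2b2 model, but only a small subset of columns is shown, the authors' actual template may differ, and the concept code `EX:MOCA_TOTAL` and all values are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PATIENT_DIMENSION (
    PATIENT_NUM INTEGER PRIMARY KEY,
    SEX_CD      TEXT,
    BIRTH_DATE  TEXT
);
CREATE TABLE VISIT_DIMENSION (
    ENCOUNTER_NUM INTEGER PRIMARY KEY,
    PATIENT_NUM   INTEGER NOT NULL REFERENCES PATIENT_DIMENSION(PATIENT_NUM),
    START_DATE    TEXT
);
CREATE TABLE CONCEPT_DIMENSION (
    CONCEPT_CD TEXT PRIMARY KEY,
    NAME_CHAR  TEXT
);
-- Central fact table: one row per observation, pointing at the dimensions.
CREATE TABLE OBSERVATION_FACT (
    ENCOUNTER_NUM INTEGER NOT NULL REFERENCES VISIT_DIMENSION(ENCOUNTER_NUM),
    PATIENT_NUM   INTEGER NOT NULL REFERENCES PATIENT_DIMENSION(PATIENT_NUM),
    CONCEPT_CD    TEXT    NOT NULL REFERENCES CONCEPT_DIMENSION(CONCEPT_CD),
    START_DATE    TEXT,
    VALTYPE_CD    TEXT,   -- 'N' for numeric values, 'T' for text values
    NVAL_NUM      REAL,
    TVAL_CHAR     TEXT
);
""")

# Invented example data: one subject, two annual visits, one score per visit.
con.execute("INSERT INTO PATIENT_DIMENSION VALUES (1, 'F', '1950-06-01')")
con.executemany("INSERT INTO VISIT_DIMENSION VALUES (?, ?, ?)",
                [(10, 1, "2020-03-01"), (11, 1, "2021-03-05")])
con.execute("INSERT INTO CONCEPT_DIMENSION VALUES ('EX:MOCA_TOTAL', 'MoCA total score')")
con.executemany(
    "INSERT INTO OBSERVATION_FACT VALUES (?, ?, ?, ?, 'N', ?, NULL)",
    [(10, 1, "EX:MOCA_TOTAL", "2020-03-01", 24.0),
     (11, 1, "EX:MOCA_TOTAL", "2021-03-05", 22.0)],
)

# Longitudinal course of one concept for one subject: fact table joined
# with the concept dimension, ordered by observation date.
rows = con.execute("""
    SELECT f.START_DATE, c.NAME_CHAR, f.NVAL_NUM
      FROM OBSERVATION_FACT AS f
      JOIN CONCEPT_DIMENSION AS c ON c.CONCEPT_CD = f.CONCEPT_CD
     WHERE f.PATIENT_NUM = 1
     ORDER BY f.START_DATE
""").fetchall()
```

Every modality (scores, laboratory values, imaging findings) lands in the same fact table, so adding a new data type means adding a row to the concept dimension rather than creating a new table.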



Step 6: Implementation of Data Security

GDPR compliance was a key focus during development, necessitating data security at two levels. First, access to the SQLite database (DB) is controlled by assigned read/write permissions within the local or network drive. Second, the user interface (UI) utilizes a JavaScript class to dynamically generate user-specific views and handle “Create, Read, Update, Delete” operations, thereby ensuring secure management of database entries. To comply with the GDPR, all data and data operations are tagged with a specific date, ensuring that research data are only retained for as long as is necessary to fulfill their intended purpose. At the end of this period, or at the data subject's request, the data can be securely deleted, in line with the GDPR's “right to be forgotten” provision.
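The date tagging and deletion described above could look roughly as follows. This is a sketch under stated assumptions, not the published implementation: the reduced `OBSERVATION_FACT` table, the function names `purge_expired` and `erase_subject`, and the 10-year retention period are all hypothetical.

```python
import sqlite3
from datetime import date, timedelta

con = sqlite3.connect(":memory:")
# SQLite pragma that overwrites deleted content with zeros in the file.
con.execute("PRAGMA secure_delete = ON")
con.execute("""
    CREATE TABLE OBSERVATION_FACT (
        PATIENT_NUM INTEGER NOT NULL,
        CONCEPT_CD  TEXT    NOT NULL,
        NVAL_NUM    REAL,
        IMPORT_DATE TEXT    NOT NULL  -- ISO date, set when the row is stored
    )
""")
con.executemany(
    "INSERT INTO OBSERVATION_FACT VALUES (?, ?, ?, ?)",
    [(1, "EX:SCORE", 24, "2012-01-15"),
     (1, "EX:SCORE", 22, "2023-01-20"),
     (2, "EX:SCORE", 27, "2023-02-01")],
)
con.commit()


def purge_expired(con: sqlite3.Connection, today: date, retention_years: int) -> int:
    """Delete rows stored before the retention cutoff; return the number removed."""
    # Approximate cutoff (365 days per year); ISO dates compare correctly as text.
    cutoff = (today - timedelta(days=365 * retention_years)).isoformat()
    cur = con.execute("DELETE FROM OBSERVATION_FACT WHERE IMPORT_DATE < ?", (cutoff,))
    con.commit()
    return cur.rowcount


def erase_subject(con: sqlite3.Connection, patient_num: int) -> int:
    """Delete all rows of one data subject; return the number removed."""
    cur = con.execute("DELETE FROM OBSERVATION_FACT WHERE PATIENT_NUM = ?", (patient_num,))
    con.commit()
    return cur.rowcount


purged = purge_expired(con, today=date(2024, 1, 1), retention_years=10)
erased = erase_subject(con, patient_num=2)
remaining = con.execute("SELECT COUNT(*) FROM OBSERVATION_FACT").fetchone()[0]
```

In a full schema the erasure would also have to cover the dimension tables and any backups that still contain the subject's data.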



Step 7: Implementation of Data Quality Control

To reduce erroneous data entry and improve data quality, we implemented type validation and duplicate detection methods directly in the input and import functions within the UI. In addition, we implemented logic-based rules based on Clinical Quality Language (CQL)[19] that provide direct user feedback on any rule violations during data entry and import. Integrating CQL using the CQL Framework (GitHub repository: https://github.com/cqframework) into our research database involves a two-step process. Firstly, CQL statements are transformed into Expression Logical Model (ELM) representations in JSON. Secondly, the converted rule is interpreted and executed through JavaScript code. [Supplementary Fig. S2] (available in the online version only) provides a graphical illustration of the process.
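The three layers of checking described above (data type, duplicates, and logic rules) can be sketched in a few lines. Note the substitution: plain Python predicates stand in for the CQL/ELM rules, which the authors execute with JavaScript. The concept codes, plausibility ranges, and the names `Observation`, `CONCEPTS`, `RULES`, and `validate` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    patient_num: int
    concept_cd: str
    start_date: str  # expected as ISO date, e.g. "2021-03-05"
    value: object


# Hypothetical concept definitions: expected type and plausible range.
CONCEPTS = {
    "EX:MOCA_TOTAL": {"type": "numeric", "min": 0, "max": 30},
    "EX:DIAGNOSIS": {"type": "text"},
}


def _is_iso_date(text: object) -> bool:
    try:
        date.fromisoformat(text)
        return True
    except (TypeError, ValueError):
        return False


# Stand-ins for logic rules; each pairs a message with a predicate.
RULES = [
    ("observation date must be a valid ISO date", lambda o: _is_iso_date(o.start_date)),
]


def validate(obs: Observation, seen: set) -> list:
    """Return a list of error messages; an empty list means the entry is accepted."""
    spec = CONCEPTS.get(obs.concept_cd)
    if spec is None:
        return [f"concept: unknown code {obs.concept_cd}"]
    errors = []
    if spec["type"] == "numeric":
        if isinstance(obs.value, bool) or not isinstance(obs.value, (int, float)):
            errors.append("type: expected a number")
        elif not spec["min"] <= obs.value <= spec["max"]:
            errors.append("plausibility: value outside the expected range")
    elif spec["type"] == "text" and not isinstance(obs.value, str):
        errors.append("type: expected text")
    key = (obs.patient_num, obs.concept_cd, obs.start_date)
    if key in seen:
        errors.append("duplicate: same subject, concept, and date already stored")
    for message, predicate in RULES:
        if not predicate(obs):
            errors.append("rule: " + message)
    if not errors:
        seen.add(key)  # remember accepted entries for duplicate detection
    return errors


seen_keys = set()
first = validate(Observation(1, "EX:MOCA_TOTAL", "2021-03-05", 24), seen_keys)
repeat = validate(Observation(1, "EX:MOCA_TOTAL", "2021-03-05", 24), seen_keys)
implausible = validate(Observation(2, "EX:MOCA_TOTAL", "2021-03-05", 42), seen_keys)
```

Returning messages instead of raising exceptions mirrors the direct user feedback described in the text: the interface can show every violation of an entry at once.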



Step 8: Designing the Standard Views for the Graphical User Interface (Front-end)

During development, we delineated a data pathway comprising (1) new subject entry, (2) new visit creation, and then either (3) individual visit observations or (4) fixed observations within a visit, resembling a case report form (CRF). [Fig. 2] provides an illustration of this data path alongside the final UI.

Fig. 2 Illustration of the main views for entering clinical research data into the database using the UI. UI, user interface.
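The data pathway (subject, then visit, then either free or CRF-like fixed observations) can be pictured as a small object model. The sketch below is purely illustrative and does not reproduce the front-end code; the class names, the template `CRF_TEMPLATE`, and the concept codes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    concept_cd: str
    value: object


@dataclass
class Visit:
    visit_date: str
    observations: list = field(default_factory=list)


@dataclass
class Subject:
    code: str
    visits: list = field(default_factory=list)


# A CRF-like template: the fixed set of concepts expected at every visit.
CRF_TEMPLATE = ("EX:MOCA_TOTAL", "EX:MMSE_TOTAL", "EX:DIAGNOSIS")


def missing_crf_items(visit: Visit, template=CRF_TEMPLATE) -> list:
    """List the template concepts that have not yet been recorded for a visit."""
    recorded = {obs.concept_cd for obs in visit.observations}
    return [code for code in template if code not in recorded]


# (1) new subject, (2) new visit, (3) one individual observation ...
subject = Subject(code="S001")
visit = Visit(visit_date="2021-03-05")
subject.visits.append(visit)
visit.observations.append(Observation("EX:MOCA_TOTAL", 24))

# ... and (4) the fixed observations of the template that are still open.
still_open = missing_crf_items(visit)
```

A template of this kind lets the interface present the same form at every visit while still allowing additional, unplanned observations.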


Step 9: Validation of Data Consistency and User Satisfaction Survey

For data validation during import and manual data entry, we utilized the data collected as part of our use case (Step 2). We progressively compared the representation of the data in the database with the collected data in terms of concept, data type, values, and temporal alignment, and made step-by-step adjustments. A detailed description of this can be found in the “Results” section and in [Fig. 3].

Fig. 3 Step-by-step import process demonstrating the reduction in error rates throughout each stage of the data import procedure. CQL, Clinical Quality Language.

To evaluate user experience, we conducted a streamlined questionnaire-based survey, focusing on our DB front-end (see [Supplementary Table S2] [available in the online version only]). After a 30-minute introduction and an hour of independent work with the UI, 10 participants responded to 18 questions on a 1 to 5 agreement scale.



Results

This project involved developing an SQLite-based research database template and a user-friendly front-end for data entry. The database template, along with precompiled front-ends for Windows, macOS, and Linux, is available in the GitHub repository: https://github.com/stebro01/research_database_sqlite_i2b2.git.

In this section, we present (1) expert interview findings from the demand analysis, (2) DB structure, (3) implementation of data validation tools using real-time clinical data, (4) user feedback on our solution, and finally, (5) specific use cases for our solution.

Expert Interview Findings from the Demand Analysis

All respondents were directly involved in the design and conduct of experimental and clinical trials. Their level of experience was subjectively considered to be at least intermediate. The essential requirements identified by the respondents are presented in [Table 1]. A key finding from the interviews was that none of the respondents used a standardized method or system for managing clinical research data. All participants expressed a strong interest in a software solution easily implemented within their existing IT infrastructure.



Evaluation of Existing Tools and Rationale for Selection

To assess existing solutions, we conducted comprehensive research, primarily focusing on EDC tools; the findings are summarized in [Table 2]. While we identified several potential candidates, the following outlines our reasoning against their selection.

REDCap,[15] for instance, is a renowned tool for capturing and managing data, particularly in large clinical studies. However, it falls short on several of our criteria. First, REDCap operates as a server-based solution with cloud storage, which contradicts our requirement for offline capabilities. Second, any changes to the data structure require approval from the REDCap team, limiting flexibility. Last, it lacks built-in support for clinical classification systems like SNOMED-CT and LOINC and has a steep learning curve. DataLad,[14] on the other hand, is a free, open-source platform allowing users to manage data on their local machines. While it offers advanced features like data versioning and structured storage, its technical demands and command line-based features make it less user-friendly. Moreover, it lacks specialized support for clinical classification systems. OpenClinica[20] is another notable tool that offers both free and paid versions. We ruled it out primarily due to the lack of explicit support for clinical classification systems and its IT requirements.



DB Structure and Features

DB Structure

Built upon the i2b2 CDM[18] star schema, our database structure, detailed in [Supplementary Table S3] [available in the online version only], integrates specific tables, views, and triggers. A comprehensive technical description is available in both the [Supplementary Materials] and the associated GitHub repository.

Utilizing the i2b2 schema, our data structure centers around the OBSERVATION_FACT table, with extensions for auditing purposes, such as IMPORT_DATE, DOWNLOAD_DATE, UPDATE_DATE, and UPLOAD_ID. In the current release, auditing primarily involves timestamping new data (via IMPORT_DATE) and tagging it with the creator's ID (UPLOAD_ID), as well as logging any data modifications (UPDATE_DATE). We have designed a table called “NOTE_FACT” intended for more comprehensive auditing, which would capture differential data changes, data deletions and restorations, and database query access. This feature is currently inactive in the UI but is on the roadmap for future releases.
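
The audit columns described above can be illustrated with a minimal sketch. It assumes a strongly simplified OBSERVATION_FACT table; the surrogate key OBSERVATION_ID, the trigger name, and the sample values are hypothetical and are not taken from the published template, which may implement auditing differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OBSERVATION_FACT (
    OBSERVATION_ID INTEGER PRIMARY KEY,             -- hypothetical surrogate key
    PATIENT_NUM    INTEGER NOT NULL,
    ENCOUNTER_NUM  INTEGER NOT NULL,
    CONCEPT_CD     TEXT    NOT NULL,
    NVAL_NUM       REAL,
    IMPORT_DATE    TEXT DEFAULT CURRENT_TIMESTAMP,  -- set once when the row is created
    UPDATE_DATE    TEXT,                            -- set by the trigger below
    UPLOAD_ID      INTEGER                          -- ID of the user who created the row
);

-- Assumed trigger: stamp UPDATE_DATE whenever the stored value changes.
CREATE TRIGGER trg_observation_update
AFTER UPDATE OF NVAL_NUM ON OBSERVATION_FACT
BEGIN
    UPDATE OBSERVATION_FACT
    SET UPDATE_DATE = CURRENT_TIMESTAMP
    WHERE OBSERVATION_ID = NEW.OBSERVATION_ID;
END;
""")

# A new observation is timestamped and tagged with the creator's ID.
conn.execute(
    "INSERT INTO OBSERVATION_FACT (PATIENT_NUM, ENCOUNTER_NUM, CONCEPT_CD, NVAL_NUM, UPLOAD_ID) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, 1, "LOINC:2345-7", 5.4, 42),  # illustrative values only
)
query = "SELECT IMPORT_DATE, UPDATE_DATE, UPLOAD_ID FROM OBSERVATION_FACT WHERE OBSERVATION_ID = 1"
before = conn.execute(query).fetchone()

# Modifying the value fires the trigger and logs the modification time.
conn.execute("UPDATE OBSERVATION_FACT SET NVAL_NUM = 5.6 WHERE OBSERVATION_ID = 1")
after = conn.execute(query).fetchone()
```

In this sketch, UPDATE_DATE stays empty until the first modification, so unchanged rows can be told apart from edited ones.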

Data Concepts

Data points are standardized in the CONCEPT_DIMENSION table, utilizing globally recognized classifications like the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), LOINC, and SNOMED-CT for interoperability, promoting data consistency and exchangeability across health care systems.[21]



User Management

We manage user access via the USER_MANAGEMENT and USER_PATIENT_LOOKUP tables, controlling data access and permissions, ensuring data are only accessible to authorized users.
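
How such a lookup table can restrict access is sketched below. The column names, the ROLE values, and the helper `visible_patients` are assumptions made for illustration; only the table names USER_MANAGEMENT and USER_PATIENT_LOOKUP come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATIENT_DIMENSION (PATIENT_NUM INTEGER PRIMARY KEY, PATIENT_CD TEXT);
CREATE TABLE USER_MANAGEMENT (USER_ID INTEGER PRIMARY KEY, USER_NAME TEXT, ROLE TEXT);
CREATE TABLE USER_PATIENT_LOOKUP (USER_ID INTEGER, PATIENT_NUM INTEGER);

INSERT INTO PATIENT_DIMENSION VALUES (1, 'S001'), (2, 'S002'), (3, 'S003');
INSERT INTO USER_MANAGEMENT VALUES (10, 'alice', 'user'), (11, 'bob', 'user'), (99, 'root', 'admin');
INSERT INTO USER_PATIENT_LOOKUP VALUES (10, 1), (10, 2), (11, 3);
""")


def visible_patients(conn, user_id):
    """Return the subject codes a user may see (hypothetical helper)."""
    row = conn.execute(
        "SELECT ROLE FROM USER_MANAGEMENT WHERE USER_ID = ?", (user_id,)
    ).fetchone()
    if row is None:
        return []  # unknown users see nothing
    if row[0] == "admin":
        # Assumed rule: administrators can collate all subjects.
        cursor = conn.execute(
            "SELECT PATIENT_CD FROM PATIENT_DIMENSION ORDER BY PATIENT_NUM"
        )
    else:
        # Regular users only see subjects linked to them in the lookup table.
        cursor = conn.execute(
            "SELECT p.PATIENT_CD FROM PATIENT_DIMENSION p "
            "JOIN USER_PATIENT_LOOKUP l ON l.PATIENT_NUM = p.PATIENT_NUM "
            "WHERE l.USER_ID = ? ORDER BY p.PATIENT_NUM",
            (user_id,),
        )
    return [r[0] for r in cursor]


alice_subjects = visible_patients(conn, 10)
admin_subjects = visible_patients(conn, 99)
```

Note that a filter like this only takes effect when access goes through the front-end; anyone who can open the SQLite file directly bypasses it.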



Clinical Quality Language

To facilitate quality management, a dedicated CQL_FACT table stores CQL[19] rules and their JSON ELM representations. The CONCEPT_CQL_LOOKUP table links these rules to clinical concepts in the CONCEPT_DIMENSION.
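
The relationship between these tables can be sketched as a simple join. All column names, the concept code DEMO:MMSE_TOTAL, the CQL text, and the truncated ELM document are placeholders chosen for illustration. The sketch only retrieves the rules linked to a concept; evaluating them would require a CQL engine, which is not shown.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CONCEPT_DIMENSION (CONCEPT_CD TEXT PRIMARY KEY, NAME_CHAR TEXT);
CREATE TABLE CQL_FACT (
    CQL_ID    INTEGER PRIMARY KEY,
    CODE_CD   TEXT,   -- hypothetical short rule name
    CQL_CHAR  TEXT,   -- rule as CQL source text
    JSON_CHAR TEXT    -- same rule as JSON ELM
);
CREATE TABLE CONCEPT_CQL_LOOKUP (CONCEPT_CD TEXT, CQL_ID INTEGER);
""")

# Placeholder rule: illustrative CQL text and a truncated ELM stub.
elm_stub = {"library": {"identifier": {"id": "MMSE_RANGE"}}}
conn.execute(
    "INSERT INTO CONCEPT_DIMENSION VALUES (?, ?)",
    ("DEMO:MMSE_TOTAL", "MMSE total score"),
)
conn.execute(
    "INSERT INTO CQL_FACT VALUES (?, ?, ?, ?)",
    (1, "MMSE_RANGE", "define InRange: Value >= 0 and Value <= 30", json.dumps(elm_stub)),
)
conn.execute("INSERT INTO CONCEPT_CQL_LOOKUP VALUES (?, ?)", ("DEMO:MMSE_TOTAL", 1))


def rules_for_concept(conn, concept_cd):
    """Return (rule name, parsed ELM) pairs linked to a concept."""
    cursor = conn.execute(
        "SELECT f.CODE_CD, f.JSON_CHAR FROM CQL_FACT f "
        "JOIN CONCEPT_CQL_LOOKUP l ON l.CQL_ID = f.CQL_ID "
        "WHERE l.CONCEPT_CD = ? ORDER BY f.CQL_ID",
        (concept_cd,),
    )
    return [(name, json.loads(elm)) for name, elm in cursor]


rules = rules_for_concept(conn, "DEMO:MMSE_TOTAL")
```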



User Feedback

Finally, the NOTE_FACT table collects user feedback and allows note creation for individual subjects, paving the way for future enhancements like reminder or appointment management systems or a more extensive auditing system.



Front-end (Graphical User Interface)

We have developed a UI that simplifies data input through comma-separated-values (CSV) imports or a CRF-like interface, with the graphical user interface (GUI) designed to focus on individual clinical visits, as demonstrated in [Fig. 2]. Further, the GUI grants administrative control over the user, provider, and location tables and over the concepts, including a link to a SNOMED application programming interface (API). Our UI also streamlines data exchange by facilitating the import and export of all nonobservational data in JSON format.

Observational data can be imported and exported via a standardized CSV file. Additionally, we have incorporated HL7 JSON support for interchange. In the current version, we specifically focus on the HL7-CDA (version 2.0.1) standard, limiting our support to the “Composition” resource type. Within this resource, we employ properties such as “subject,” “event,” and “section” to encapsulate relevant patient and observational data.

By default, observational data are exported in a pseudonymized manner (creating a UID for each exported object).
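
A pseudonymized export of this kind could look as follows. This is a minimal sketch assuming that one random UID is generated per subject, so that observations of the same subject remain linkable within one export. The field name PATIENT_CD and the helper `pseudonymize` are hypothetical; the actual export may assign UIDs at a different level.

```python
import uuid


def pseudonymize(rows, id_field="PATIENT_CD"):
    """Replace the subject identifier in each row with a random UID.

    Returns the pseudonymized rows and the mapping from original identifier
    to UID, which must be stored separately and securely if re-identification
    should remain possible.
    """
    mapping = {}
    exported = []
    for row in rows:
        original = row[id_field]
        if original not in mapping:
            mapping[original] = str(uuid.uuid4())
        exported.append({**row, id_field: mapping[original]})
    return exported, mapping


observations = [
    {"PATIENT_CD": "S001", "CONCEPT_CD": "DEMO:AGE", "VALUE": 71},
    {"PATIENT_CD": "S001", "CONCEPT_CD": "DEMO:MMSE_TOTAL", "VALUE": 27},
    {"PATIENT_CD": "S002", "CONCEPT_CD": "DEMO:AGE", "VALUE": 68},
]
exported, mapping = pseudonymize(observations)
```

Because the UIDs are random, a second export yields different pseudonyms unless the mapping is reused.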



Versatile Data Capture for Clinical and Neuroscience Research

Our RDM system has been designed with a strong focus on versatility, particularly for the collection of diverse clinical and neuroscientific data types. The database accepts standard text, numeric, and date formats, providing a foundation for the collection of behavioral data, psychometric assessments, and clinical findings. It is also equipped to handle raw data types, facilitating the storage of images, PDF documents, and specialized reports. Our data schema covers a wide range of research needs, including laboratory values and subject-specific documents such as privacy statements. Using SQLite's capabilities, the database is theoretically capable of storing more complex datasets such as neuroimaging and genetic information. However, it is important to note that we have not yet validated the system's performance with these larger datasets.
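
Storing a raw document next to structured values can be sketched with an SQLite BLOB column. The table is reduced to four columns; the value-type code "R" for raw data, the concept code, and the sample bytes are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OBSERVATION_FACT ("
    "PATIENT_NUM INTEGER, CONCEPT_CD TEXT, VALTYPE_CD TEXT, OBSERVATION_BLOB BLOB)"
)

# Stand-in for the bytes of a scanned consent form or report;
# in practice these would be read from a file opened in binary mode.
document_bytes = b"%PDF-1.4 demo content"

conn.execute(
    "INSERT INTO OBSERVATION_FACT VALUES (?, ?, ?, ?)",
    (1, "DEMO:CONSENT_FORM", "R", document_bytes),  # "R" = raw data (assumed code)
)

stored = conn.execute(
    "SELECT OBSERVATION_BLOB FROM OBSERVATION_FACT WHERE CONCEPT_CD = ?",
    ("DEMO:CONSENT_FORM",),
).fetchone()[0]
```

The bytes are returned unchanged, so the original file can be reconstructed on export. How well this scales to large neuroimaging files is exactly the open question noted above.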



Implementation and Application of Data Validation to Real-world Clinical Data (Use Case)

We have introduced real-time data validation, incorporating type checking, CQL rule enforcement, and duplicate checks to ensure data integrity directly into the data entry and import routines. In the current version, the class responsible for adding data to the database performs double-entry checks. If a given subject already has a specific concept with the same value, the user is notified within the UI. At this point, the user has the option to either skip the entry or proceed with adding the data. However, since the data are stored in an SQLite database, double entries can also be managed directly through structured query language (SQL) statements.
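
The double-entry check described above can be sketched as follows. The functions `is_duplicate` and `add_observation` and the reduced table layout are hypothetical and do not reproduce the class used in the front-end; values are stored as text for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OBSERVATION_FACT (PATIENT_NUM INTEGER, CONCEPT_CD TEXT, TVAL_CHAR TEXT)"
)


def is_duplicate(conn, patient_num, concept_cd, value):
    """True if the subject already has this concept with the same value."""
    row = conn.execute(
        "SELECT 1 FROM OBSERVATION_FACT "
        "WHERE PATIENT_NUM = ? AND CONCEPT_CD = ? AND TVAL_CHAR = ? LIMIT 1",
        (patient_num, concept_cd, str(value)),
    ).fetchone()
    return row is not None


def add_observation(conn, patient_num, concept_cd, value, allow_duplicate=False):
    """Insert an observation; skip duplicates unless the user confirms them."""
    if is_duplicate(conn, patient_num, concept_cd, value) and not allow_duplicate:
        return False  # the UI would notify the user at this point
    conn.execute(
        "INSERT INTO OBSERVATION_FACT VALUES (?, ?, ?)",
        (patient_num, concept_cd, str(value)),
    )
    return True


first = add_observation(conn, 1, "DEMO:MMSE_TOTAL", 27)                         # new entry
second = add_observation(conn, 1, "DEMO:MMSE_TOTAL", 27)                        # skipped
forced = add_observation(conn, 1, "DEMO:MMSE_TOTAL", 27, allow_duplicate=True)  # user proceeds

# Double entries can also be listed directly with SQL.
duplicates = conn.execute(
    "SELECT PATIENT_NUM, CONCEPT_CD, TVAL_CHAR, COUNT(*) FROM OBSERVATION_FACT "
    "GROUP BY PATIENT_NUM, CONCEPT_CD, TVAL_CHAR HAVING COUNT(*) > 1"
).fetchall()
```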

We implemented these features incrementally using real clinical data from our use case. In total, we used a data matrix with 8,985 data entries from 56 subjects, consisting of 82 different types (concepts). The iterative development of the import function, illustrated in [Fig. 3], systematically addressed data types, concept-conforming data, and erroneous data like invalid character sequences or unsupported special characters.

We logged errors and their frequency at each optimization stage, only modifying the initial data matrix when necessary (e.g., incorrect data type or significant misspelling of concepts/answers). Alterations to the original Excel spreadsheet were successively stored and documented via the KNIME analytics platform.[22]
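
The kind of staged validation and error counting described here can be sketched as follows. The concept dictionary, the CSV layout, and the error categories are assumptions for illustration and do not reproduce the import function or the figures shown in [Fig. 3].

```python
import csv
import io
from collections import Counter

# Hypothetical concept definitions: concept name -> expected Python type.
CONCEPT_TYPES = {"AGE": float, "MMSE_TOTAL": float, "SEX": str}


def validate_rows(csv_text):
    """Split rows into valid observations and a count of errors by category."""
    valid, errors = [], Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        concept = row["concept"].strip().upper()  # tolerate case differences
        expected = CONCEPT_TYPES.get(concept)
        if expected is None:
            errors["unknown_concept"] += 1
            continue
        raw = row["value"].strip()
        if expected is float:
            try:
                value = float(raw.replace(",", "."))  # accept decimal commas
            except ValueError:
                errors["wrong_type"] += 1
                continue
        else:
            value = raw
        valid.append((row["subject"], concept, value))
    return valid, errors


sample = (
    "subject,concept,value\n"
    "S001,AGE,71\n"
    'S001,mmse_total,"27,5"\n'
    "S002,AGE,seventy\n"
    "S002,MMS_TOTAL,26\n"
)
valid, errors = validate_rows(sample)
```

Counting errors per category at each stage makes it possible to see which normalization step (here, case folding and decimal commas) removes which class of error.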



User Feedback on Our Solution

Overall, the survey ([Supplementary Table S4], available in the online version only) results indicated a high level of user satisfaction. Most respondents found the system easy to use and the majority expressed confidence in their ability to use the system without technical support. The integration of different functions within the system was well received and users reported that they could learn to use the system quickly. Users found the process of entering new subjects, visits, and observations simple and straightforward. The layout was considered user-friendly and provided a good overview of relevant data. Users also found it easy to export data from the system.



Specific Use Cases for Our Solution

Herein, we delineate potential application scenarios of our database concept for clinical research queries, with the decision-making process illustrated in [Supplementary Fig. S1] (available in the online version only). First, a data scheme (like a CRF) is created, akin to designing an Excel sheet. If required, more concepts can be added to the CONCEPT_DIMENSION table, which currently comprises over 800 concepts. The researcher can then store data in the database through the front-end, with three primary constellations:

  1. Single user: Ideal for clinical researchers with limited subjects, where the user creates a CRF, stores data in their local SQLite database, and directly manages the data with our solution's database structure and front-end.

  2. Multiple users: A shared central data repository stored on a network drive, where users only access their created subjects, while an administrator can collate and evaluate all data. Role-specific rights control data access, and our solution also facilitates data merging from multiple users for joint analysis.

  3. Multiple users/separate DB: Each user processes different subjects in a local DB version, with data later combined into a main database using the HL7 export and import function. This constellation suits users at disparate locations who need to consolidate their databases later.

Challenges in scenarios 2 and 3 may involve ensuring data consistency and integrity when merging data from multiple users and databases, necessitating clear data management and validation rules. When multiple users work on the same database, conflicts can arise if they modify, add, or delete the same data. A possible solution is a locking mechanism for individual subjects or visits: while one user is editing, the data would be set to “read-only” for other users; this feature is currently under development. In addition, the current version lacks an “undo” function, which will be included in future updates. Merging data from different sources also presents challenges, particularly when dealing with new data, matching data with identical identifiers, or managing different versions of the same data. Currently, the import process is straightforward: if no matching subject exists, the subject and its data are added as new; if a matching subject exists, the imported data are appended to it. If conflicting data with the same creation date are detected, the version with the most recent update is used. However, this approach may not be suitable for all cases, so users are encouraged to implement custom SQL statements or code as needed.
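
The merge rule for imported data can be sketched as follows. This is a minimal sketch assuming that an observation is identified by subject, concept, and creation date, and that timestamps are ISO 8601 strings, which sort chronologically. The field names and the function `merge_observations` are hypothetical.

```python
def merge_observations(main, incoming):
    """Merge imported observations into the main list.

    New observations are appended. If an observation with the same subject,
    concept, and creation date already exists, the most recently updated
    version is kept.
    """
    merged = {(o["subject"], o["concept"], o["created"]): o for o in main}
    for obs in incoming:
        key = (obs["subject"], obs["concept"], obs["created"])
        current = merged.get(key)
        if current is None or obs["updated"] > current["updated"]:
            merged[key] = obs
    return list(merged.values())


main_db = [
    {"subject": "S001", "concept": "AGE", "value": 71,
     "created": "2023-01-10", "updated": "2023-01-10"},
]
imported = [
    # Same observation, corrected later: replaces the older version.
    {"subject": "S001", "concept": "AGE", "value": 72,
     "created": "2023-01-10", "updated": "2023-02-01"},
    # New concept for an existing subject: appended.
    {"subject": "S001", "concept": "MMSE_TOTAL", "value": 27,
     "created": "2023-01-10", "updated": "2023-01-10"},
    # Unknown subject: added as new.
    {"subject": "S002", "concept": "AGE", "value": 68,
     "created": "2023-01-12", "updated": "2023-01-12"},
]
result = merge_observations(main_db, imported)
```

A "latest update wins" rule silently discards the older value, which is one reason why custom handling may be preferable when both versions need to be reviewed.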

Finally, as the database can be stored on a local or network drive, it can be accessed remotely by various means, including remote desktop or network drive access, making it suitable for home office setups.



Discussion

Our work addresses the pressing need for a comprehensive data platform that simplifies the management of clinical research data, with a special emphasis on the needs of clinicians involved in research. Our SQLite-based database solution emerges as a user-friendly, secure, and efficient platform that particularly shines in scenarios where sophisticated server architectures or complex infrastructures are not readily available.

At the inception of our project, we embarked on a comprehensive needs analysis and rigorous internet research to identify the specific requirements and shortcomings of the existing alternatives. This inquiry revealed a pressing demand for a straightforward and intuitive system for data collection and management. The prevalent solutions often suffered from complex installation processes and operational challenges, making them ill-suited for small- and medium-sized projects. They frequently turned out to be excessively complicated and cumbersome, thereby diminishing their appeal for projects of smaller scale.

Acknowledging this gap, our aim was to develop a simpler, user-friendly alternative that offered ease of installation and manageability. Our concept, tested and validated in a clinical use-case scenario, was designed to fulfill the exigencies of real-world practice. To fine-tune our system and cater to user needs effectively, we conducted a user satisfaction survey at the culmination of the project. The invaluable feedback thus gathered will aid us in further refining and optimizing our system.

What do Researchers Need?

Scientific researchers often play multiple roles including data collector, manager, and analyst.[23] However, researchers usually lack proficiency in data management, even though it is crucial in the latter part of the research workflow.[4] Although researchers are experts in their respective fields, their grasp of RDM is usually subpar, as revealed by studies and interviews. Many are unaware of standardized RDM tools and methods. Our survey and another study[1] revealed a need for defined data and metadata standards, set workflows, and a reliable RDM infrastructure. The lack of standardization and automated data exporting and anonymizing often hinders public data sharing. For instance, less than 40% of functional neuroimaging data are openly shared.[23] Implementing RDM standards can enhance the quality and volume of scientific publications.[24]

Nonetheless, many researchers see themselves as key to improving RDM and implementing open science practices.[25] Efforts are underway in Germany, like the establishment of an NFDI, targeting fields such as clinical neuroscience to ensure GDPR-compliant research.[2] The complex process of adopting an RDM culture integrates technological, economic, and political aspects. The increasing relevance of the topic is evidenced by a surge in scientific publications since 2016.[26] Our local project aims to connect to evolving data infrastructure interfaces, emphasizing the use of standard classifications and established representation models for integration.

Our expert interviews with clinical scientists engaged in daily patient care highlighted a persistent need for research tools that seamlessly integrate into existing workplace infrastructures. These tools must effectively compete with ubiquitous applications like Excel. For many researchers, transitioning between their professional and research workspace is not as straightforward as one might assume. Consequently, there is a high demand for a bespoke database solution that functions as a “one-click” alternative to Excel, simplifying the user experience while facilitating scientific inquiry.



Research Data Management in Scientific Practice

In a 2016 review, Perrier and colleagues[27] analyzed a total of 301 articles that examined RDM procedures in academic institutions. They showed that most of the work deals with the creation and initial storage of data (creating data and processing data as described, for example, in the Research Data Lifecycle Framework of the UK Data Archive). It is precisely this aspect of the scientific workflow that we are trying to improve here.

When it comes to managing clinical research data, researchers have a range of options at their disposal, from simple spreadsheet-based solutions to more complex database systems that demand advanced technical expertise to operate. Excel spreadsheets are among the solutions most used by clinical researchers.[28] However, despite being a popular tool for data analysis, Excel lacks the structure and functionality required to manage complex clinical research data. Large datasets can quickly render Excel sheets unwieldy and error-prone, with reported error rates ranging from 7 to 80%.[29] [30] Moreover, spreadsheets do not provide the security and backup functionality needed for sensitive clinical data.

To address these limitations, several popular web-based solutions are available, such as REDCap,[15] OpenClinica,[20] and LORIS.[8] [Table 2] in the methodological section provides a systematic overview of the most common solutions.



Implementing Findable, Accessible, Interoperable, and Reusable Principles and Data Sharing

When collecting clinical research data, researchers must adhere to well-defined standards and rules. The use of different units of the International System of Units (SI) alone presents a common, nontrivial challenge. Only when a high standard is established during data collection can general concepts such as the FAIR principles[5] be effectively implemented.[31]

Our SQLite-based database solution uniquely addresses the challenges of clinical RDM by integrating user needs, technical requirements, and industry standards. It prioritizes clear storage structure, ease of access, data security, and rights management, ensuring an uncomplicated and efficient RDM process.

The solution aligns with the FAIR principles, employing standardized data and metadata formats, facilitating data sharing and integration, and securing role-based access control. Each data observation is linked directly to a clinical concept and a visit, enriching it with valuable metadata, thus aligning with the researchers' “minimum requirement” approach for metadata entry.[27]

Our choice to integrate HL7, particularly HL7/JSON, into our SQLite database improves functionality and adherence to FAIR principles. This globally recognized standard ensures seamless data exchange with other systems, facilitating efficient data sharing and collaboration.[32] The HL7/JSON integration also enhances data accessibility, allowing data to be easily parsed across platforms and applications due to its human-readable format.

Our SQLite database system offers a structured storage system for diverse clinical data types, enabling efficient search and filtering. This minimizes errors due to manual data manipulation. The user-friendly interface allows easy data management without demanding extensive technical knowledge, while the provision for CSV export supports further analysis.

Security and backup features safeguard sensitive clinical data against unauthorized access and loss. The database is stored on a network drive with defined user access, enabling secure, easy access from researchers' clinical workplaces. This is especially beneficial for research in resource-limited settings where advanced server infrastructure may not be available.

The system is also designed for straightforward adoption. By simply customizing the SQLite DB template and installing the front-end, researchers can start data collection and always maintain direct control over their data, making our solution an effective tool in current scientific practice.[2] Moreover, the system's flexible design can easily adapt to future requirements and be repurposed for subsequent studies. Its standardized data template is not only conducive to data exchange but also effectively addresses the crucial need for interoperability.

It is important to note that the current version of the database solution is a small, local project. The current iteration can be seen as a feasibility study designed to directly investigate its implementation, technical architecture, and user satisfaction. While it is possible to use this solution in larger clinical trials with a multicenter approach, for example via shared network drives, it is not currently designed for this purpose. If the project is accepted and actively utilized by our local research groups, we would like to expand it further and implement new features. However, it is crucial to acknowledge that in the context of regulated clinical trials, an extensive certification process for the application would be required.



Meeting a Real-world Scenario: Use Case with Longitudinal Data from Mild Cognitive Impairment Patients

To demonstrate the applicability of our database solution, we selected a use case involving the longitudinal tracking of data from patients with MCI. This real-world scenario includes a variety of data types, including sociodemographic, clinical, laboratory, and neuropsychological data, as well as scores and parameterized neuroimaging data from MRI, CT, and electrophysiological studies such as EEG. We aimed to accommodate multiple classification standards, such as ICD-10, LOINC, and SNOMED-CT, and manage data collected at different points in time.

While the SQLite database is theoretically capable of storing large files, our initial implementation focuses on storing structured reports of processed neuroimaging data. Specifically, we are focusing on VBM, SBM, or functional magnetic resonance imaging (fMRI) analyses presented in JavaScript Object Notation (JSON) or Extensible Markup Language (XML) formats. These structured reports can be fully integrated into the database using custom concept definitions.



Future Directions and Improvements

Looking to the future, there are several potential enhancements and improvements that could be made to the proposed database solution. One possible avenue for development is the direct integration with open science data hubs, allowing for seamless data sharing and collaboration across multiple institutions. This would enable researchers to easily contribute and access data from a centralized repository, promoting greater transparency and reproducibility in research.

In addition to open science integration, another potential area for improvement is the export functionality of the database solution. Currently, the solution supports CSV and HL7 JSON export formats, but there is room to add formats tailored to popular analysis tools such as SPSS, Python, and R. A wider range of supported export formats would give users greater flexibility in conducting further analysis on the collected data.

Exploring the implementation of a Fast Healthcare Interoperability Resources (FHIR) interface could be very beneficial. As an established standard for electronic health information exchange, FHIR could significantly improve the interoperability of our database. This would streamline the exchange and integration of data across different health care systems and platforms. In this context, the Translational Research Informatics and Data-management grid, with its focus on service-oriented architecture and its proven interoperability strategies, offers valuable insights.[33]

Another potential area for improvement is the implementation of automatic lifecycle management and backup routines. Currently, these tasks are performed manually by the database administrator, which may be time-consuming and prone to human error. By automating these tasks, the database solution can ensure greater data consistency and reliability.
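
One possible way to automate the backup step is sketched below, using the online backup support of Python's standard sqlite3 module. The function `backup_database`, the file naming scheme, and the demonstration database are hypothetical and not part of the current release; scheduling (e.g., via an operating system task scheduler) is not shown.

```python
import os
import sqlite3
import tempfile
from datetime import datetime


def backup_database(source_path, backup_dir):
    """Copy an SQLite database into a timestamped backup file."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target_path = os.path.join(backup_dir, f"backup_{stamp}.db")
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    try:
        # Connection.backup copies the database page by page and also
        # works while other connections are using the source database.
        source.backup(target)
    finally:
        target.close()
        source.close()
    return target_path


# Demonstration with a throwaway database in a temporary directory.
work_dir = tempfile.mkdtemp()
db_path = os.path.join(work_dir, "research.db")
demo = sqlite3.connect(db_path)
demo.execute("CREATE TABLE PATIENT_DIMENSION (PATIENT_NUM INTEGER PRIMARY KEY, PATIENT_CD TEXT)")
demo.execute("INSERT INTO PATIENT_DIMENSION VALUES (1, 'S001')")
demo.commit()
demo.close()

backup_path = backup_database(db_path, work_dir)
check = sqlite3.connect(backup_path)
restored_count = check.execute("SELECT COUNT(*) FROM PATIENT_DIMENSION").fetchone()[0]
check.close()
```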

One significant limitation of our database solution is that our primary consultation was with clinical scientists who have limited experience in RDM. However, the inclusion of many senior scientists in our consultations lends credence to the expressed need for a solution that is easy to implement. Such a solution would enable them to integrate RDM practices seamlessly into their existing workflows. Additionally, the transparency of data storage in a local database file may alleviate concerns around data privacy and ownership, potentially making our system more readily acceptable than preexisting, highly integrated alternatives.



Conclusion

In this study, we have developed and evaluated a user-friendly SQLite database with a front-end for streamlined clinical RDM.

While we do not consider our database solution to be superior to existing systems that address FAIR principles, it does offer distinct advantages in certain contexts. Our system is designed as a “one-click” implementation with a locally stored SQLite database, offering a straightforward setup for clinical scientists lacking database management expertise. By lowering the entry barriers in this way, our solution serves as a catalyst for establishing RDM practices in labs that might otherwise be hindered by technical complexity or budget constraints. Consequently, our system adds a layer of FAIR compliance to research environments that may currently lack it, enhancing the overall FAIR landscape.

A user-satisfaction survey confirmed high acceptance among our target group. However, further refinement and evaluation are needed to optimize performance, usability, and data security across varied research applications.



Clinical Relevance Statement

  • Our SQLite-based database delivers a user-friendly, effortlessly installable solution for research data storage, processing, standardization, and sharing.

  • This database proficiently manages multimodal and longitudinal data, and seamlessly integrates clinical classification systems like SNOMED-CT and LOINC.

  • By adhering to FAIR principles and implementing standards for data storage and exchange like HL7-CDA JSON, our database not only streamlines RDM but also enhances the reproducibility and publication of scientific research.



Multiple Choice Questions

Question 1: Which statement accurately reflects the use and limitations of Excel spreadsheets in managing clinical research data?

  a. Excel spreadsheets are ideal for managing complex clinical research data due to their advanced functionality and structure.

  b. Large datasets can make Excel sheets prone to high error rates, ranging from 7 to 80%.

  c. Spreadsheet-based solutions provide the necessary security and backup functionality for sensitive clinical data.

  d. Clinical researchers seldom use Excel spreadsheets for data management.

Answer: b.

Question 2: Which of the following statements best represents one of the FAIR principles for data management?

  a. The “Findable” principle emphasizes the use of persistent identifiers (e.g., DOIs) and rich metadata to ensure that datasets are uniquely identifiable and easily discoverable by both humans and machines.

  b. The “Secure” principle focuses on safeguarding data integrity, confidentiality, and availability.

  c. The “Exclusive” principle emphasizes controlled access to data, allowing only a select group of individuals or organizations to access and use the data.

  d. The “Closed” principle refers to restricting data access to a specific group or organization, thereby limiting its availability and preventing broader reuse.

Answer: a.



Conflict of Interest

None declared.

Protection of Human and Animal Subjects

The study adhered to the World Medical Association Declaration of Helsinki's ethical guidelines for research involving human subjects and received approval from our local Ethics board (Reference: 2022-2658-Daten).



Address for correspondence

Stefan Brodoehl, MD
Department of Neurology, University Hospital Jena
Am Klinikum 1, 07747 Jena
Germany   

Publication History

Received: 14 July 2023

Accepted: 22 November 2023

Accepted Manuscript online: 01 February 2024

Article published online: 27 March 2024

© 2024. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany


Fig. 1 Flow chart illustrating the steps from the assessment of user and technical requirements, the definition of a concrete use case, the literature research, the development, and implementation of a solution concept to the validation of the result. RDM, research data management.
Fig. 2 Illustration of the main views for entering clinical research data into the database using the UI. UI, user interface.