When the Submission Got Real
A principal data engineer at a pharmaceutical sponsor told me about her team's first real-world evidence submission to FDA in 2024. The clinical study had been completed in twelve months. The data engineering work to support the submission had taken twenty-two. The submission package included data from three EHR sources, two claims databases, one wearable device platform, and patient-reported outcomes from a custom app. Each source had its own data shape, its own quality issues, and its own provenance requirements.
She told me the experience taught her something the regulatory training had not emphasized. Real-world evidence regulations describe what the submission has to demonstrate. The data engineering work to produce that demonstration is substantially harder than the regulatory framework suggests. Most of her team's time went to harmonization, lineage, and quality control that the regulations assume but do not specify.
Real-world evidence (RWE) has become a primary mode of clinical evidence generation through 2024 and 2025. FDA approvals based on RWE have increased meaningfully. Industry investment in RWE capabilities has followed. Data engineering practice has caught up more slowly than the regulatory and scientific frameworks.
For organizations building RWE data engineering capability, the patterns that work in 2026 are recognizable. Knowing them up front reduces the gap between submission timeline and execution timeline.
What RWE Pipelines Actually Have to Do
RWE pipelines have specific responsibilities that extend beyond typical analytical pipelines.
The first responsibility is multi-source harmonization that survives regulatory scrutiny. Data from EHRs, claims, devices, patient apps, and other sources all describe the same patients. The harmonization has to produce a unified view that auditors and regulators can examine. The lineage from each source through the harmonized view has to be reconstructable.
The second responsibility is quality control that is documented rather than implicit. Every quality decision (record exclusion, value correction, imputation, outlier handling) has to be traceable. Implicit quality decisions made in code without explicit logging produce gaps that regulatory review surfaces.
The third responsibility is reproducibility at the version level. The pipeline that produced the submission data has to be reproducible from versioned components. The same source data, run through the same pipeline version with the same parameters, has to produce the same results. That reproducibility supports regulatory review that may happen months or years after submission.
The fourth responsibility is privacy compliance that does not degrade scientific value. The data flows have to maintain HIPAA, GDPR, and study-specific privacy commitments. The compliance work cannot undermine the scientific analysis. The two have to coexist.
These responsibilities are not optional for serious RWE work. Pipelines that lack them produce data that may be scientifically interesting but cannot be submitted with confidence.
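One way to make version-level reproducibility checkable is to record a fingerprint of everything that went into a run. The sketch below is a minimal illustration, not a prescribed method: the function name `run_fingerprint`, the file names, the version strings, and the parameter `index_window_days` are all hypothetical.

```python
import hashlib
import json


def run_fingerprint(source_hashes: dict, pipeline_version: str, params: dict) -> str:
    """Return a SHA-256 digest that identifies the inputs of one pipeline run."""
    manifest = {
        "sources": source_hashes,              # e.g. file name -> content hash
        "pipeline_version": pipeline_version,  # e.g. a git tag or commit id
        "params": params,                      # study parameters used for the run
    }
    # sort_keys and fixed separators give one canonical serialization,
    # so dict ordering cannot change the digest.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


params = {"index_window_days": 365}
original = run_fingerprint({"ehr_a.parquet": "ab12", "claims.parquet": "cd34"}, "v2.3.1", params)
rerun = run_fingerprint({"claims.parquet": "cd34", "ehr_a.parquet": "ab12"}, "v2.3.1", params)
changed = run_fingerprint({"ehr_a.parquet": "ab12", "claims.parquet": "cd34"}, "v2.3.2", params)

print(original == rerun)    # True: same inputs, different key order
print(original != changed)  # True: pipeline version differs
```

A matching fingerprint only shows that the inputs to two runs were identical. Confirming that the outputs also match still requires comparing the output data itself, for example by hashing it the same way.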
The Three Patterns That Work for RWE Pipelines
Three architectural patterns handle most production RWE workloads.
The first pattern is staged harmonization with explicit quality gates between stages. Source data lands raw. The first stage applies source-specific cleaning and standardization. The second stage maps to a common data model (OMOP CDM, PCORnet CDM, or sponsor-specific). The third stage applies study-specific transformations. Quality gates between stages document what passed and what failed at each step.
The pattern produces traceable lineage. Auditors can ask about any specific record's path through the pipeline and get a reconstructable answer. The quality gates produce explicit decision points rather than implicit decisions buried in transformations.
The investment is in the staging discipline. Pipelines that collapse multiple stages into a single transformation are harder to trace and invite harder regulatory questions.
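The staged pattern with gates can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: `quality_gate`, `GateResult`, the check names, and `SEX_MAP` are hypothetical, and the mapping table is a made-up stand-in rather than real OMOP or PCORnet vocabulary.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateResult:
    stage: str
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)  # (record, reason) pairs


def quality_gate(stage: str, records: list, checks: dict[str, Callable[[dict], bool]]) -> GateResult:
    """Split records into passed/failed, recording the first failed check by name."""
    result = GateResult(stage)
    for rec in records:
        reason = next((name for name, ok in checks.items() if not ok(rec)), None)
        if reason is None:
            result.passed.append(rec)
        else:
            result.failed.append((rec, reason))
    return result


raw = [
    {"patient_id": " p1 ", "sex": "F"},
    {"patient_id": "", "sex": "M"},
    {"patient_id": "p3", "sex": "unknown"},
]

# Stage 1: source-specific cleaning, followed by a gate on its output.
cleaned = [{**r, "patient_id": r["patient_id"].strip()} for r in raw]
gate1 = quality_gate("source_cleaning", cleaned,
                     {"missing_patient_id": lambda r: bool(r["patient_id"])})

# Stage 2: map to a common vocabulary (hypothetical table), followed by a gate.
SEX_MAP = {"F": "FEMALE", "M": "MALE"}
mapped = [{**r, "sex": SEX_MAP.get(r["sex"])} for r in gate1.passed]
gate2 = quality_gate("cdm_mapping", mapped,
                     {"unmapped_sex": lambda r: r["sex"] is not None})

for gate in (gate1, gate2):
    print(gate.stage, len(gate.passed), [reason for _, reason in gate.failed])
# source_cleaning 2 ['missing_patient_id']
# cdm_mapping 1 ['unmapped_sex']
```

Because every rejected record is kept together with a named reason and the stage where it was rejected, the gate results double as the documentation of what passed and what failed at each step.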
The second pattern is metadata-driven configuration rather than code-embedded logic. Study parameters, inclusion criteria, exclusion criteria, variable definitions all live as configuration. The pipeline reads the configuration and applies it. The configuration is versioned alongside the code.
The pattern produces reproducibility. A specific submission's pipeline runs with specific configuration. Re-running the pipeline with the same configuration produces the same results. Regulatory questions about specific submissions can be answered by examining the specific configuration that ran.
Configuration-driven pipelines also support multiple studies sharing pipeline infrastructure. Each study has its own configuration; the pipeline code is shared. The scalability matters when an organization runs many concurrent studies.
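A minimal sketch of the configuration-driven idea follows. The study ID, config keys, and diagnosis code strings are illustrative placeholders, and `is_eligible` is a hypothetical helper. A real study would carry far richer criteria, but the shape is the same: the criteria live in versioned data, and the code only interprets them.

```python
import json

# In practice this would be a versioned file committed next to the pipeline code.
CONFIG_TEXT = """
{
  "study_id": "EXAMPLE-001",
  "config_version": "1.2.0",
  "inclusion": {"min_age": 18, "any_diagnosis_code": ["E11"]},
  "exclusion": {"any_diagnosis_code": ["O24"]}
}
"""


def is_eligible(patient: dict, config: dict) -> bool:
    """Apply inclusion and exclusion criteria read from configuration."""
    codes = set(patient["diagnosis_codes"])
    inc, exc = config["inclusion"], config["exclusion"]
    meets_inclusion = (
        patient["age"] >= inc["min_age"]
        and bool(codes & set(inc["any_diagnosis_code"]))
    )
    hits_exclusion = bool(codes & set(exc["any_diagnosis_code"]))
    return meets_inclusion and not hits_exclusion


config = json.loads(CONFIG_TEXT)
patients = [
    {"id": "p1", "age": 54, "diagnosis_codes": ["E11"]},
    {"id": "p2", "age": 16, "diagnosis_codes": ["E11"]},         # below min_age
    {"id": "p3", "age": 31, "diagnosis_codes": ["E11", "O24"]},  # hits exclusion
]
cohort = [p["id"] for p in patients if is_eligible(p, config)]
print(config["config_version"], cohort)  # 1.2.0 ['p1']
```

A second study would reuse `is_eligible` unchanged and supply its own configuration file, which is what lets several studies share one pipeline codebase.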
The third pattern is provenance tracking as a first-class concern. Every output record has metadata indicating which source records contributed to it, which transformations applied, which configuration ran, and which pipeline version executed. The provenance is queryable for any output record.
The pattern produces audit defensibility. When a regulator asks where a specific number came from, the answer exists in the provenance metadata. Pipelines without explicit provenance tracking produce answers that require code archaeology rather than data query.
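As a rough sketch of what queryable provenance can look like, the example below attaches the four pieces of metadata named above to each output record. `Provenance`, `ProvenanceStore`, and all identifiers are hypothetical, and the in-memory dictionary stands in for whatever table or metadata service a production pipeline would use.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    source_record_ids: tuple  # which source records contributed
    transformations: tuple    # which transformations applied, in order
    config_version: str       # which configuration ran
    pipeline_version: str     # which pipeline version executed


class ProvenanceStore:
    """In-memory stand-in for a provenance table keyed by output record id."""

    def __init__(self):
        self._by_output = {}

    def record(self, output_id: str, prov: Provenance) -> None:
        self._by_output[output_id] = prov

    def trace(self, output_id: str) -> dict:
        """Answer: where did this output record come from?"""
        return asdict(self._by_output[output_id])

    def outputs_from_source(self, source_record_id: str) -> list:
        """Answer: which outputs did this source record feed into?"""
        return sorted(
            out_id for out_id, prov in self._by_output.items()
            if source_record_id in prov.source_record_ids
        )


store = ProvenanceStore()
store.record("out-001", Provenance(
    source_record_ids=("ehr_a:123", "claims:987"),
    transformations=("clean_ehr", "map_to_cdm", "derive_exposure"),
    config_version="1.2.0",
    pipeline_version="v2.3.1",
))

print(store.trace("out-001")["transformations"])
print(store.outputs_from_source("claims:987"))  # ['out-001']
```

The two query directions matter for different questions: `trace` answers a regulator asking about a specific number, while `outputs_from_source` answers an internal question such as which results are affected when a source record turns out to be wrong.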
The Common Source Categories and Their Specifics
Four source categories dominate RWE pipelines, each with specific engineering considerations.
EHR data presents the deepest harmonization challenges. Different EHR systems use different terminologies, different field structures, different update patterns. EHR-to-common-data-model mapping requires substantial domain expertise plus engineering. The work is unglamorous and load-bearing.
Claims data is more standardized than EHR data through ANSI X12 and similar standards. The harmonization work is lighter. The data quality issues are different (timing of claims, completeness for specific populations, payer-specific quirks). Claims data benefits RWE by providing comprehensive coverage where EHR data may have gaps.
Device and wearable data has grown rapidly through 2024 and 2025. The data is voluminous, sometimes noisy, and protocol-diverse. Continuous glucose monitors, activity trackers, sleep monitors, cardiac monitors all produce data with specific characteristics. The engineering challenge is volume management plus quality control for sensor-specific issues.
Patient-reported outcomes from custom apps or standardized instruments add subjective measures to the objective clinical data. The engineering is generally simpler than the other sources. The integration matters because patient-reported data captures dimensions that clinical data does not.
The pipeline architecture has to handle all four categories cleanly. Most RWE workloads include at least three of these source types.
What the Regulatory Framework Expects
FDA's RWE framework has evolved through 2024 and 2025. The expectations cluster around three areas.
The first area is data relevance to the regulatory question. The RWE has to address the question the submission asks. Generic clinical data is not sufficient; the specific population, intervention, and outcomes have to match the regulatory question. The pipeline produces the specific dataset for the specific question.
The second area is data reliability with documented quality controls. The data has to be reliable enough to support regulatory decisions. The quality controls applied have to be documented. The documentation has to be specific rather than aspirational. "Data quality is maintained" is not sufficient; "specific quality checks applied at specific stages with specific thresholds" is.
The third area is data fitness with appropriate analytical methods. The analytical methods applied to the data have to be appropriate to the data characteristics. Pipelines that produce data without explicit characterization of strengths and limitations make analytical method selection harder.
Regulators outside the US (EMA in the EU, MHRA in the UK, and others) have parallel frameworks with regional specifics. Organizations operating across regulatory regions have to satisfy multiple frameworks simultaneously. The engineering work overlaps substantially across frameworks even when the specific submissions differ.
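The difference between "data quality is maintained" and "specific quality checks applied at specific stages with specific thresholds" can be shown as code. The sketch below declares each check with its stage and threshold and emits a report row per check. The names `Check`, `completeness`, and `run_checks`, the stage label, and the threshold values are all assumptions for illustration, not thresholds any regulator specifies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    name: str
    stage: str
    metric: Callable[[list], float]
    min_value: float  # documented threshold: the metric must be at least this


def completeness(field_name: str) -> Callable[[list], float]:
    """Build a metric: fraction of rows where field_name is present and not None."""
    def metric(rows: list) -> float:
        if not rows:
            return 0.0
        return sum(r.get(field_name) is not None for r in rows) / len(rows)
    return metric


def run_checks(rows: list, checks: list) -> list:
    """Evaluate each declared check and return one report row per check."""
    report = []
    for c in checks:
        value = c.metric(rows)
        report.append({
            "check": c.name, "stage": c.stage, "value": value,
            "threshold": c.min_value, "passed": value >= c.min_value,
        })
    return report


rows = [
    {"patient_id": "p1", "birth_year": 1980},
    {"patient_id": "p2", "birth_year": None},
    {"patient_id": "p3", "birth_year": 1975},
    {"patient_id": "p4", "birth_year": 1990},
]
checks = [
    Check("patient_id_completeness", "cdm_mapping", completeness("patient_id"), 1.0),
    Check("birth_year_completeness", "cdm_mapping", completeness("birth_year"), 0.9),
]
for row in run_checks(rows, checks):
    print(row)
```

Here the `birth_year` check reports a value of 0.75 against a threshold of 0.9 and fails, and that report row is the kind of specific, stage-level documentation the second expectation describes.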
What This Costs at Production Scale
Building production-grade RWE data engineering capability typically requires a dedicated team of eight to twenty engineers depending on study volume. The team spans data engineering, clinical informatics, regulatory affairs, and quality assurance specialties.
The tooling investment includes common data model platforms (OHDSI tooling, custom CDM implementations), pipeline frameworks (typically standard data engineering tools with RWE-specific extensions), and observability infrastructure designed for the regulatory audit requirements.
The total annual cost for serious RWE capability typically lands in the $3M-$15M range for pharmaceutical sponsors. The cost is justified by the regulatory submissions the capability supports.
The alternative cost is the cost of submission delays or regulatory findings that pipeline issues produce. Each delayed submission has both direct cost and opportunity cost. Each regulatory finding that traces to data engineering has remediation cost plus potential approval impact.
What Logiciel Does Here
Logiciel works with pharmaceutical, biotech, and CRO data engineering teams building or modernizing RWE capabilities. The work is typically structured around pipeline architecture assessment, regulatory readiness review, and capability buildout where gaps exist.
The Data Engineering for Healthcare framework covers the broader healthcare data engineering patterns that RWE extends. The Healthcare AI Implementation framework covers the AI applications that often connect to RWE pipelines.
A 30-minute working session is enough to assess your current RWE data engineering capability against production-grade patterns.
Frequently Asked Questions
What common data model should I use?
OMOP CDM has the broadest adoption and the largest ecosystem. PCORnet CDM fits research network workflows. Sponsor-specific CDMs sometimes make sense for organizations with specific therapeutic area focus. The choice usually follows research collaboration patterns more than technical capability.
How do I handle data from sources that resist sharing?
Through federated analysis patterns. The pipeline brings analysis to data rather than data to analysis. Frameworks like OHDSI support this pattern. The architecture is more complex than centralized analysis but addresses data sharing constraints.
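To make "analysis to data" concrete, here is a deliberately simplified sketch in which each site computes a local summary and only those aggregates leave the site. It is not the OHDSI API; `local_summary` and `pooled_mean` are hypothetical names, and real federated studies add governance, disclosure controls, and far more involved statistics.

```python
def local_summary(values: list) -> dict:
    """Runs inside each data holder's environment; only aggregates are returned."""
    return {"n": len(values), "sum": sum(values)}


def pooled_mean(summaries: list) -> float:
    """Runs at the coordinating center; it never sees patient-level rows."""
    total_n = sum(s["n"] for s in summaries)
    if total_n == 0:
        raise ValueError("no observations across sites")
    return sum(s["sum"] for s in summaries) / total_n


# Patient-level values stay at each site (made-up numbers).
site_a_values = [1.0, 2.0, 3.0]
site_b_values = [5.0]

summaries = [local_summary(site_a_values), local_summary(site_b_values)]
print(pooled_mean(summaries))  # 2.75
```

The pooled result equals what a centralized analysis would compute for this statistic, which is the point of the pattern. Statistics that cannot be decomposed this way are where the additional architectural complexity comes in.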
What about synthetic data for RWE?
Synthetic data plays a role in pipeline development and method validation. It does not substitute for real data in regulatory submissions. The regulatory framework does not generally accept synthetic data as evidence for the questions RWE submissions address.
How does this work for rare disease research?
Rare disease RWE faces additional challenges due to small populations. The data engineering has to handle scarcity carefully. Federated approaches that combine data from multiple sources without centralizing often work better than centralized approaches at small population sizes.
What is the timeline for RWE data engineering work supporting a submission?
Variable. For mature organizations with established pipelines, six to nine months. For organizations building capability for the first submission, eighteen to twenty-four months. The opening story's twenty-two months sits within that first-submission range.
Sources:
- FDA, "Framework for FDA's Real-World Evidence Program," 2024 update
- OHDSI documentation, 2024