Author information
- Randi Foraker, PhD, MA (a,∗)
- Douglas L. Mann, MD, Editor-in-Chief, JACC: Basic to Translational Science (b)
- Philip R.O. Payne, PhD (a)
- (a) Institute for Informatics, Washington University School of Medicine, St. Louis, Missouri
- (b) Center for Cardiovascular Research, Cardiovascular Division, Washington University School of Medicine, St. Louis, Missouri
- ∗ Address for correspondence:
Dr. Randi Foraker, Institute for Informatics, Washington University School of Medicine, 4444 Forest Park Avenue, Suite 6318, St. Louis, Missouri 63110.
As noted in this Editor’s Page previously, the rising cost of developing new cardiovascular therapies cannot be sustained in the long term (1). Accordingly, there is a critical need for new methodologies that can improve the speed, efficiency, and success rate of efforts to develop new therapeutic strategies for cardiovascular disease (2). Although randomized clinical trials remain the gold standard to evaluate drug responsiveness, phase III clinical trials are costly, due in part to the large numbers of patients who must be enrolled and the long follow-up period needed to detect meaningful differences in survival or clinical outcomes (3). As an alternative to randomized clinical trials, clinical effectiveness studies can be conducted to evaluate drug responsiveness in diverse patient populations (4,5). Such pragmatic approaches to evaluate drug responsiveness can be randomized or nonrandomized. If validly conducted, such studies can provide decision-makers with evidence from patients who are representative of those presenting to a clinic for a particular problem, thus accelerating translation into general clinical practice.
Treatment decisions that must be made by clinicians include: of existing treatments, which is best for an individual patient; what is the best treatment approach for patients with certain medical conditions; and how does one treatment compare with other existing alternatives? Ideally, clinicians would have the ability to query the electronic health record or another patient database for treatment efficacy from a population of similar patients in order to guide treatment decision-making for an individual patient (6). In the absence of such data, or of data from clinical trials, the quality of evidence available to answer these critical questions is frequently insufficient. Rarely are studies conducted to assess treatment effectiveness or patient outcomes in real-world practice settings, and often trials are neither designed nor powered to evaluate the comparative effectiveness of treatments (7).
To fill this gap, data are needed, not only to know how best to treat individual patients, but also to develop and refine evidence-based treatment guidelines. Decision-makers in need of this information include policymakers, payers, health care organizations, clinicians, and patients. To precisely estimate effect sizes, researchers must have access to sufficiently large and representative datasets. Although data sharing is an option to increase the sample size of an eligible study population, many institutions lack the infrastructure and support to do so (8). As a result, there are few networks of investigators who are willing and able to share data at the necessary scale in order to study drug responsiveness. This is a critical obstacle to progress, as data re-use and data sharing are essential for multisite, generalizable insights.
Synthetic data derivatives offer one potential solution to the aforementioned problems (9). Synthetic datasets are generated from existing datasets and maintain the statistical properties of the original dataset. Importantly, rows of observations in synthetic datasets do not correspond to identifiable individuals (rows of data) from the original dataset. Thus, synthetic data derivatives are quantitatively similar in their statistical properties to patient-derived datasets, yet cannot be linked to the individuals from whom the data were derived (9). Because synthetic data contain no protected health information, the datasets can be shared freely among investigators or those in industry, without raising patient privacy concerns. In addition, research conducted using synthetic derivatives does not require institutional review board approval.
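One simple way to picture a synthetic derivative is to fit a statistical model to the original rows and then sample entirely new rows from the fitted model. The Python sketch below is illustrative only: the two columns (age and systolic blood pressure), the simulated "original" cohort, and the bivariate normal model are all assumptions made for this example, not the method used by any particular data synthesis platform.

```python
import math
import random
import statistics

def fit_bivariate_normal(rows):
    """Estimate the means, standard deviations, and Pearson correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)

def synthesize(rows, n, seed=0):
    """Draw n new rows from the fitted model; no original row is copied."""
    mx, my, sx, sy, rho = fit_bivariate_normal(rows)
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into the second column so the fitted correlation is preserved.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        out.append((x, y))
    return out

# Hypothetical stand-in for a patient-derived dataset: (age, systolic BP) pairs.
rng = random.Random(42)
original = []
for _ in range(500):
    age = rng.gauss(60, 10)
    sbp = 100 + 0.5 * age + rng.gauss(0, 8)
    original.append((age, sbp))

synthetic = synthesize(original, 500, seed=1)

print("original :", [round(v, 2) for v in fit_bivariate_normal(original)])
print("synthetic:", [round(v, 2) for v in fit_bivariate_normal(synthetic)])
```

In this toy setting the synthetic rows reproduce the means, spreads, and correlation of the original cohort to within sampling error, while no synthetic row is a copy of an original one. Real clinical data would call for far richer models than a two-variable normal distribution.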
Notably, data synthesis differs from the anonymization or de-identification of protected health information, which relies on the removal or obfuscation of identifiable data elements (10). Alternative approaches to synthetic derivatives include establishing a data enclave with restricted access and data-sharing requirements, or limiting access to only data that are relevant to a specific research question (11). None of these alternatives ensures data privacy, because de-identified data can be re-identified through linkage to another data source, and security and confidentiality breaches can occur even with limited access to protected systems.
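The linkage risk described above can be made concrete with a small sketch. In the Python example below, every record, field name, and the choice of quasi-identifiers (ZIP code, birth year, sex) is invented for illustration; it shows only the general mechanism by which a de-identified table can be joined to an outside source that still carries names.

```python
# Hypothetical de-identified clinical rows: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "63110", "birth_year": 1956, "sex": "F", "diagnosis": "heart failure"},
    {"zip": "63110", "birth_year": 1971, "sex": "M", "diagnosis": "hypertension"},
    {"zip": "63130", "birth_year": 1956, "sex": "F", "diagnosis": "atrial fibrillation"},
]

# Hypothetical outside source (for example, a public roll) that includes names.
public = [
    {"name": "Person A", "zip": "63110", "birth_year": 1956, "sex": "F"},
    {"name": "Person B", "zip": "63130", "birth_year": 1956, "sex": "F"},
    {"name": "Person C", "zip": "63105", "birth_year": 1980, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link(deid_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where the quasi-identifiers match exactly one person."""
    index = {}
    for person in public_rows:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for row in deid_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the row
            matches.append((names[0], row["diagnosis"]))
    return matches

reidentified = link(deidentified, public)
for name, diagnosis in reidentified:
    print(name, "->", diagnosis)
```

Two of the three invented rows are re-identified here even though no names were stored with the clinical data, because the combination of quasi-identifiers is unique in the outside source.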
Using a data synthesis platform allows for the linkage of multiple sources of data before producing a synthetic derivative, and reduces data ownership concerns when combining data across organizational boundaries. Having the capability to combine datasets before synthesis results in a data product that provides a more comprehensive view of the patient, and facilitates the evaluation of factors related to drug responsiveness, including those of health care quality and patient safety. For researchers, the ability to produce and share synthetic datasets can shorten the idea-to-insight time from years (as with expensive, lengthy clinical trials) to hours, and can lessen legal and ethical barriers to data sharing. Access to synthetic data not only makes research more efficient, but also has great potential to save time and money in drug development and in studies of drug responsiveness.
Can synthetic data be used to evaluate drug responsiveness? One of the major difficulties in developing new therapies relates to the inherent fragility of phase II trials. Because of cost constraints, the sample size of patients enrolled in early-phase trials is relatively small, and the number of drug doses that can feasibly be studied is often limited. The size of phase II trials also restricts the range of endpoints that one can measure to gauge clinical effectiveness. Further, phase II trials are often performed in large academic medical centers that serve as tertiary and quaternary referral centers, where the patient population may differ substantially from the populations studied in larger phase III trials.
Although speculative, one immediate application of synthetic datasets in phase II studies could be to generate groups of control patients that faithfully mimic the patients who are receiving active therapy in early-phase clinical trials. If properly designed, these studies could be performed in a randomized, double-blind manner. Bayesian statistical methods could then be used to compare the response of patients receiving active therapy to patients enrolled in a synthetic control group. This would allow investigators to prioritize their precious resources to enroll more patients in the active therapy arms, which would also mitigate some of the statistical problems that occur when using small control groups that do not complement the demographics of the disease being studied. Another way in which synthetic data could be used is in the context of large-scale and pragmatic trials that evaluate novel targeted therapies that involve genomic targets, insofar as conventional randomized clinical trials are often impracticable because of the large sample sizes that are required to demonstrate clinical effectiveness in this setting (12,13). Lastly, one can imagine using synthetic datasets to predict trends in rare diseases, which in turn could be used to design appropriately powered clinical trials that target clinically meaningful endpoints.
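As a rough illustration of the Bayesian comparison mentioned above, the Python sketch below uses one of the simplest possible models: a binary response in each arm, a uniform Beta(1, 1) prior on each response rate, and Monte Carlo draws from the two posteriors. The model choice, the function name, and the response counts are all hypothetical and are not taken from any trial or from the cited work.

```python
import random

def prob_active_better(active_resp, active_n, control_resp, control_n,
                       draws=20000, seed=0):
    """Estimate P(response rate on active therapy > rate in the synthetic control arm).

    Each arm gets a Beta(1, 1) prior, so its posterior is
    Beta(1 + responders, 1 + non-responders).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_active = rng.betavariate(1 + active_resp, 1 + active_n - active_resp)
        p_control = rng.betavariate(1 + control_resp, 1 + control_n - control_resp)
        if p_active > p_control:
            wins += 1
    return wins / draws

# Hypothetical counts: a small active arm and a larger synthetic control arm.
posterior_prob = prob_active_better(active_resp=30, active_n=60,
                                    control_resp=80, control_n=240)
print(f"P(active better than synthetic control) ~ {posterior_prob:.3f}")
```

Because the control arm is synthetic, it can be made larger than the active arm at little cost, which narrows the posterior for the control response rate. A real analysis would also need to account for how closely the synthetic controls match the enrolled patients.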
What are some of the limitations of using synthetic data to evaluate drug responsiveness? One potentially important limitation is that whereas synthetic models derived from existing datasets may replicate certain general trends of the dataset, they may not necessarily be able to predict specific trends within a dataset (e.g., all-cause death vs. cardiovascular death). Although this limitation remains theoretical at present, it may be problematic with respect to using synthetic datasets to evaluate novel therapeutics. Whether this issue can be addressed satisfactorily by creating a larger derivative dataset, one that contains enough outcomes of interest to estimate drug effects accurately, remains an important question that will require further study. Second, there is no consensus about how best to create synthetic datasets. Fully synthetic datasets do not contain any original data, whereas partially synthetic datasets may only de-identify or anonymize sensitive values. There are theoretical advantages and disadvantages to both approaches; however, there is no information with respect to which approach is better for predicting drug responsiveness. Lastly, at the time of this writing, the Food and Drug Administration has not yet approved the use of synthetic datasets for registration studies: it is simply too soon.
Reducing the cost of developing new cardiovascular therapies will require fundamental changes to the way in which we conduct preclinical and clinical trials in order to make them faster, cheaper, and more adaptable. Here, we suggest that the use of synthetic data derivatives may help with the development of new cardiovascular drugs. As always, we welcome comments and suggestions from investigators in academia and industry, patients, societies, and governmental regulatory agencies on the potential role of synthetic data in translational medicine, either through social media (#JACC:BTS) or by e-mail ( ).
- 2018 The Authors
- Mann D.L.
- Embi P.J.,
- Kaufman S.E.,
- Payne P.R.
- Longhurst C.A.,
- Harrington R.A.,
- Shah N.H.
- Mulder R.,
- Singh A.B.,
- Hamilton A.,
- et al.
- Krumholz H.M.
- U.S. Department of Health and Human Services