Availability of secondary healthcare data for conducting pharmacoepidemiology studies in Colombia: A systematic review

Abstract Real‐world evidence (RWE) is emerging as a fundamental component of the post‐marketing evaluation of medicinal products. Even though the focus on RWE studies has increased in Colombia, the availability of secondary data sources to perform this type of research is not well documented. Thus, we aimed to identify and characterize secondary data sources available in Colombia. We performed a systematic literature review on PubMed, EMBASE, and VHL using a combination of controlled vocabulary and keywords for the concepts of electronic health records, epidemiologic studies and Colombia. A total of 323 publications were included. These comprised 123 identified secondary data sources including pharmacy dispensing databases, government datasets, disease registries, insurance databases, and electronic health records, among others. These data sources were mostly used for cross‐sectional studies focused on disease epidemiology in a specific population. Almost all databases (95%) contained demographic information, followed by pharmacological treatment (44%) and diagnostic tests (39%). Even though the database owner was identifiable in 94%, access information was only available in 44% of the articles. Only a pharmacy‐dispensing database, local cancer registries, and government databases included a description regarding the quality of the information available. The diversity of databases identified shows that Colombia has a high potential to continue enhancing its RWE strategy. Greater efforts are required to improve data quality and accessibility. The linkage between databases will expand data pooling and integration to boost the translational potential of RWE.

data sources has multiple applications in therapeutics, including drug utilization studies, 2,3 post-authorization safety studies, 4 comparative effectiveness, 5 and cost of interventions. 6 Initiatives related to RWE across the world have highlighted its role as a valuable complement to the evidence generated in randomized controlled trials (RCTs). 7 The United States (US) and Europe have the highest number of healthcare-related data sources, probably due to the structure of their healthcare systems and the legal framework which facilitates the electronic collection of routine clinical care data. 7,8 In Latin America, there is an increasing demand for effective and innovative treatments that drives the requirement for the continuous evaluation of their safety and effectiveness. 9 A previous assessment of the status of RWE in Latin America has shown that, even though there are patient-level data resources available, their quality varies and they are locally managed without standardized coding and practices. 10 In Colombia, the situation is very similar. Even though there are government-led information systems and a single-payer healthcare system, the collection of data is not always complete. 10,11 The Colombian health system includes a social security system with public funding and a less prominent private sector. Affiliation to the healthcare system is mandatory and is done through the health insurance companies (EPS-entidades promotoras de la salud, in Spanish), which manage the healthcare provision given by the healthcare settings. The goal of the Colombian healthcare system is to provide healthcare access to its entire population through a list of regimes that cover workers, low-income population, and special populations such as armed forces. 12,13 Hence, the system management in Colombia is somewhat centralized, although healthcare provision is scattered across various public and private entities. 
In order to improve and enhance the use of RWE in Colombia, it is important to have more information on the resources that could be used for such purposes. Hence, the objective of this study was to perform a systematic literature review to identify and describe the characteristics of the secondary healthcare data sources available in Colombia that have been used to date for conducting overall epidemiology and pharmacoepidemiology studies.

| Search strategy
A systematic literature review was carried out following the PRISMA guidelines. 14 A comprehensive search strategy for peer-reviewed articles was performed in three databases: PubMed, EMBASE, and the Virtual Health Library (VHL), which includes Latin American sources. The search was conducted by the authors in December 2018. A combination of controlled vocabulary and keywords was used for the concepts of electronic health records (EHR), epidemiologic studies and Colombia; the Boolean operators "AND" and "OR" were used to combine these concepts (Supplementary Material 1). Given that, to the best of our knowledge, this is the first review of these characteristics conducted in Colombia, the search was not limited by language or by publication date, other than the December 2018 search date. All citations were imported into a citation management system and duplicates were removed.
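The way the three concepts are combined can be sketched in a few lines of Python. This is only an illustration: the actual search strings are given in Supplementary Material 1 and are not reproduced here, so the terms below and the helper name `build_query` are hypothetical placeholders. The sketch assumes the usual convention of joining synonyms within a concept with "OR" and joining the concepts with "AND".

```python
def build_query(concepts):
    """Join terms within a concept with OR, and the concepts with AND."""
    blocks = []
    for terms in concepts.values():
        quoted = [f'"{term}"' for term in terms]
        blocks.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(blocks)


# Placeholder terms only; the real vocabulary is in Supplementary Material 1.
concepts = {
    "ehr": ["electronic health records", "medical records"],
    "design": ["epidemiologic studies", "cohort studies"],
    "country": ["Colombia"],
}

query = build_query(concepts)
print(query)
```

Each database has its own field tags and controlled vocabulary, so a string built this way would still need adapting per database.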

| Eligibility criteria
Analytical studies performed on secondary data sources originating from Colombia were considered for inclusion in the study; these include pharmacoepidemiological studies, pharmacoeconomic studies, and safety studies, among others. Articles were considered to use secondary data sources if they analyzed data already collected for other purposes, including healthcare records, administrative and commercial databases, and disease and drug registries. Finally, databases covering several countries and multi-databases were also considered if they included information from Colombian patients.
The following articles were excluded from the review: studies performed under a "primary data collection" approach, defined as collection of data specifically for a particular study, 15 studies that focused on single-patient information (eg case reports, case series), pharmacoeconomic models, review articles, policy-related articles, and studies not involving Colombian data.
The authors performed the data extraction in Microsoft Excel.
In order to increase data quality, all the non-free text cells were blocked for data entry to decrease the likelihood of entry errors and a subsequent quality assurance was performed by reviewing a sample of 50% of the extraction.
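A 50% quality-assurance sample of this kind could be drawn reproducibly as sketched below. The text does not state how the sample was selected, so random selection, the fixed seed, the use of the 323 included articles as the sampling unit, and the helper name `draw_qa_sample` are all assumptions made for illustration.

```python
import math
import random


def draw_qa_sample(record_ids, fraction=0.5, seed=0):
    """Return a reproducible random subset of records for a second review."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    sample_size = math.ceil(len(record_ids) * fraction)
    return sorted(rng.sample(record_ids, sample_size))


# Hypothetical identifiers for the 323 extracted articles.
record_ids = list(range(1, 324))
qa_sample = draw_qa_sample(record_ids)
print(len(qa_sample), "of", len(record_ids), "records selected for review")
```

Fixing the seed lets a second reviewer regenerate exactly the same subset.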

| RESULTS
The PubMed, EMBASE, and VHL searches yielded a total of 1294 publications. Of these, 159 were duplicates, and thus the titles of 1135 articles were screened. Among these, 351 articles were excluded after title screening and the abstracts of the 784 remaining papers were assessed. Finally, 461 publications were excluded following the eligibility criteria and thus 323 articles that had interpretable data and fulfilled the eligibility criteria were used for data extraction 20-342 (Figure 1).
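The selection counts reported above can be checked with simple subtraction. The short sketch below uses only the figures from this paragraph; the variable names are illustrative.

```python
# Counts reported in the article-selection paragraph.
identified = 1294
duplicates = 159
excluded_on_title = 351
excluded_on_eligibility = 461

titles_screened = identified - duplicates
abstracts_assessed = titles_screened - excluded_on_title
included = abstracts_assessed - excluded_on_eligibility

print(titles_screened, abstracts_assessed, included)
```

The three derived values match the 1135 titles screened, 784 abstracts assessed, and 323 articles included.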
Until December 2018, 323 publications reported using secondary data sources to conduct epidemiology studies in Colombia. Since we set no limits with regard to publication date, the first study identified in this systematic literature review (SLR) was published in 1967 and corresponds to a mortality study in children from Cali based on the death certificates of the local vital statistics office. 92 The number of publications remained low until the 2010s, after which a total of 296 articles were published, corresponding to approximately 92% of the articles identified. Two articles were accepted for publication in 2018 but the journal published them in 2019 (Figure 2). Most publications were full-text articles (71%), and 29% were posters, mainly from scientific conferences. Only 34 articles (10%) included information on additional countries besides Colombia. 23,50,54,59,61,62,80-83,85,86,159,160,169,178,181,198,206,241,260,265,267,278,279,292,304,305,307,326,331,333,334,337 The majority of the studies were cross-sectional (61.6%), followed by cohort studies (11.5%). The main objective was often related to the description of disease epidemiology (70%), followed by drug utilization studies (20%). The proportion of drug effectiveness and safety studies was low, accounting for 3% and 6%, respectively.
A total of 123 databases were identified as the data source (Supplementary Material 2). The most frequently used database was from a pharmacy dispensing company, in 52 publications. This was followed by government databases (vital statistics in 29 publications and Ministry of Health datasets in 15). Finally, a local oncology disease registry was used in seven (7) publications. The main characteristics of the 123 databases identified are described in Table 1. The most common therapeutic areas were infections and neoplasms, both in 12%, followed by cardiac and neurologic disorders in 8% and endocrine disorders in 6% (Supplementary Material 3). Furthermore, even though the database owner was identifiable in 94%, access information and/or requirements were only available in 44% of the articles.

| Global and regional data sources
A total of 18 databases with global or Latin American regional scope were identified (Table 2), including the World Health Organization's mortality database and the International Agency for Research on Cancer (IARC) report, 82,159,267,278,279 as well as several Latin American disease registries and healthcare intervention data sets.

| Government Databases
Government databases were commonly used as secondary data sources in the articles identified. A total of 88 articles (27%) reported data from a government agency. The most frequently used data source was the vital statistics information, mainly on mortality, from the National Administrative Department of Statistics (DANE). 28,30,31,38,39,45,66,69,71,74,89,93,165,171,199,200,210,221,225,228,229,252,262,268,269,274,287,303,317,321,339 The DANE collects the information through birth and death certificates. SISPRO integrates more than 10 primary sources of health-related information in a single query system. 38,39,200,266,344 Access to SISPRO requires a client access server, which allows the information to be stored on a local computer. Within this information system, the RIPS (Individual Registry of Health Services) contains data on age, gender, and medical diagnosis by ICD-10 for patients treated by the health system (public and private providers). 64,79,87,147,154,170,197,216,217,219,229,262,275,314,338 The High Cost Account (CAC), which was created by the government in 2007 and is administered by insurers, collects healthcare data on high-cost diseases (eg cancer, diabetes, hemophilia, rheumatic diseases). 21,22,132,212,234,243,248,250,251,277 This database has a quality control process (audit through a validation mesh and verification against the medical record) and contains demographic data, diagnosis by ICD-10 and prescriptions.
Regarding health insurance companies, a total of 27 articles reported the use of claims data as the source of data. 26 In most of the articles (75%), the specific health insurance company was not identifiable.

| Local Disease and Drug Registries
There were 19 local disease and drug registries identified in 36 publications (Table 3). The most commonly used local registry was the population-based cancer registry (PBCR) in Cali. 36 The collection of the data is performed through active search and notification from hospitals, public and private laboratories, and the DANE. This database contains patients' demographic information and clinical diagnosis by ICD-10, used to assess cancer morbidity and mortality. Additional PBCRs were created in Bucaramanga, Manizales, Pasto, and Barranquilla, with a coverage of 12% of the population. 166,318,319 Furthermore, the Cali, Bucaramanga, Manizales, and Pasto PBCRs follow IARC's standards and are sources of its worldwide cancer incidence report.

FIGURE 1 Flowchart of article selection. VHL: virtual health library
In addition to the oncology databases, there were registries identified in diseases including heart failure (RECODEC 290,291 and ROCI 308 registries), trauma, 180

| Commercial and Miscellaneous Databases
IMS Health was among the commercial databases identified in the SLR, with a total of two (2) publications involving Colombian patients' data 206,260 on drug utilization using retail prescription sales data. Among the miscellaneous data sources (ie databases not fitting any of the earlier sections), we found the Healing the Children (HTC) organization database that includes data on patients with cleft disease; 143 a database on genetic diseases 29

database on primary central nervous system tumors from pathology reports. 213 Government databases were also commonly used as secondary data sources. For many decades, the country has been implementing information systems that allow the government to capture diverse health-related data at a population level. 229 Access to these databases is readily available and they include a large amount of patient-level data. 347 Taking into account that in Colombia access to healthcare is universal, the entire population is covered by the data collection and non-participation should be minimal. 266 However, those without access to care or who fail to encounter the health system (eg because of geographical location) will not be captured.

| DISCUSSION
Additionally, being passive reporting systems, these databases rely on appropriate reporting behaviors and thus there is a risk of underreporting. 322,348 In contrast to passive government databases, disease registries are based on active finding of all new cases of disease from a well-defined demographic area, which could improve data reliability. 322 In Colombia, several local disease registries were identified, mainly in oncology. 36,318 While these registries currently focus on cancer incidence and mortality, it would be ideal if in the future they included data on cancer treatment to expand the scope of the endpoints analyzed.
The main challenge with secondary data sources is that much of the available data currently resides in separate silos. 349 The absence of shared identifiers between the different types of databases prevents information linkage among heterogeneous data sources, 346,349 which can be attributed not only to technical difficulties, but also to privacy concerns. 346,349 Moreover, interpretations of results coming from studies using secondary data sources will continue to encounter distrust, mainly due to data quality issues that could introduce bias originating from confounding, missing data and misclassification. 345,346,349 A knowledge gap related to data quality was identified in this study, since only a pharmacy dispensing company, the PBCRs and the government databases included a description of quality control measures or of the overall quality of the information available.
The present SLR had some limitations that need to be accounted for when interpreting its results. First, the analyses were made at the article level and not the database level, given the methodology of the SLR. This means that only the data sources  strategy was not able to capture earlier studies that did not properly describe the data source used, which could lead to an underrepresentation of some databases. Nonetheless, this could also be a representation of an actual increase in research productivity using secondary data sources, following the trend observed in other relevant countries (eg US).

| CONCLUSIONS
With the increasing access to and use of these data sources, it is crucial that the evidence generated is made publicly available and that access to the data is granted to a larger research community, to the extent that this is possible, and assured through governance processes and ethics standards. A greater focus on expanding the use of these databases is required to increase their visibility in order to boost the translational potential of RWE in Colombia. The governance process to access most of the databases identified was poorly described or not described at all. Moreover, replicating this SLR in other Latin American countries would contribute to the exploration of the status quo of RWE in the region that is required to further define and describe the databases available for pharmacoepidemiology research.

ACKNOWLEDGEMENTS
This work was supported by Bayer. The funder (Bayer AG) had no role in the study design, the collection, analysis and interpretation of data, the writing of the report or the decision to submit the arti-

CONFLICT OF INTEREST
Juan-Sebastian Franco and David Vizcaya are full-time employees of Bayer Colombia and Hispania (Spain), respectively.

AUTHOR CONTRIBUTIONS
All authors conceptualized and designed the study, analyzed and interpreted the data, and reviewed and revised the manuscript. Dr Franco drafted the initial manuscript. All authors approved the final manuscript as submitted and agree to be accountable for all aspects of the work.

ETHICS STATEMENT
The authors state that no ethical approval was needed.

DATA AVAILABILITY STATEMENT
Research data are not available for sharing.

Abbreviations: MedDRA, Medical Dictionary for Regulatory Activities; SOC, system-organ classification; ICD-10, International Statistical Classification of Diseases and Related Health Problems version 10.