The US National Tuberculosis Surveillance System: A Descriptive Assessment of the Completeness and Consistency of Data Reported from 2008 to 2012

Background: In 2009, the Tuberculosis (TB) Information Management System transitioned into the National TB Surveillance System to allow use of 4 different types of electronic reporting schemes: state-built, commercial, and 2 schemes developed by the Centers for Disease Control and Prevention. Simultaneously, the reporting form was revised to include additional data fields. Objective: Describe data completeness for the years 2008-2012 and determine the impact of surveillance changes. Methods: Data were categorized into subgroups and assessed for completeness (eg, the percentage of patients dead at diagnosis who had a date of death reported) and consistency (eg, the percentage of patients alive at diagnosis who erroneously had a date of death reported). Reporting jurisdictions were grouped to examine differences by reporting scheme. Results: Each year less than 1% of reported cases had missing information for country of origin, race, or ethnicity. Patients reported as dead at diagnosis had death date (a new data field) missing for 3.6% in 2009 and 4.4% in 2012. From 2010 to 2012, 313 cases (1%) reported as alive at diagnosis had a death date and all of these were reported through state-built or commercial systems. The completeness of reporting for guardian country of birth for pediatric patients (a new data field) ranged from 84% in 2009 to 88.2% in 2011. Conclusions: Despite major changes, completeness has remained high for most data elements in TB surveillance. However, some data fields introduced in 2009 remain incomplete; continued training is needed to improve national TB surveillance data. (JMIR Public Health Surveill 2015;1(2):e15) doi: 10.2196/publichealth.4991


Introduction
Tuberculosis (TB) incidence (or case notification) is used globally for monitoring trends, planning, and evaluating public health programs [1,2]. In the United States, national incidence reporting began in 1953, with documented cases and operational data from each reporting jurisdiction submitted in aggregate [3]. By 1985, all jurisdictions were reporting individual cases using a standardized form, the Report of Verified Case of Tuberculosis (RVCT) [4]. In 1993, the RVCT was expanded to include additional risk factors and laboratory information, and TB surveillance data began to be entered and transmitted to the Centers for Disease Control and Prevention (CDC) through a single software system [5].
The US National Tuberculosis Surveillance System (NTSS) underwent major revisions in 2009 [6]. RVCT was expanded to include 11 new data fields, and 25 of 38 existing fields were modified. Concurrently, state and local reporting areas transitioned from reporting TB case data through the Tuberculosis Information Management System (TIMS), a stand-alone, modem-based system developed at the CDC, to their choice of 4 reporting schemes: (1) the National Electronic Disease Surveillance System (NEDSS)-base system, a CDC-developed infrastructure; (2) the electronic RVCT (eRVCT), also developed by the CDC; (3) state-developed custom software systems; or (4) commercial software developed by private companies. All reporting schemes were required to conform to specific Public Health Information Network and NEDSS data standards [7,8].
The transition from a single reporting scheme to a choice of different types of schemes gave state and local TB programs more control over the structure of their surveillance systems and made them responsible for their own data validation [9]. Prior to 2009, surveillance data came to the CDC via TIMS, which had a built-in data validation system that flagged logic errors to help ensure accurate data entry and reporting. These validation standards were retired with TIMS in 2010, although the CDC-developed eRVCT and NEDSS-base system retained validation rules similar to those in TIMS. Validation rules for state-developed and commercial schemes vary by jurisdiction. Furthermore, the cost of routine maintenance, updates, changes, and enhancements of state-developed and commercial reporting schemes now falls on state and local TB programs, which need information technology (IT) expertise to maintain and update these types of systems [9]. Modifications of state-developed and commercial reporting schemes, such as changes in RVCT data fields, have to be made by each individual reporting jurisdiction. Modifications to NTSS are therefore more complicated than they were prior to 2009, when the CDC could update a single system and provide all reporting jurisdictions with updated software that incorporated the revisions.
The objectives of this report are to describe the completeness and consistency of TB case data reported to the CDC from 2008 to 2012, to determine the extent to which the 2009 changes in RVCT and reporting schemes affected the data, and to find ways to improve data quality. Although the surveillance report and the reporting schemes described here are specific to TB, the analytical methods and results may be useful to managers of other public health programs who are contemplating similar changes in surveillance systems or reporting schemes.

Data sources
NTSS receives TB surveillance data electronically from the 50 states and the District of Columbia [6]. The reporting officials in TB programs collect laboratory and clinical TB data from a variety of sources and store them in electronic reporting systems. From 1998 to 2009, those officials submitted TB surveillance data through TIMS by using file-transfer protocol and controlled-access Internet and modem transfer [10]. Since 2009, TB surveillance data have been transmitted using Public Health Information Network Messaging Service software in HL7 messaging format.
The CDC provides preliminary TB surveillance datasets weekly so that reporting program officials can verify reported data, and it creates final TB surveillance datasets annually for reporting, research, and publications. Since 2009, TB data reported to the CDC have been subjected to a data-cleaning routine before a finalized dataset is created. The routine is applied to selected data fields using a hierarchical strategy determined by CDC staff (eg, a dependent field, such as the year of previous TB episode, is deleted if the independent field, such as history of previous TB, is not present). The resulting dataset has fewer inconsistencies but is not necessarily more accurate. Our analysis included only clean, finalized annual datasets.
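The hierarchical cleaning strategy described above can be illustrated with a minimal sketch. This is not the CDC routine: the field names (`previous_tb`, `previous_tb_year`), the rule table, and the reading of "not present" as "missing or not affirmative" are all assumptions made for illustration.

```python
# Hypothetical rule table (an assumption, not the CDC's actual rules):
# dependent field -> (independent field, value that keeps the dependent field)
RULES = {
    "previous_tb_year": ("previous_tb", "yes"),
}


def clean_record(record, rules=RULES):
    """Return a copy of the record with dependent fields blanked out
    whenever the independent field does not support them."""
    cleaned = dict(record)  # leave the reported record untouched
    for dependent, (independent, required_value) in rules.items():
        if cleaned.get(independent) != required_value:
            # The dependent value is deleted, not corrected, so the result
            # is more consistent but not necessarily more accurate.
            cleaned[dependent] = None
    return cleaned
```

Because a rule of this kind deletes the dependent value rather than questioning the independent one, a genuine value can be lost when the independent field itself was entered incorrectly.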

Analysis
We examined responses from NTSS data elements from 2008 to 2012 (the most recent year of data at the time of analysis) and new elements from 2009 to 2012. Although NTSS includes data from 1993 to 2012, the purpose of this study was to examine how the changes in data elements and reporting schemes affected the data; therefore, the study period begins the year before the changes occurred. New data elements from Alaska, California, Connecticut, Illinois, Missouri, Mississippi, North Carolina, North Dakota, New York City, and Ohio were not included for 2009 because these jurisdictions used TIMS that year and the new elements were not supported. In addition, we excluded California and Vermont from analyses that included HIV test results for 2008-2012 because HIV reporting practices were different for these jurisdictions.
Reporting jurisdictions were categorized according to the type of reporting scheme (TIMS, commercial, eRVCT, NEDSS-base, or state-developed) used in 2009 and 2010-2012. Because of the changes in both reporting schemes and RVCT in 2009, data from that year were examined separately from later years' data.
Data were categorized into subgroups and data elements associated with subgroups were assessed for completeness (eg, the percentage of patients dead at diagnosis who had a date of death reported) and consistency (eg, the percentage of patients alive at diagnosis who erroneously had a date of death reported). The results are presented for a subset of data elements that are clinically or demographically important or exhibited inconsistency or incompleteness in reporting. Furthermore, for each TB case we selected key data elements from 3 different categories: risk factors, clinical aspects of TB disease, and molecular aspects of TB disease.
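The two measures defined above can be written as simple proportions. The sketch below is illustrative only and assumes each case is a dictionary with hypothetical keys `status_at_dx` ("dead" or "alive") and `death_date`; these are not the actual RVCT variable names.

```python
def percent(numerator, denominator):
    """Percentage rounded to 1 decimal place, or None for an empty subgroup."""
    if denominator == 0:
        return None
    return round(100 * numerator / denominator, 1)


def death_date_completeness(cases):
    """Completeness: share of patients dead at diagnosis with a death date."""
    dead = [c for c in cases if c.get("status_at_dx") == "dead"]
    with_date = [c for c in dead if c.get("death_date")]
    return percent(len(with_date), len(dead))


def death_date_inconsistency(cases):
    """Consistency check: share of patients alive at diagnosis who
    nevertheless have a death date reported."""
    alive = [c for c in cases if c.get("status_at_dx") == "alive"]
    flagged = [c for c in alive if c.get("death_date")]
    return percent(len(flagged), len(alive))
```

Each measure is computed within its own subgroup, so the denominators differ: patients dead at diagnosis for completeness and patients alive at diagnosis for consistency.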

Results
From 2008 to 2012, 56,040 cases were reported to NTSS [6]. Each year, fewer than 1% of reported cases had missing or unknown information for origin of birth (nativity; 59/56,040) or race/ethnicity (197/56,040). One data element with inconsistent completeness was correctional facility status (residence in a correctional facility at the time of diagnosis), for which 6.5% of cases (746/11,520) had unknown or missing information in 2009, compared with approximately 1% or less of cases (265/44,529) in other years (Table 1). When correctional facility status was examined by reporting system (Table 2), information was missing for 17.1% (729/4266) of the cases reported by jurisdictions using TIMS in 2009, whereas less than 1% of cases (17/6871) reported through the other systems were missing this element. Among cases reported as residents of correctional facilities at the time of diagnosis, information on the type of correctional facility was missing for 9% of cases reported through state-developed reporting systems in both 2009 (10/110) and 2010-2012 (25/267), compared with less than 3% (17/1386) of cases reported through TIMS, commercial, NEDSS-base, and eRVCT reporting systems for those same years (Tables 2 and 3).
For cases reported as dead at TB diagnosis, 4.4% (7/160) were missing date of death in 2009, the first year date of death information was collected, and 3.6% (8/221) were missing it in 2012 (Table 4). In 2009, 48 of 7094 TB cases (0.7%) were reported as alive at diagnosis and had a date of death indicated (Table 5). A majority of these (83%, 40/48; Table 2) were reported through state-developed systems. From 2010 to 2012, 313 of 30,875 TB cases (1%) were reported as alive at diagnosis and had a date of death indicated (Table 5); all were reported through state-developed or commercial reporting systems (Table 3).
Nonpediatric cases with primary guardian information were predominantly reported through state-developed software systems in 2009 (Table 2) and 2010-2012 (Table 3).

Principal Findings
Considering the extent of changes the US TB Surveillance System underwent in 2009, TB surveillance data have maintained a high level of completeness, with most data elements showing the same levels of completeness after 2009. New data elements, for which collection and reporting began in 2009 for most reporting jurisdictions, have varied completeness but show an overall improvement from 2009 to 2012. Some new data elements are taking longer to reach a high percentage of completeness at the state and local levels, or are less complete or less concordant in 2012 than they were in 2009. For example, patients who were dead at the time of TB diagnosis should have had a corresponding date of death recorded (the date-of-death data element was introduced in 2009). However, some jurisdictions reported a date of death for patients who were alive at diagnosis, which occurred more frequently in 2012 than in 2009 (Table 5). There is no date of death field for a patient who is alive at TB diagnosis and dies during therapy; therefore, some reporting jurisdictions may be recording such deaths in the field intended for patients who were dead at the time of TB diagnosis. Among cases reported in 2009 that were alive at diagnosis and had a date of death recorded, 58% (28/48) had a date of death that matched the date therapy was stopped (data not shown), suggesting that the date of death field was used to record the date of death during therapy. Completeness may also have been affected by lack of information or inability to find information in patient records, misinterpretation of data element definitions, or use of a paper reporting form that does not match the electronic reporting data entry form [2]. For some jurisdictions, electronic reporting systems may not have been revised to accommodate reporting of certain data elements; therefore, those elements cannot be reported electronically.
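The comparison between date of death and the date therapy was stopped can be sketched as follows. This is an illustrative check rather than the analysis code used for this report, and the keys `status_at_dx`, `death_date`, and `therapy_stop_date` are hypothetical names.

```python
from datetime import date


def death_date_matches_stop_date(cases):
    """Among patients alive at diagnosis who have a death date, count how
    many death dates equal the date therapy was stopped.

    Returns (matches, flagged, percent); percent is None if nothing is flagged.
    """
    flagged = [
        c for c in cases
        if c.get("status_at_dx") == "alive" and c.get("death_date") is not None
    ]
    matches = sum(
        1 for c in flagged if c["death_date"] == c.get("therapy_stop_date")
    )
    pct = round(100 * matches / len(flagged), 1) if flagged else None
    return matches, len(flagged), pct
```

A high share of matching dates is compatible with the field being used to record deaths during therapy, but it does not by itself establish why the date was entered.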
Ongoing training of local staff to account for turnover and changes in duties may improve completeness of reporting [2].
The data-cleaning routine does not take into consideration all possible data errors. Information requested specifically for all TB patients less than 15 years of age was sometimes reported for cases 15 years of age or older (Tables 2, 3, and 5), and the date of death may have been indicated for patients who were alive at diagnosis (Tables 2, 3, and 5); these discrepancies are not corrected as part of data cleaning. Therefore, care is warranted when working with NTSS data for reporting or research purposes. Proper subsetting is needed to keep patients out of analyses to which they do not belong (eg, patients alive at diagnosis when analyzing date of death). These exclusions are not built into the dataset, and failing to apply them could produce erroneous results.
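A minimal sketch of the subsetting step described above, assuming hypothetical keys `status_at_dx` and `age_years`; the 15-year cutoff comes from the text, while the function names are invented for illustration.

```python
def dead_at_diagnosis(cases):
    """Cases eligible for date-of-death analyses."""
    return [c for c in cases if c.get("status_at_dx") == "dead"]


def pediatric(cases, age_cutoff=15):
    """Cases eligible for analyses of elements requested only for
    patients younger than the cutoff (eg, guardian information)."""
    return [
        c for c in cases
        if c.get("age_years") is not None and c["age_years"] < age_cutoff
    ]
```

Applying the filter before computing any percentage keeps out-of-scope records, such as a death date reported for a patient alive at diagnosis, out of both the numerator and the denominator.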
Differences in completeness of data reported through the different electronic systems may be due to system configuration or reporting practices within the jurisdictions. The high percentage of missing correctional facility information reported in 2009 (Table 1) was due to data transmission problems experienced by a single reporting jurisdiction. The information for residence in a correctional facility existed in TIMS but was not transferred from TIMS to the jurisdiction's new reporting system. Furthermore, jurisdictions using commercial and state-developed reporting systems are responsible for their own data validation, which could account for some higher percentages of missing or inaccurate data. TB case surveillance data do not allow for assessment of systems or reporting practices at the state and local level, so it was not possible to distinguish between factors related to systems and factors related to reporting practices in this analysis.
In 2009 there was an unexpected and significant decline in the numbers of TB cases reported to NTSS compared to previous years [11]. Changes to electronic reporting systems were not deemed to be a causal factor. Rather, we concluded that the decline in TB cases was a result of decreased TB diagnoses in the United States. Therefore, we did not consider the unexpected decline in TB cases in 2009 to be a factor in our study.

Limitations
This study has several limitations. Limited resources prevented us from conducting a validation study at the local level to compare patient data from medical charts to the data reported to NTSS. This would have been especially valuable to assess data elements that exhibited inconsistency. The data-cleaning routine replaced some validation rules that existed in TIMS but may not have improved the quality of data reported to the CDC. For example, from 2009 to 2012, 2 cases reported as not having initial susceptibility testing done were also reported as susceptible to both isoniazid and rifampin (data not shown), indicating that initial drug susceptibility testing may actually have been done. Because the cases were reported as not undergoing susceptibility testing, the susceptibility results were deleted for these cases during data cleaning and therefore are not reflected in the clean, finalized dataset. Isoniazid and rifampin are important drugs for treating TB and resistance to both defines multidrug-resistant TB. If susceptibility testing was indeed done for isoniazid and rifampin, then drug susceptibility testing should be reported as "done" on RVCT.

Conclusion
Several ongoing efforts have been implemented to improve the quality of surveillance reporting. The CDC initiated a series of trainings in 2010 with the goal of familiarizing state and local reporting jurisdictions with the updated RVCT and reporting requirements [12]. Additionally, in 2011, the CDC conducted a series of trainings on quality assurance of TB data [13]. The trainings culminated in a published manual that is available to reporting jurisdictions and others interested in attaining high-quality surveillance data [14]. A collection of reports showing various aspects of TB data reported to the CDC is available through NTSS to authorized state and local TB program staff. Information provided through NTSS reports includes the numbers of missing and unknown values associated with reported data elements, the frequency of reporting for select elements, when data were last transmitted to the CDC, and a list of elements with no information ever reported for a particular reporting area. State and local TB program staff can use these reports to identify and correct gaps in reported data or to report data errors to the CDC. The National Tuberculosis Indicators Project (NTIP) can also be used to verify and check TB surveillance data reported to the CDC [13]. Reporting jurisdictions can compare their records with NTIP data and use the NTIP to identify discrepancies. The RVCT has an accompanying manual that provides comprehensive reporting guidance for each data element [15]. Furthermore, the RVCT workgroup, composed of CDC and state and local TB program staff, actively pursues clarification and provides guidance on improving RVCT reporting. 
As state and local TB control programs are often challenged with declining resources and staff turnover, the CDC should periodically provide updated quality assurance and RVCT training webinars and materials to ensure that TB control program staff remain aware of data problem areas and new and existing quality assurance tools and techniques. These efforts, as well as ongoing discussions regarding data quality assurance, will improve the completeness and accuracy of TB surveillance data.
State and local communicable disease surveillance systems vary from disease-specific systems to systems used for reporting an array of diseases and conditions [9]. However, from 2007 to 2010, interoperability and integration of state and local public health disease surveillance systems increased substantially [9]. As public health programs begin to utilize current advances in electronic reporting and embrace new national guidelines related to health information exchange and meaningful use, more electronic surveillance systems will be modified to increase capacity and meet national standards [9,16]. The results of the NTSS transition from a single, stand-alone surveillance system to a variety of different reporting schemes illustrate that major modifications of disease surveillance systems can be done without substantial impact on the completeness of surveillance data.