Introduction

Clinical data is among the most important metadata that can accompany cancer genomic data. The HTAN atlases collect HIPAA-limited clinical data, from which all Protected Health Information (PHI) of the participant is removed before the data is submitted to the DCC. The removed PHI includes participant names, geographical identifiers smaller than a state, contact information, medical record numbers, and all elements of dates (other than year) related to the participant, including birthdate, date of death, and encounter dates.
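As an illustration of the PHI rules above, the following is a minimal sketch of a de-identification step, not HTAN's actual pipeline. The field names (`name`, `zip_code`, `birth_date`, and so on) and the function `strip_phi` are hypothetical and chosen only for the example.

```python
from datetime import date

# Hypothetical field names for illustration; real HTAN attributes may differ.
DIRECT_IDENTIFIERS = {"name", "zip_code", "email", "medical_record_number"}
DATE_FIELDS = {"birth_date", "death_date", "encounter_date"}


def strip_phi(record: dict) -> dict:
    """Return a copy of record with identifiers dropped and dates cut to the year."""
    cleaned = {}
    for key, value in record.items():
        if key in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if key in DATE_FIELDS:
            # Keep only the year; anything that is not a date becomes None
            # so that an unparsed date string cannot leak through.
            year = value.year if isinstance(value, date) else None
            cleaned[key.replace("_date", "_year")] = year
        else:
            cleaned[key] = value
    return cleaned


raw = {
    "name": "Example Participant",
    "state": "MA",
    "medical_record_number": "12345",
    "birth_date": date(1960, 5, 17),
    "vital_status": "Alive",
}
deidentified = strip_phi(raw)
```

In this sketch the state is kept because only geographical identifiers smaller than a state count as PHI, and the birthdate survives only as a year.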

Tiered Approach for Clinical Data Collection

The HTAN Clinical Data Collection Model consists of a four-tiered approach for collecting clinical data across the HTAN centers. The clinical metadata collected in each tier will enable downstream analysis and enhance data discovery.

  • Tier 1 – The base tier, based on the Genomic Data Commons (GDC) clinical data model.
  • Tier 2 – Disease-agnostic extensions to the GDC.
  • Tier 3 – Disease-specific extensions to the GDC.
  • Tier 4 – Atlas-specific clinical data elements that a center recognizes as a requirement and that are not captured in Tiers 1 to 3.
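The four tiers can be sketched as a small lookup, assuming each data element is assigned to exactly one tier. The element names and the `tier_for` helper below are hypothetical, and treating every unlisted element as Tier 4 is a simplification made for the example.

```python
from enum import IntEnum


class Tier(IntEnum):
    GDC_BASE = 1          # GDC clinical data model
    DISEASE_AGNOSTIC = 2  # disease-agnostic extensions to the GDC
    DISEASE_SPECIFIC = 3  # disease-specific extensions to the GDC
    ATLAS_SPECIFIC = 4    # atlas-specific elements not captured above


# Hypothetical assignments for illustration only.
KNOWN_TIERS = {
    "example_gdc_element": Tier.GDC_BASE,
    "example_agnostic_element": Tier.DISEASE_AGNOSTIC,
    "example_disease_element": Tier.DISEASE_SPECIFIC,
}


def tier_for(element: str) -> Tier:
    """Elements not captured in Tiers 1 to 3 fall through to Tier 4."""
    return KNOWN_TIERS.get(element, Tier.ATLAS_SPECIFIC)
```

For example, `tier_for("example_gdc_element")` returns `Tier.GDC_BASE`, while an element that appears in none of the first three tiers resolves to `Tier.ATLAS_SPECIFIC`.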

Tier 1 Clinical Data Model and Metadata Attributes

Tier 1 data collection aims to provide as much clinical data as possible across the clinical categories of the GDC, which is supported and maintained by the NCI. The GDC clinical data model currently covers seven categories of data: Demographics, Diagnosis, Exposure, Treatment, Follow-up, Molecular Test, and Family History.

For Tier 1 GDC attributes, the approach is to collect as many optional data elements as possible, while the data elements that the GDC defines as required or preferred (GDC version December 2019) are mandatory across all HTAN centers. This provides a baseline of semantic information that is consistent across all centers and tumor types.
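A minimal sketch of how a center might check a record against this rule follows. The attribute sets are placeholders (the authoritative required and preferred lists come from the GDC data dictionary), and `check_record` is a hypothetical helper rather than part of any HTAN tool.

```python
# Placeholder attribute sets; the real lists come from the GDC data dictionary.
REQUIRED = {"gender", "race", "ethnicity"}
OPTIONAL = {"year_of_birth", "cause_of_death"}


def check_record(record: dict) -> tuple[list[str], float]:
    """Return (missing mandatory attributes, fraction of optional attributes filled)."""

    def filled(attr: str) -> bool:
        # Treat absent, None, and empty-string values as not provided.
        return record.get(attr) not in (None, "")

    missing = sorted(a for a in REQUIRED if not filled(a))
    optional_coverage = sum(filled(a) for a in OPTIONAL) / len(OPTIONAL)
    return missing, optional_coverage


missing, coverage = check_record(
    {"gender": "female", "race": "", "year_of_birth": 1960}
)
```

Here a non-empty `missing` list would block submission, while `coverage` only reports how much optional data was supplied, which mirrors the distinction between mandatory and collect-as-much-as-possible elements.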

Tier 1 Clinical Data Categories

Tier 1 clinical data elements follow the GDC clinical data model, which groups them into seven categories. A brief description of each category follows (per GDC):

Demographics – Data for the characterization of the patient by means of segmenting the population (e.g., characterization by age, sex, or race).

Diagnosis – Data from investigation, analysis and recognition of the presence and nature of disease, condition or injury from expressed signs and symptoms; data from scientific investigation of any kind and results of investigations.

Exposure – Clinically relevant patient information not immediately resulting from genetic predispositions.

Family History – Record of a patient’s background regarding cancer events of blood relatives.

Follow-Up – Information about the health status of an individual obtained before and after a study has closed; an activity that continues something that has already begun or that repeats something that has already been done. A clinical encounter that includes planned and unplanned trial interventions, procedures, and assessments that may be performed on a subject (includes a visit by a participant to a medical professional that has a start and an end, each described with a rule).

Molecular Test – Information pertaining to any molecular tests performed on the patient during a clinical event. This category captures data related to any molecular tests that are completed clinically. For example, if a diagnostic array was completed before research sequencing was done (data from the submitting study), the data related to such independent clinical tests of the participant are captured in this category.

Treatment – Record of the administration and intention of therapeutic agents provided to a patient to alter the course of a pathologic process.