HTAN Sequencing Data

HTAN supports multiple sequencing modalities including Single Cell and Single Nucleus RNA Seq (sc/snRNASeq), Single Cell ATAC Seq, Bulk RNA Seq and Bulk DNA Seq.

The HTAN standard for gene annotations is GENCODE Version 34. GENCODE is used for gene definitions by many consortia, including ENCODE, NCI Genomic Data Commons, Human Cell Atlas, and PCAWG (Pan-Cancer Analysis of Whole Genomes). Ensembl gene content is essentially identical to that of GENCODE (see the GENCODE FAQ), and interconversion is possible.
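As a rough illustration of that interconversion: GENCODE gene IDs are Ensembl stable IDs followed by a version suffix (and, for chromosome Y pseudoautosomal copies in releases such as v34, a `_PAR_Y` tag). One common approach, sketched below, is to drop everything after the first dot. The function name and the example IDs are hypothetical, and this is not an HTAN-prescribed procedure.

```python
def to_ensembl_stable_id(gencode_id: str) -> str:
    """Return the unversioned Ensembl-style ID for a GENCODE-style ID.

    Everything after the first "." is discarded, which removes both the
    version number and any trailing tag such as "_PAR_Y".
    """
    return gencode_id.split(".", 1)[0]


# Illustrative IDs only; they are not taken from HTAN data.
examples = ["ENSG00000000001.12", "ENSG00000000002.3_PAR_Y", "ENSG00000000003"]
stable_ids = [to_ensembl_stable_id(x) for x in examples]
```

Note that stripping the version discards information: two files annotated with different GENCODE releases may use the same stable ID for gene models that differ.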

HTAN has adopted the GENCODE 34 Gene Transfer Format (GTF) comprehensive gene annotation file (GENCODE 34 GTF) and two filtered files (GENCODE 34 GTF with genes only; GENCODE 34 GTF with genes only and retaining only the chromosome X copy of the pseudoautosomal region) for HTAN gene annotation. Note that HTAN also includes data generated with other gene models, as implementation of the standard is ongoing. Within HTAN metadata files, the reference genome used can be found in the attributes "Genomic Reference" and "Genomic Reference URL".
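To make the two filtered variants concrete, here is a minimal sketch of how a genes-only filter could work on GTF text. It is not the procedure HTAN used to produce its files. It assumes the standard nine tab-separated GTF columns (feature type in the third) and that chromosome Y pseudoautosomal copies carry a `_PAR_Y` tag in their attributes, as in GENCODE 34. The function name and the toy records are hypothetical.

```python
from typing import Iterable, Iterator


def filter_gtf(lines: Iterable[str], drop_par_y: bool = False) -> Iterator[str]:
    """Yield comment lines and 'gene' records from GTF text.

    With drop_par_y=True, gene records tagged "_PAR_Y" are skipped, so only
    the chromosome X copy of each pseudoautosomal gene remains.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # drop transcripts, exons, malformed lines, etc.
        if drop_par_y and "_PAR_Y" in fields[8]:
            continue
        yield line


# Toy records for illustration only (not real annotations).
toy = [
    "##description: toy example\n",
    'chrX\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG00000000001.1";\n',
    'chrX\tHAVANA\ttranscript\t100\t200\t.\t+\t.\tgene_id "ENSG00000000001.1";\n',
    'chrY\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG00000000001.1_PAR_Y";\n',
]
genes_only = list(filter_gtf(toy))
genes_only_x_par = list(filter_gtf(toy, drop_par_y=True))
```

For a real file, the same generator could be fed from `gzip.open(path, "rt")`, since GENCODE distributes its GTF files gzip-compressed.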

In alignment with The Cancer Genome Atlas and the NCI Genomic Data Commons, sequencing data are divided into four levels:

| Level | Definition | Example Data |
|---|---|---|
| 1 | Raw data | FASTQs, unaligned BAMs |
| 2 | Aligned primary data | Aligned BAMs |
| 3 | Derived biomolecular data | Gene expression matrix files, VCFs, etc. |
| 4 | Sample level summary data | t-SNE plot coordinates, etc. |
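Manifest names on this page, such as "scRNA-seq Level 2", end with one of these level numbers. The sketch below shows one way to split such a name into its assay and level and look up the level definition. The function and constant names are hypothetical, and the name pattern is an assumption based on the manifest names listed on this page.

```python
import re
from typing import Tuple

# Level definitions as given in the table above.
LEVEL_DEFINITIONS = {
    1: "Raw data",
    2: "Aligned primary data",
    3: "Derived biomolecular data",
    4: "Sample level summary data",
}

# Assumed pattern: "<assay> Level <1-4>", e.g. "Bulk DNA Level 3".
_MANIFEST_NAME = re.compile(r"^(?P<assay>.+) Level (?P<level>[1-4])$")


def describe_manifest(name: str) -> Tuple[str, int, str]:
    """Return (assay, level, level definition) for a manifest name."""
    match = _MANIFEST_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised manifest name: {name!r}")
    level = int(match.group("level"))
    return match.group("assay"), level, LEVEL_DEFINITIONS[level]


assay, level, definition = describe_manifest("scRNA-seq Level 2")
```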

Attributes

WARNING: Manifests provided on this page are for reference only. DO NOT USE THESE MANIFESTS FOR DATA SUBMISSION.

Directions

The interactive tables below are provided to help users understand the HTAN Data Model. The tables allow a user to view, search or download attributes either:

  1. in a specific manifest; or
  2. in all manifests represented on this page.

To view a specific manifest, click on the link in the Manifests tab. The manifest will appear in a new tab on the page. Navigate to the new tab to search for attributes or download the manifest.
To search for attributes among all manifests, navigate to the All Attributes tab and use the search box provided at the top of the tab. All attributes can also be downloaded as a CSV file.
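Once the attributes have been downloaded, the same kind of search can be done offline. The following is a minimal sketch that assumes the CSV column names match the table headers on this page ("Attribute", "Manifest Name", "Description", and so on); the actual export may differ. The function name and the sample rows are illustrative.

```python
import csv
import io
from typing import Dict, List, Sequence


def search_attributes(
    csv_text: str,
    query: str,
    columns: Sequence[str] = ("Attribute", "Description"),
) -> List[Dict[str, str]]:
    """Return rows whose chosen columns contain the query (case-insensitive)."""
    needle = query.lower()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if any(needle in (row.get(col) or "").lower() for col in columns)
    ]


# Two sample rows in the assumed layout.
SAMPLE = (
    "Attribute,Manifest Name,Description,Required\n"
    "Checksum,scRNA-seq Level 2,MD5 checksum of the BAM file,True\n"
    "UMI Tag,scRNA-seq Level 2,SAM tag for the UMI field,True\n"
)
umi_hits = search_attributes(SAMPLE, "umi")
bam_hits = search_attributes(SAMPLE, "bam")
```

For a file on disk, read its text with `open(path, newline="", encoding="utf-8").read()` before passing it in.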

| Manifest | Description |
|---|---|
| scRNA-seq Level 1 | Single-cell RNA-seq [EFO_0008913] |
| scRNA-seq Level 2 | Alignment workflows downstream of scRNA-seq Level 1 |
| scRNA-seq Level 3 | Gene and Isoform expression files |
| scRNA-seq Level 4 | Data represents the relationships between cells derived from Level 3 expression data and shown as tSNE or UMAP coordinates per cell, plus all other cell-specific meta information (e.g., cell type) |
| scDNA-seq Level 1 | Single-cell DNA-seq |
| scDNA-seq Level 2 | Alignment workflows downstream of scDNA-seq Level 1 |
| scATAC-seq Level 1 | scATAC-seq files containing sequence read information, with or without alignment, as FASTQ or BAM files |
| scATAC-seq Level 2 | scATAC-seq files containing aligned sequence data, as a BAM file |
| scATAC-seq Level 3 | Processed data files containing peak information for cells |
| scATAC-seq Level 4 | Data represents the relationships between cells derived from Level 3 expression data and shown as tSNE or UMAP coordinates per cell, plus all other cell-specific meta information (e.g., cell type) |
| scmC-seq Level 1 | Files contain raw scmC-seq data. |
| scmC-seq Level 2 | Files contain scmC-seq files containing aligned sequence data, as a BAM file. |
| Bulk DNA Level 1 | Bulk Whole Exome Sequencing raw files |
| Bulk DNA Level 2 | Bulk Whole Exome Sequencing aligned files and QC |
| Bulk DNA Level 3 | Bulk Whole Exome Sequencing called variants |
| Bulk Methylation-seq Level 1 | Raw data for bulk methylation sequencing, such as FASTQs and unaligned BAMs |
| Bulk Methylation-seq Level 2 | Aligned primary data for bulk methylation sequencing, such as gene expression matrix files, VCFs, etc. |
| Bulk Methylation-seq Level 3 | Sample level summary data for bulk methylation sequencing, such as t-SNE plot coordinates, etc. |
| Bulk RNA-seq Level 1 | Bulk RNA-seq [EFO_0003738] |
| Bulk RNA-seq Level 2 | Bulk RNA-seq alignment protocol description |
| Bulk RNA-seq Level 3 | Bulk RNA-seq gene expression matrices |
| HI-C-seq Level 1 | Unaligned sequence data |
| HI-C-seq Level 2 | Aligned read pairs, contact matrix |
| HI-C-seq Level 3 | Summary data for the HI-C-seq assay. |
Each attribute entry below lists, in order: Attribute, Manifest Name, Description, Required, Conditional If (when present), Data Type, and Valid Values (when present).
Filename
- scRNA-seq Level 1
- scRNA-seq Level 2
- scRNA-seq Level 3
- scRNA-seq Level 4
- scDNA-seq Level 1
- scDNA-seq Level 2
- scATAC-seq Level 1
- scATAC-seq Level 2
- scATAC-seq Level 3
- scATAC-seq Level 4
- scmC-seq Level 1
- scmC-seq Level 2
- Bulk DNA Level 1
- Bulk DNA Level 2
- Bulk DNA Level 3
- Bulk Methylation-seq Level 1
- Bulk Methylation-seq Level 2
- Bulk Methylation-seq Level 3
- Bulk RNA-seq Level 1
- Bulk RNA-seq Level 2
- Bulk RNA-seq Level 3
- HI-C-seq Level 1
- HI-C-seq Level 2
- HI-C-seq Level 3
Name of a file
True
String
File Format
- scRNA-seq Level 1
- scRNA-seq Level 2
- scRNA-seq Level 3
- scRNA-seq Level 4
- scDNA-seq Level 1
- scDNA-seq Level 2
- scATAC-seq Level 1
- scATAC-seq Level 2
- scATAC-seq Level 3
- scATAC-seq Level 4
- scmC-seq Level 1
- scmC-seq Level 2
- Bulk DNA Level 1
- Bulk DNA Level 2
- Bulk DNA Level 3
- Bulk Methylation-seq Level 1
- Bulk Methylation-seq Level 2
- Bulk Methylation-seq Level 3
- Bulk RNA-seq Level 1
- Bulk RNA-seq Level 2
- Bulk RNA-seq Level 3
- HI-C-seq Level 1
- HI-C-seq Level 2
- HI-C-seq Level 3
Format of a file (e.g. txt, csv, fastq, bam)
True
String
- hdf5
- bedgraph
- idx
- idat
- bam
- bai
- excel
- powerpoint
- tif
- tiff
- ome-tiff
- png
- doc
- pdf
- fasta
- fastq
- sam
- vcf
- bcf
- maf
- bed
- chp
- cel
- sif
- tsv
- csv
- txt
- plink
- bigwig
- wiggle
- gct
- bgzip
- zip
- seg
- html
- mov
- hyperlink
- svs
- md
- flagstat
- gtf
- raw
- msf
- rmd
- bed narrowpeak
- bed broadpeak
- bed gappedpeak
- avi
- pzfx
- fig
- xml
- tar
- r script
- abf
- bpm
- dat
- jpg
- locs
- sentrix descriptor file
- python script
- sav
- gzip
- sdf
- rdata
- hic
- ab1
- 7z
- gff3
- json
- sqlite
- svg
- sra
- recal
- tranches
- mtx
- tagalign
- dup
- dicom
- czi
- mex
- cloupe
- am
- cell am
- mpg
- m
- mzml
- scn
- dcc
- rcc
- pkc
- sf
- bedpe
HTAN Parent Biospecimen ID
- scRNA-seq Level 1
- scDNA-seq Level 1
- scATAC-seq Level 1
- scmC-seq Level 1
- Bulk DNA Level 1
- Bulk Methylation-seq Level 1
- Bulk RNA-seq Level 1
- HI-C-seq Level 1
- scRNA-seq Level 2
HTAN Biospecimen Identifier (e.g. HTANx_yyy_zzz) indicating the biospecimen(s) from which these files were derived; multiple parent biospecimens should be comma-separated
True
- Is lowest level is "Yes - Is lowest level"
String
HTAN Data File ID
- scRNA-seq Level 1
- scRNA-seq Level 2
- scRNA-seq Level 3
- scRNA-seq Level 4
- scDNA-seq Level 1
- scDNA-seq Level 2
- scATAC-seq Level 1
- scATAC-seq Level 2
- scATAC-seq Level 3
- scATAC-seq Level 4
- scmC-seq Level 1
- scmC-seq Level 2
- Bulk DNA Level 1
- Bulk DNA Level 2
- Bulk DNA Level 3
- Bulk Methylation-seq Level 1
- Bulk Methylation-seq Level 2
- Bulk Methylation-seq Level 3
- Bulk RNA-seq Level 1
- Bulk RNA-seq Level 2
- Bulk RNA-seq Level 3
- HI-C-seq Level 1
- HI-C-seq Level 2
- HI-C-seq Level 3
Self-identifier for this data file: the HTAN ID of this file, as defined in the HTAN ID SOP (e.g. HTANx_yyy_zzz)
True
String
Nucleic Acid Source
- scRNA-seq Level 1
- scDNA-seq Level 1
- scATAC-seq Level 1
- scmC-seq Level 1
- Bulk Methylation-seq Level 1
- Bulk RNA-seq Level 1
- HI-C-seq Level 1
The source of the input nucleic acid molecule
True
String
- single cell
- bulk whole cell
- single nucleus
- bulk nuclei
- micro-region
Cryopreserved Cells in Sample
- scRNA-seq Level 1
Indicate if library preparation was based on revived frozen cells.
True
String
- yes
- no
Single Cell Isolation Method
- scRNA-seq Level 1
- scATAC-seq Level 1
- scmC-seq Level 1
The method by which cells are isolated into individual reaction containers at a single cell resolution (e.g. wells, micro-droplets)
True
String
- microfluidics chip
- droplets
- facs
- plates
- 10x
- nuclei isolation
Dissociation Method
- scRNA-seq Level 1
- scATAC-seq Level 1
The tissue dissociation method used for scRNA-seq or scATAC-seq assays
True
String
- gentlemacs
- dounce
- enzymatic digestion
- not applicable
Library Construction Method
- scRNA-seq Level 1
- scATAC-seq Level 1
Process which results in the creation of a library from fragments of DNA using cloning vectors or oligonucleotides with the role of adaptors [OBI_0000711]
True
String
- smart-seq2
- smart-seqv4
- 10xv1.0
- 10xv1.1
- 10xv2
- 10xv3
- 10xv3.1
- cel-seq2
- drop-seq
- indropsv2
- indropsv3
- trudrop
- sci-atac-seq
- nextera xt
- 10x multiome
- 10x flex
- 10x gem 3'
- 10x gem 5'
Read Indicator
- scRNA-seq Level 1
- Bulk DNA Level 1
- Bulk RNA-seq Level 1
Indicate if this is Read 1 (R1), Read 2 (R2), Index Reads (I1), or Other
True
String
- r1
- r2
- r1&r2
- i1
- other
Read1
- scRNA-seq Level 1
Read 1 content description
True
String
- cell barcode and umi
- cdna
Read2
- scRNA-seq Level 1
Read 2 content description
True
String
- cell barcode and umi
- cdna
End Bias
- scRNA-seq Level 1
The end of the cDNA molecule that is preferentially sequenced, e.g. the 3 prime or 5 prime tag/end, or the full length transcript
True
String
- 3 prime
- 5 prime
- full length transcript
Reverse Transcription Primer
- scRNA-seq Level 1
An oligo to which new deoxyribonucleotides can be added by DNA polymerase [SO_0000112]. The type of primer used for reverse transcription, e.g. oligo-dT or random primer. This allows users to identify the content of the cDNA library input, e.g. enriched for mRNA
True
String
- oligo-dt
- poly-dt
- feature barcoding
- random
Spike In
- scRNA-seq Level 1
- Bulk RNA-seq Level 1
A set of synthetic RNA molecules of known sequence that are added to the cell lysis mix
True
String
- ercc
- other spike in
- no spike in
- phix
Sequencing Platform
- scRNA-seq Level 1
- scATAC-seq Level 1
- scmC-seq Level 1
- Bulk DNA Level 1
- Bulk Methylation-seq Level 1
- Bulk RNA-seq Level 1
- HI-C-seq Level 1
A platform is an object aggregate that is the set of instruments and software needed to perform a process [OBI_0000050]. Specific model of the sequencing instrument.
True
String
- illumina next seq 500
- illumina next seq 550
- illumina next seq 2500
- illumina novaseq 6000
- illumina miseq
- 454 gs flx titanium
- ab solid 4
- ab solid 2
- ab solid 3
- complete genomics
- illumina hiseq x ten
- illumina hiseq x five
- illumina genome analyzer ii
- illumina genome analyzer iix
- illumina hiseq 2000
- illumina hiseq 2500
- illumina hiseq 4000
- illumina nextseq
- ion torrent pgm
- ion torrent proton
- ion torrent s5
- pacbio rs
- novaseq 6000
- novaseqs4
- ultima genomics ug100
- oxford nanopore minion
- gridion
- promethion
- pacbio sequel2
- revio
- illumina nextseq 1000
- illumina nextseq 2000
- other
- unknown
- not reported
Total Number of Input Cells
- scRNA-seq Level 1
Number of cells loaded/placed on plates
True
String
Input Cells and Nuclei
- scRNA-seq Level 1
Number of cells and number of nuclei input; entry format: number, number
True
String
Library Preparation Days from Index
- scRNA-seq Level 1
- Bulk DNA Level 1
- Bulk RNA-seq Level 1
Number of days between when the sample for the assay was received in the lab and when the libraries were prepared for sequencing [number]. If not applicable, enter 'Not Applicable'
False
String
Single Cell Dissociation Days from Index
- scRNA-seq Level 1
Number of days between when the sample for the single cell assay was received in the lab and when the sample was dissociated and cells were isolated [number]. If not applicable, enter 'Not Applicable'
True
String
Sequencing Library Construction Days from Index
- scRNA-seq Level 1
Number of days between when the sample for the assay was received in the lab and the day of sequencing library construction [number]. If not applicable, enter 'Not Applicable'
True
String
Nucleic Acid Capture Days from Index
- scRNA-seq Level 1
Number of days between when the sample for the single cell assay was received in the lab and the day of the nucleic acid capture step of library construction [number]. If not applicable, enter 'Not Applicable'
True
String
Technical Replicate Group
- scRNA-seq Level 1
- scATAC-seq Level 1
- scmC-seq Level 1
- HI-C-seq Level 1
A common term for all files belonging to the same cell or library. Provide a numbering of each library prep batch (can differ from encapsulation and sequencing batch)
False
String
Empty Well Barcode
- scRNA-seq Level 1
Unique cell barcode assigned to empty cells used as controls in CEL-seq2 assays.
True
- Library Construction Method is "CEL-seq2"
String
Well Index
- scRNA-seq Level 1
Indicate if protein expression (EPCAM/CD45) positive/negative data is available for each cell in CEL-seq2 assays
False
- Library Construction Method is "CEL-seq2"
String
- yes
- no
Feature Reference Id
- scRNA-seq Level 1
Unique ID for this feature. Must not contain whitespace, quote or comma characters. Each ID must be unique and must not collide with a gene identifier from the transcriptome [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis#feature-ref]
True
- Reverse Transcription Primer is "Feature barcoding"
String
UMI Barcode Offset
- scRNA-seq Level 1
Start position of UMI barcode in the sequence. Values: number, 0 for start of read
True
- Spatial Read2 is "Spatial Barcode and UMI"
String
UMI Barcode Length
- scRNA-seq Level 1
Length of UMI barcode read (in bp): number
True
- Spatial Read2 is "Spatial Barcode and UMI"
String
Cell Barcode and UMI
- scRNA-seq Level 1
- scmC-seq Level 1
Cell and transcript identifiers
False
String
Median UMIs per Cell Number
- scRNA-seq Level 1
Number
True
- Read2 is "Cell Barcode and UMI"
String
Cell Barcode Offset
- scRNA-seq Level 1
Offset in sequence for cell barcode read (in bp): number
True
- Read2 is "Cell Barcode and UMI"
String
Cell Barcode Length
- scRNA-seq Level 1
Length of cell barcode read (in bp): number
True
- Read2 is "Cell Barcode and UMI"
String
Valid Barcodes Cell Number
- scRNA-seq Level 1
Number
True
- Read2 is "Cell Barcode and UMI"
String
CEL-seq2
- scRNA-seq Level 1
Highly-multiplexed plate-based single-cell RNA-Seq assay
False
String
Feature barcoding
- scRNA-seq Level 1
A method for adding extra channels of information to cells by running single-cell gene expression in parallel with other assays [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/feature-bc]
False
String
HTAN Parent Data File ID
- scRNA-seq Level 2
- scRNA-seq Level 3
- scRNA-seq Level 4
- scDNA-seq Level 2
- scATAC-seq Level 2
- scATAC-seq Level 3
- scATAC-seq Level 4
- scmC-seq Level 2
- Bulk DNA Level 2
- Bulk DNA Level 3
- Bulk Methylation-seq Level 2
- Bulk Methylation-seq Level 3
- Bulk RNA-seq Level 2
- Bulk RNA-seq Level 3
- HI-C-seq Level 2
- HI-C-seq Level 3
HTAN Data File Identifier indicating the file(s) from which these files were derived
True
String
scRNAseq Workflow Type
- scRNA-seq Level 2
- scRNA-seq Level 3
- scRNA-seq Level 4
Generic name for the workflow used to analyze a data set.
True
String
- cellranger
- starsolo
- hca optimus
- dropest
- seqc
- cufflinks
- dexseq
- htseq - fpkm
- cell annotation
- differentiation trajectory analysis
- other
Workflow Version
- scRNA-seq Level 2
- scRNA-seq Level 3
- scRNA-seq Level 4
- scATAC-seq Level 4
Major version of the workflow (e.g. Cell Ranger v3.1)
True
String
scRNAseq Workflow Parameters Description
- scRNA-seq Level 2
- scRNA-seq Level 3
- scRNA-seq Level 4
Parameters used to run the workflow. scRNA-seq Level 3: e.g. normalization and log transformation, ran empty drops or doublet detection, used filter on number of genes per cell, etc. scRNA-seq Level 4: e.g. dimensionality reduction with PCA and 50 components, nearest-neighbor graph with k = 20 and Leiden clustering with resolution = 1, UMAP visualization using 50 PCA components, marker genes used to annotate cell types, and information about the conversion from droplet matrix (all barcodes) to cell matrix (only informative barcodes representing real cells)
True
String
Genomic Reference
- scRNA-seq Level 2
- scDNA-seq Level 2
- scATAC-seq Level 2
- scmC-seq Level 2
- Bulk DNA Level 2
- Bulk DNA Level 3
- Bulk RNA-seq Level 2
- HI-C-seq Level 1
- HI-C-seq Level 2
- HI-C-seq Level 3
- Bulk RNA-seq Level 3
Exact version of the human genome reference used in the alignment of reads (e.g. GCF_000001405.39)
True
- Pseudo Alignment Used is "Yes - Pseudo Alignment Used"
String
Genomic Reference URL
- scRNA-seq Level 2
- scDNA-seq Level 2
- scATAC-seq Level 2
- scmC-seq Level 2
- Bulk DNA Level 2
- Bulk DNA Level 3
- Bulk Methylation-seq Level 2
- Bulk RNA-seq Level 2
- Bulk RNA-seq Level 3
Link to human genome sequence (e.g. ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.primary_assembly.genome.fa.gz)
True
- Pseudo Alignment Used is "Yes - Pseudo Alignment Used"
String
Genome Annotation URL
- scRNA-seq Level 2
Link to the human genome annotation (GTF) file (e.g. ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.annotation.gtf.gz)
True
String
Checksum
- scRNA-seq Level 2
MD5 checksum of the BAM file
True
String
Cell Barcode Tag
- scRNA-seq Level 2
SAM tag for cell barcode field; please provide a valid cell barcode tag (e.g. CB:Z)
True
String
UMI Tag
- scRNA-seq Level 2
SAM tag for the UMI field; please provide a valid UMI tag (e.g. UB:Z or UR:Z)
True
String
Applied Hard Trimming
- scRNA-seq Level 2
Whether hard trimming was applied
True
String
- yes - applied hard trimming
- no
Is lowest level
- scRNA-seq Level 2
- Bulk DNA Level 2
- Bulk RNA-seq Level 2
Denotes that the manifest represents the lowest data level submitted. Use when Level 1 data is missing
False
String
- yes - is lowest level
- no
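The attribute entries above combine three kinds of rules: controlled vocabularies (Valid Values), conditional requirements (Conditional If), and free-text formats described in prose. The sketch below shows how a few of them might be checked on a single metadata row represented as a dictionary. It is an illustration only, not HTAN's validation tooling. The rule tables cover a small subset copied from this page. Case-insensitive matching, and reading "number, number" as two non-negative integers, are assumptions.

```python
import re
from typing import Dict, List

# Subset of the controlled vocabularies listed above.
VALID_VALUES = {
    "Read Indicator": {"r1", "r2", "r1&r2", "i1", "other"},
    "End Bias": {"3 prime", "5 prime", "full length transcript"},
    "Spike In": {"ercc", "other spike in", "no spike in", "phix"},
}

# (conditionally required attribute, controlling attribute, triggering value)
CONDITIONAL_RULES = [
    ("Empty Well Barcode", "Library Construction Method", "cel-seq2"),
    ("Feature Reference Id", "Reverse Transcription Primer", "feature barcoding"),
    ("Cell Barcode Length", "Read2", "cell barcode and umi"),
]

# Assumed reading of "entry format: number, number".
CELLS_AND_NUCLEI = re.compile(r"^\s*\d+\s*,\s*\d+\s*$")


def validate_row(row: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems found in one metadata row."""
    errors = []
    for attr, allowed in VALID_VALUES.items():
        value = row.get(attr, "").strip().lower()
        if value and value not in allowed:
            errors.append(f"{attr}: unexpected value {row[attr]!r}")
    for required, controller, trigger in CONDITIONAL_RULES:
        triggered = row.get(controller, "").strip().lower() == trigger
        if triggered and not row.get(required, "").strip():
            errors.append(f"{required}: required when {controller} is {trigger!r}")
    counts = row.get("Input Cells and Nuclei", "").strip()
    if counts and not CELLS_AND_NUCLEI.match(counts):
        errors.append("Input Cells and Nuclei: expected 'number, number'")
    return errors


good_row = {
    "Read Indicator": "R1",
    "Library Construction Method": "10xv3",
    "Input Cells and Nuclei": "5000, 0",
}
bad_row = {
    "Read Indicator": "R3",
    "Library Construction Method": "CEL-seq2",
    "Input Cells and Nuclei": "5000",
}
good_errors = validate_row(good_row)
bad_errors = validate_row(bad_row)
```

Keeping the rules in plain data structures, separate from the checking logic, means they can be extended or regenerated from a downloaded attribute list without changing the function.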