NGS Data Delivery & Storage Policy
NGS data is “Big Data”. Every sequencing run generates hundreds of gigabytes of data. Depending on the services you request from the GSF, you will receive anywhere from several to hundreds of GB for each service project. It is your responsibility to download your data, keep it safe, and back it up.
The GSF delivers demultiplexed, gzip-compressed FASTQ files to users. These files contain only the reads from clusters passing the Illumina quality filter. By default, we do not trim the sequencing data. FASTQ is a text-based sequence file format, generated from the sequencer's output, that stores raw sequence data and the corresponding quality scores. FASTQ files have become the standard format for storing NGS data from Illumina sequencing systems and can be used as input for various secondary data analysis solutions.
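As a quick sanity check after download, a delivered file can be read directly without decompressing it to disk. The following is a minimal sketch, not a GSF-provided tool. It assumes the usual four-line FASTQ record layout (header, sequence, separator, quality), and the function names `read_fastq` and `count_reads` are illustrative.

```python
import gzip
from itertools import islice


def read_fastq(path):
    """Yield (read_id, sequence, quality) tuples from a gzip-compressed FASTQ file.

    Assumes each record occupies exactly four lines, which is the
    common layout for Illumina FASTQ files.
    """
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if not record:
                break  # clean end of file
            if len(record) < 4:
                raise ValueError("truncated FASTQ record at end of file")
            header, seq, plus, qual = (line.rstrip("\n") for line in record)
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            if len(seq) != len(qual):
                raise ValueError(f"sequence/quality length mismatch: {header!r}")
            yield header[1:], seq, qual


def count_reads(path):
    """Return the number of reads in a gzip-compressed FASTQ file."""
    return sum(1 for _ in read_fastq(path))
```

For example, calling `count_reads("Sample_1_R1.fastq.gz")` (a hypothetical file name) returns the number of reads in that file, which can be compared against the read count reported for the sample.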
FASTQ sequencing files provided by the Genome Sequencing Facility will be stored on the GSF’s server for one year. The GSF uses an SFTP or Dropbox account for sequencing data delivery, and the files will generally be available in the user’s account for three months. We recommend that investigators download and archive their sequencing results as soon as they receive their data link.
The GSF delivers the following two items with your NGS data download:
- MD5 checksum – you can use this file to verify the integrity of your download
- FASTQ files – generally delivered as a single zip or tar archive containing the sequencing data
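The checksum verification mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of the GSF's own tooling. It assumes the checksum file uses the common `md5sum` layout (one `<hash>  <filename>` pair per line) and that the listed files sit in the same directory as the checksum file. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return {filename: True/False} for every entry in an md5sum-style file.

    Assumption: each non-empty line looks like '<md5 hash>  <filename>',
    with file names relative to the checksum file's directory.
    """
    checksum_file = Path(checksum_file)
    results = {}
    for line in checksum_file.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        actual = md5_of_file(checksum_file.parent / name)
        results[name] = actual == expected.lower()
    return results
```

Any file reported as `False` did not match its recorded checksum and should be downloaded again before the delivery window closes.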
For a single-cell analysis project with 10x Genomics, the GSF delivers the following two outputs:
- Cellranger count output:
  - We run cellranger count on all single-cell gene expression samples. Inside the top directory of your download is a directory for each sample, by name, that contains the results from the count step of the cellranger pipeline. The web_summary.html file is likely what you want to look at first, followed by the cloupe file. Cloupe files can be opened in the Loupe Browser supplied by 10x. You will find these files at the following path: [run_id]/Sample_[name]/outs
- FASTQ files:
  - We use bcl2fastq2 to demultiplex all sequencing data. You will notice that each sample has four FASTQ directories associated with it (one for each of the four barcodes in the 10x barcode set for that sample), named Sample_[number]_[letter].
Please contact us if you would like assistance interpreting the results produced by cellranger. We
will do our best to answer any questions, or we can guide you toward assistance and resources
provided by 10x.
For NGS projects requested with bioinformatics analysis, please contact the bioinformatics director, Dr.
Yidong Chen, at cheny8@uthscsa.edu regarding the format and delivery of analyzed data.