Trim fastq read records down to sgRNA sequences.
Author: Chet Birger;Broad Institute
Contact: birger@broadinstitute.org
Algorithm Version:
The CRISPR suite of GenePattern modules supports the computational processing of the data sets generated by CRISPR genome-scale functional screens.
In these screens, cells are transduced with a library of lentiCRISPR vectors, each vector carrying the DNA sequence for a particular sgRNA, which guides the Cas9 nuclease to a specific genomic location. The Cas9:sgRNA complex generates a double-strand break (DSB) at the targeted locus, and the cell's error-prone DSB repair mechanisms lead to a frame-shift indel and a resulting loss-of-function mutation. Puromycin selection eliminates uninfected cells from the population. Following selection, DNA is extracted from the cell culture. The lentiCRISPR constructs integrated into the infected cells' DNA are then amplified using PCR, and next-generation sequencers produce FASTQ files whose read records contain the read sequences associated with the transduced lentiCRISPR constructs. Through analysis of the read data, researchers can evaluate the representation of each sgRNA in the sequencing library, identifying selectively depleted or surviving sgRNAs in loss- or gain-of-function screens.
Profiles of sgRNA depletion or survival can be obtained with the following computational workflow:
We provide the following CRISPR GenePattern modules to support the above workflow:
GenePattern supports several short read aligners. At the time of writing this documentation, GenePattern modules were available for BWA, Bowtie1, and Bowtie2. Any of these aligner modules may be used in step 3 above. Each aligner has its own companion indexer module, required to generate an index of the reference FASTA to which the trimmed reads will be aligned.
In paired sgRNA CRISPR screens, two sgRNA sequences are positioned at opposite ends of the lentiCRISPR vector:
(5')<Vector subsequence>-<First sgRNA>-<Vector subsequence>-<Second sgRNA>-<Vector subsequence>(3')
TATATATCTTGTGGAAAGGACGAAACACCACCGGNNNNNNNNNNNNNNNNNNNNGTTTTAGAGCTAGAAATAGCAAGTTAA
AATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTGaattCAATTGCTGCAGGAACTACTA
TTTCCCATGATTCCTTCATATTTGCATTAGGTGGTTTATTGGCAAAACTCAAGCAAGATACCTGGCATATGTTAGGACTGC
TTTGTTTTTTCCTTATCTTGGGAAAGCATTCAACCTTTTATCATACATCCTTTCTGCGTATTCCTTTCTGTTCTTTAAAAA
TGTTAAACCATGCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATGGTGCTTGGTCTACATcaccNNNNN
NNNNNNNNNNNNNNNGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACC
GAGTCGGTGCTTTTTTAAGCTTGGCGTAACTAGATCTTGAG
Paired-end sequencing reads the fragment, which in this case is the lentiCRISPR vector, from both the 5' and 3' ends, producing forward and reverse reads. The forward reads will contain the 5' (leftmost) sgRNA sequence; the reverse reads will contain the 3' (rightmost) sgRNA sequence. When quantifying the alignments, the pairing of sgRNA sequences in each paired-end read record is tracked, and summarized counts are reported for each ordered pairing of sgRNA reference sequences.
The CRISPR.sgRNA_read_trimmer module trims a FASTQ file's read records, removing the vector subsequences upstream and downstream from the sgRNA and retaining the transduced vector's sgRNA sequence alone.
The single sgRNA lentiCRISPR vectors are designed such that a constant-valued prefix sequence is positioned just upstream of the sgRNA sequence. This prefix can be used to identify the start of the sgRNA sequence. The trimming algorithm first searches for the presence of the known prefix (specified by the prefix parameter). This prefix is typically 5-8 bp in length. The prefix search algorithm can tolerate a configurable number of mismatches (see prefix mismatches parameter). The shorter the prefix, the less tolerant one should be of mismatches. Once the prefix is located, the read record is trimmed to contain only the sgRNA sequence read, removing all base reads up to and including the prefix, and removing all base reads beyond the end of the known-length sgRNA sequence (see sgRNA length parameter).
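The trimming steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the module's actual source: the function names `find_prefix` and `trim_read`, the example prefix `CACCG`, and the choice to accept the leftmost matching window are all assumptions made for the example. Mismatches are counted as simple base substitutions (Hamming distance), and the quality string is trimmed with the same coordinates so that it stays aligned with the sequence.

```python
def find_prefix(read, prefix, max_mismatches):
    """Return the index of the leftmost window of `read` that differs from
    `prefix` in at most `max_mismatches` positions, or -1 if none exists."""
    plen = len(prefix)
    for start in range(len(read) - plen + 1):
        window = read[start:start + plen]
        mismatches = sum(1 for a, b in zip(window, prefix) if a != b)
        if mismatches <= max_mismatches:
            return start
    return -1


def trim_read(read, quality, prefix, sgrna_length, max_mismatches=0):
    """Return (sgRNA sequence, sgRNA qualities), or None when the prefix is
    not found or fewer than `sgrna_length` bases follow it."""
    start = find_prefix(read, prefix, max_mismatches)
    if start < 0:
        return None
    begin = start + len(prefix)   # first base after the prefix
    end = begin + sgrna_length    # one past the last sgRNA base
    if end > len(read):
        return None
    return read[begin:end], quality[begin:end]


if __name__ == "__main__":
    # Hypothetical read: upstream vector bases + prefix + 20 bp sgRNA + downstream bases.
    example_read = "GGAAAGGACGAAA" + "CACCG" + "ACGTACGTACGTACGTACGT" + "GTTTTAGAGC"
    example_quality = "I" * len(example_read)
    print(trim_read(example_read, example_quality, "CACCG", 20))
```

With a short prefix, even one tolerated mismatch makes spurious matches in the upstream vector sequence much more likely, which is why the mismatch budget should shrink with the prefix length.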
The paired sgRNA lentiCRISPR vectors are designed with a constant-valued prefix sequence positioned just upstream (in the 5' direction) of the first sgRNA sequence, and a different constant-valued prefix sequence positioned just downstream (in the 3' direction) of the second sgRNA sequence. When trimming read records in the FASTQ file containing the reverse reads, the reverse reads parameter must be set to True. This instructs the module to search for the reverse complement of the provided prefix.
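As a rough illustration of what setting reverse reads to True implies, the sketch below computes the reverse complement of a prefix before it is used in the search. The helper names `reverse_complement` and `effective_prefix` and the example prefix `CACCG` are hypothetical and are not taken from the module's code.

```python
# Complement table for unambiguous bases plus N (upper and lower case).
_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "N": "N",
    "a": "t", "c": "g", "g": "c", "t": "a", "n": "n",
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def effective_prefix(prefix, reverse_reads):
    """Return the sequence to search for in the reads: the prefix itself for
    forward reads, its reverse complement for reverse reads."""
    return reverse_complement(prefix) if reverse_reads else prefix


if __name__ == "__main__":
    print(effective_prefix("CACCG", reverse_reads=False))
    print(effective_prefix("CACCG", reverse_reads=True))
```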
When conducting paired sgRNA CRISPR screens, our tool suite requires that read record pairing be maintained; i.e., the trimmed FASTQ files must retain the one-to-one mapping of forward and reverse read records. Therefore, if the read trimmer is unable to locate the prefix, or if the number of base reads available after the located prefix is less than the stated length of the sgRNA (sgRNA length), this module will present the first sgRNA length bases of the forward or reverse read sequence as the trimmed read. This maintains the pairing; however, the reported sequence will not align to any of the known reference sgRNA sequences.
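The fallback behavior can be sketched as follows. This is an assumption-laden illustration rather than the module's implementation: `trim_or_fallback` is a hypothetical name, the prefix `CACCG` is only an example, and the sketch stops at the first (leftmost) prefix match. The key point is that the function always returns a sequence, so every input record yields exactly one output record and the forward and reverse files stay in step.

```python
def trim_or_fallback(read, prefix, sgrna_length, max_mismatches=0):
    """Return the sgRNA following the prefix, or the first `sgrna_length`
    bases of the read when the prefix is missing or too close to the end."""
    plen = len(prefix)
    for start in range(len(read) - plen + 1):
        window = read[start:start + plen]
        mismatches = sum(a != b for a, b in zip(window, prefix))
        if mismatches <= max_mismatches:
            begin = start + plen
            if begin + sgrna_length <= len(read):
                return read[begin:begin + sgrna_length]
            break  # prefix located, but too few bases remain after it
    # Fallback: keep a placeholder sequence so that record pairing is preserved.
    return read[:sgrna_length]


if __name__ == "__main__":
    # Hypothetical reads: one with the prefix, one without.
    print(trim_or_fallback("TTTT" + "CACCG" + "ACGTACGTACGTACGTACGT" + "GG", "CACCG", 20))
    print(trim_or_fallback("T" * 25, "CACCG", 20))
```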
http://www.genome-engineering.org/crispr/
Name | Description |
---|---|
reads file * | fastq or fastq.gz file |
prefix * | The nucleotide sequence that immediately precedes the sgRNAs. |
prefix mismatches * | Maximum number of mismatches tolerated when searching for prefix in read sequence. Must be <= 2. |
sgRNA length * | Length of sgRNA sequence. 20 in most cases. |
max num reads * | Maximum number of read records to process. |
output basename * | Basename for all output files. |
reverse reads | Set to True if extracting second sgRNA from a paired sgRNA CRISPR screen. |
* - required
This module is written in Python. The GenePattern server on which it is installed must have a custom configuration setting named python_2.7 whose value is the path to a Python 2.7 interpreter. The module's Python code imports tools from the Biopython package, which must be installed on the server's host system along with Python 2.7.
Task Type:
CRISPR
CPU Type:
any
Operating System:
any
Language:
Python
Version | Release Date | Description |
---|---|---|
8 |