use random forest to split the reads into nucleosome free, mononucleosome, dinucleosome and trinucleosome. The features used in random forest including fragment length, GC content, and UCSC phastCons conservation scores.

splitGAlignmentsByCut(
  obj,
  txs,
  genome,
  conservation,
  outPath,
  breaks = c(0, 100, 180, 247, 315, 473, 558, 615, Inf),
  labels = c("NucleosomeFree", "inter1", "mononucleosome", "inter2", "dinucleosome",
    "inter3", "trinucleosome", "others"),
  labelsOfNucleosomeFree = "NucleosomeFree",
  labelsOfMononucleosome = "mononucleosome",
  trainningSetPercentage = 0.15,
  cutoff = 0.8,
  halfSizeOfNucleosome = 80L
)

Arguments

obj

an object of GAlignments

txs

GRanges of transcripts

genome

an object of BSgenome

conservation

an object of GScores.

outPath

folder to save the splitted alignments. If outPath is setting, the return of the function will not contain seq and qual fields.

breaks

a numeric vector for fragment size of nucleosome free, mononucleosome, dinucleosome and trinucleosome. The breaks pre-defined here is following the description of Greenleaf's paper (see reference).

labels

a character vector for labels of the levels of the resulting category.

labelsOfNucleosomeFree, labelsOfMononucleosome

character(1). The label for nucleosome free and mononucleosome.

trainningSetPercentage

numeric(1) between 0 and 1. Percentage of trainning set from top coverage.

cutoff

numeric(1) between 0 and 1. cutoff value for prediction.

halfSizeOfNucleosome

numeric(1) or integer(1). Thre read length will be adjusted to half of the nucleosome size to enhance the signal-to-noise ratio.

Value

a list of GAlignments

References

Buenrostro, J.D., Giresi, P.G., Zaba, L.C., Chang, H.Y. and Greenleaf, W.J., 2013. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature methods, 10(12), pp.1213-1218.

Chen, K., Xi, Y., Pan, X., Li, Z., Kaestner, K., Tyler, J., Dent, S., He, X. and Li, W., 2013. DANPOS: dynamic analysis of nucleosome position and occupancy by sequencing. Genome research, 23(2), pp.341-351.

Author

Jianhong Ou

Examples

library(GenomicRanges) bamfile <- system.file("extdata", "GL1.bam", package="ATACseqQC", mustWork=TRUE) tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT") gal1 <- readBamFile(bamFile=bamfile, tag=tags, which=GRanges("chr1", IRanges(1, 1e6)), asMates=FALSE) names(gal1) <- mcols(gal1)$qname library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txs <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene) library(phastCons100way.UCSC.hg19)
#> Loading required package: GenomicScores
#> #> Attaching package: 'GenomicScores'
#> The following object is masked from 'package:utils': #> #> citation
splitGAlignmentsByCut(gal1, txs=txs, genome=Hsapiens, conservation=phastCons100way.UCSC.hg19)
#> GAlignmentsList object of length 8: #> $NucleosomeFree #> GAlignments object with 1196 alignments and 16 metadata columns: #> seqnames strand cigar qwidth start end width #> <Rle> <Rle> <character> <integer> <integer> <integer> <integer> #> 14 chr1 + 47M 47 565854 565900 47 #> 14 chr1 - 47M 47 565854 565900 47 #> 17 chr1 + 40M 40 565903 565942 40 #> 17 chr1 - 40M 40 565903 565942 40 #> 56 chr1 + 50M 50 566898 566947 50 #> ... ... ... ... ... ... ... ... #> 5083 chr1 + 50M 50 995025 995074 50 #> 5085 chr1 - 50M 50 995029 995078 50 #> 5083 chr1 - 50M 50 995042 995091 50 #> 5103 chr1 + 50M 50 995094 995143 50 #> 5103 chr1 - 50M 50 995097 995146 50 #> njunc | qname flag mapq isize #> <integer> | <character> <integer> <integer> <integer> #> 14 0 | 14 99 31 -47 #> 14 0 | 14 147 31 -47 #> 17 0 | 17 99 31 -40 #> 17 0 | 17 147 31 -40 #> 56 0 | 56 99 42 93 #> ... ... . ... ... ... ... #> 5083 0 | 5083 99 42 67 #> 5085 0 | 5085 83 42 -62 #> 5083 0 | 5083 147 42 -67 #> 5103 0 | 5103 163 42 53 #> 5103 0 | 5103 83 42 -53 #> seq qual mrnm AS #> <DNAStringSet> <PhredQuality> <factor> <integer> #> 14 CCCACCATCA...ACTTCTACCT CCCFFFFFHH...JJJJJJJJIJ chr1 0 #> 14 CCCACCATCA...ACTTCTACCT JJIHIJJJJJ...HHFFFDDCCB chr1 0 #> 17 GCCTAATCTA...TCCCTATATC CCCFFFFFHH...JJJJGJJIIJ chr1 0 #> 17 GCCTAATCTA...TCCCTATATC IJJJJJIIHE...FHFD?FDBBB chr1 0 #> 56 TCCTTACACC...TCAATTTCAT CCCFFFFFHH...JJJIIIJJJJ chr1 0 #> ... ... ... ... ... #> 5083 GCGTGGCCCG...CGGAGCCCAG @@@DDDFFHG...DBDBB5@DBB chr1 0 #> 5085 GGCCCGGGGC...GCCCAGGCAG DDDBDDDDFF...HHFFFFFCCB chr1 0 #> 5083 GCATCGGGGC...CTTCGCGGAG ?BDDBBB?6>...HHFDDFF@B@ chr1 0 #> 5103 GCCGGAGGGT...GCGAGCGGCG ??BDDDDFFF...########## chr1 0 #> 5103 GGAGGGTGCG...AGCGGCGCGG <@8@8575&B...:?8@8BD=?? chr1 -4 #> XN XM XO XG NM MD YS #> <integer> <integer> <integer> <integer> <integer> <character> <integer> #> 14 0 0 0 0 0 47 0 #> 14 0 0 0 0 0 47 0 #> 17 0 0 0 0 0 40 0 #> 17 0 0 0 0 0 40 0 #> 56 0 0 0 0 0 50 -3 #> ... ... ... ... ... ... ... ... #> 5083 0 0 0 0 0 50 0 #> 5085 0 0 0 0 0 50 0 #> 5083 0 0 0 0 0 50 0 #> 5103 0 0 0 0 0 50 -4 #> 5103 0 2 0 0 2 24A7C17 0 #> YT #> <character> #> 14 CP #> 14 CP #> 17 CP #> 17 CP #> 56 CP #> ... ... #> 5083 CP #> 5085 CP #> 5083 CP #> 5103 CP #> 5103 CP #> ------- #> seqinfo: 25 sequences from an unspecified genome #> #> ... #> <7 more elements>