Title: | Analysis of Adaptive Immune Receptor Repertoire Germ Line Statistics |
---|---|
Description: | Multiple tools are now available for inferring the personalised germ line set from an adaptive immune receptor repertoire. Output from these tools is converted to a single format and supplemented with rich data such as usage and characterisation of 'novel' germ line alleles. This data can be particularly useful when considering the validity of novel inferences. Use of the analysis provided is described in <doi:10.3389/fimmu.2019.00435>. |
Authors: | William Lees [aut, cre] |
Maintainer: | William Lees <[email protected]> |
License: | CC BY-SA 4.0 |
Version: | 0.5.1 |
Built: | 2024-11-02 15:57:46 UTC |
Source: | https://github.com/airr-community/ogrdbstats |
A small example of the analytical datasets created by ogrdbstats from repertoires and reference sets. The dataset can be created by running the example shown for the function read_input_data(). The dataset is created from example files provided with the package. The repertoire data is taken from Rubelt et al. 2016, <doi: 10.1038/ncomms11112>
example_rep
example_rep
## 'example_rep' - a named list containing the following elements:
ref_genes | named list of IMGT-gapped reference genes |
inferred_seqs | named list of IMGT-gapped inferred (novel) sequences. |
input_sequences | data frame with one row per annotated read, with CHANGEO-style column names. The column SEG_CALL is the gene call for the segment under analysis. Hence if segment is 'V', 'V_CALL' will be renamed 'SEG_CALL' whereas is segment is 'J', 'J_CALL' is renamed 'SEG_CALL'. This simplifies downstream processing. Rows in the input file with ambiguous SEG_CALLs, or no call, are removed. |
genotype_db | named list of gene sequences referenced in the annotated reads (both reference and novel sequences) |
haplo_details | data used for haplotype analysis, showing allelic ratios calculated with various potential haplotyping genes |
genotype | data frame containing information provided in the OGRDB genotype csv file |
calculated_NC | a boolean that is TRUE if mutation counts were calculated by this library, FALSE if they were read from the annotated read file |
<doi: 10.1038/ncomms11112>
This creates the genotype report (suffixed _ogrdb_report.csv) and the plot file (suffixed _ogrdb_plos.pdf). Both are created in the directory holding the annotated read file, and the file names are prefixed by the name of the annotated read file.
generate_ogrdb_report( ref_filename, inferred_filename, species, filename, chain, hap_gene, segment, chain_type, plot_unmutated, all_inferred = FALSE, format = "pdf" )
generate_ogrdb_report( ref_filename, inferred_filename, species, filename, chain, hap_gene, segment, chain_type, plot_unmutated, all_inferred = FALSE, format = "pdf" )
ref_filename |
Name of file containing IMGT-aligned reference genes in FASTA format |
inferred_filename |
Name of file containing sequences of inferred novel alleles, or '-' if none |
species |
Species name used in field 3 of the IMGT germline header with spaces omitted, if the reference file is from IMGT. Otherwise ” |
filename |
Name of file containing annotated reads in AIRR, CHANGEO or IgDiscover format. The format is detected automatically |
chain |
one of IGHV, IGKV, IGLV, IGHD, IGHJ, IGKJ, IGLJ, TRAV, TRAj, TRBV, TRBD, TRBJ, TRGV, TRGj, TRDV, TRDD, TRDJ |
hap_gene |
The haplotyping columns will be completed based on the usage of the two most frequent alleles of this gene. If NA, the column will be blank |
segment |
one of V, D, J |
chain_type |
one of H, L |
plot_unmutated |
Plot base composition using only unmutated sequences (V-chains only) |
all_inferred |
Treat all alleles as novel |
format |
The format for the plot file ('pdf', 'html' or 'none') |
None
# prepare files for example reference_set = system.file("extdata/ref_gapped.fasta", package = "ogrdbstats") inferred_set = system.file("extdata/novel_gapped.fasta", package = "ogrdbstats") repertoire = system.file("extdata/ogrdbstats_example_repertoire.tsv", package = "ogrdbstats") file.copy(repertoire, tempdir()) repfile = file.path(tempdir(), 'ogrdbstats_example_repertoire.tsv') generate_ogrdb_report(reference_set, inferred_set, 'Homosapiens', repfile, 'IGHV', NA, 'V', 'H', FALSE, format='none') #clean up outfile = file.path(tempdir(), 'ogrdbstats_example_repertoire_ogrdb_report.csv') file.remove(repfile) file.remove(outfile)
# prepare files for example reference_set = system.file("extdata/ref_gapped.fasta", package = "ogrdbstats") inferred_set = system.file("extdata/novel_gapped.fasta", package = "ogrdbstats") repertoire = system.file("extdata/ogrdbstats_example_repertoire.tsv", package = "ogrdbstats") file.copy(repertoire, tempdir()) repfile = file.path(tempdir(), 'ogrdbstats_example_repertoire.tsv') generate_ogrdb_report(reference_set, inferred_set, 'Homosapiens', repfile, 'IGHV', NA, 'V', 'H', FALSE, format='none') #clean up outfile = file.path(tempdir(), 'ogrdbstats_example_repertoire_ogrdb_report.csv') file.remove(repfile) file.remove(outfile)
Collect parameters from the command line and use them to create a report and CSV file
genotype_statistics_cmd(args = NULL)
genotype_statistics_cmd(args = NULL)
args |
A string vector containing the command line arguments. If NULL, will take them from the command line |
Nothing
# Prepare files for example reference_set = system.file("extdata/ref_gapped.fasta", package = "ogrdbstats") inferred_set = system.file("extdata/novel_gapped.fasta", package = "ogrdbstats") repertoire = system.file("extdata/ogrdbstats_example_repertoire.tsv", package = "ogrdbstats") file.copy(repertoire, tempdir()) repfile = file.path(tempdir(), 'repertoire.tsv') genotype_statistics_cmd(c( reference_set, 'Homosapiens', repfile, 'IGHV', '--inf_file', inferred_set, '--format', 'none')) # clean up outfile = file.path(tempdir(), 'repertoire_ogrdb_report.csv') plotdir = file.path(tempdir(), 'repertoire_ogrdb_plots') file.remove(repfile) file.remove(outfile) unlink(plotdir, recursive=TRUE)
# Prepare files for example reference_set = system.file("extdata/ref_gapped.fasta", package = "ogrdbstats") inferred_set = system.file("extdata/novel_gapped.fasta", package = "ogrdbstats") repertoire = system.file("extdata/ogrdbstats_example_repertoire.tsv", package = "ogrdbstats") file.copy(repertoire, tempdir()) repfile = file.path(tempdir(), 'repertoire.tsv') genotype_statistics_cmd(c( reference_set, 'Homosapiens', repfile, 'IGHV', '--inf_file', inferred_set, '--format', 'none')) # clean up outfile = file.path(tempdir(), 'repertoire_ogrdb_report.csv') plotdir = file.path(tempdir(), 'repertoire_ogrdb_plots') file.remove(repfile) file.remove(outfile) unlink(plotdir, recursive=TRUE)
Create a barplot for each allele, showing number of reads distributed by mutation count
make_barplot_grobs( input_sequences, genotype_db, inferred_seqs, genotype, segment, calculated_NC )
make_barplot_grobs( input_sequences, genotype_db, inferred_seqs, genotype, segment, calculated_NC )
input_sequences |
the input_sequences data frame |
genotype_db |
named list of gene sequences in the personalised genotype |
inferred_seqs |
named list of novel gene sequences |
genotype |
data frame created by calc_genotype |
segment |
one of V, D, J |
calculated_NC |
a boolean, TRUE if mutation counts had to be calculated, FALSE otherwise |
list of grobs
barplot_grobs = make_barplot_grobs( example_rep$input_sequences, example_rep$genotype_db, example_rep$inferred_seqs, example_rep$genotype, 'V', example_rep$calculated_NC )
barplot_grobs = make_barplot_grobs( example_rep$input_sequences, example_rep$genotype_db, example_rep$inferred_seqs, example_rep$genotype, 'V', example_rep$calculated_NC )
Create haplotyping plots
make_haplo_grobs(segment, haplo_details)
make_haplo_grobs(segment, haplo_details)
segment |
one of V, D, J |
haplo_details |
Data structure created by create_haplo_details |
named list containing the following elements:
a_allele_plot | plot showing allele usage for each potential haplotyping gene |
haplo_grobs | differential plot of allele usage for each usable haplotyping gene |
haplo_grobs = make_haplo_grobs('V', example_rep$haplo_details)
haplo_grobs = make_haplo_grobs('V', example_rep$haplo_details)
Create plots showing base usage at selected locations in sequences based on novel alleles
make_novel_base_grobs(inferred_seqs, input_sequences, segment, all_inferred)
make_novel_base_grobs(inferred_seqs, input_sequences, segment, all_inferred)
inferred_seqs |
named list of novel gene sequences |
input_sequences |
the input_sequences data frame |
segment |
one of V, D, J |
all_inferred |
true if user has requested all alleles in reference set plotted - will suppress some warnings |
named list containing the following elements:
cdr3_dist | cdr3 length distribution plots |
whole | whole-length usage plots |
end | 3' end usage plots |
conc | 3' end consensus composition plots |
triplet | 3' end triplet usage plots |
base_grobs = make_novel_base_grobs( example_rep$inferred_seqs, example_rep$input_sequences, 'V', FALSE )
base_grobs = make_novel_base_grobs( example_rep$inferred_seqs, example_rep$input_sequences, 'V', FALSE )
Read input files into memory
read_input_files( ref_filename, inferred_filename, species, filename, chain, hap_gene, segment, chain_type, all_inferred )
read_input_files( ref_filename, inferred_filename, species, filename, chain, hap_gene, segment, chain_type, all_inferred )
ref_filename |
Name of file containing IMGT-aligned reference genes in FASTA format |
inferred_filename |
Name of file containing sequences of inferred novel alleles, or '-' if none |
species |
Species name used in field 3 of the IMGT germline header with spaces omitted, if the reference file is from IMGT. Otherwise ” |
filename |
Name of file containing annotated reads in AIRR, CHANGEO or IgDiscover format. The format is detected automatically |
chain |
one of IGHV, IGKV, IGLV, IGHD, IGHJ, IGKJ, IGLJ, TRAV, TRAj, TRBV, TRBD, TRBJ, TRGV, TRGj, TRDV, TRDD, TRDJ |
hap_gene |
The haplotyping columns will be completed based on the usage of the two most frequent alleles of this gene. If NA, the column will be blank |
segment |
one of V, D, J |
chain_type |
one of H, L |
all_inferred |
Treat all alleles as novel |
A named list containing the following elements:
ref_genes | named list of IMGT-gapped reference genes |
inferred_seqs | named list of IMGT-gapped inferred (novel) sequences. |
input_sequences | data frame with one row per annotated read, with CHANGEO-style column names One key point: the column SEG_CALL is the gene call for the segment under analysis. Hence if segment is 'V', 'V_CALL' will be renamed 'SEG_CALL' whereas is segment is 'J', 'J_CALL' is renamed 'SEG_CALL'. This simplifies downstream processing. Rows in the input file with ambiguous SEG_CALLs, or no call, are removed. |
genotype_db | named list of gene sequences referenced in the annotated reads (both reference and novel sequences) |
haplo_details | data used for haplotype analysis, showing allelic ratios calculated with various potential haplotyping genes |
genotype | data frame containing information provided in the OGRDB genotype csv file |
calculated_NC | a boolean that is TRUE if mutation counts were calculated by this library, FALSE if they were read from the annotated read file |
# Create the analysis data set from example files provided with the package #(this dataset is also provided in the package as example_rep) reference_set = system.file("extdata/ref_gapped.fasta", package = "ogrdbstats") inferred_set = system.file("extdata/novel_gapped.fasta", package = "ogrdbstats") repertoire = system.file("extdata/ogrdbstats_example_repertoire.tsv", package = "ogrdbstats") example_data = read_input_files(reference_set, inferred_set, 'Homosapiens', repertoire, 'IGHV', NA, 'V', 'H', FALSE)
# Create the analysis data set from example files provided with the package #(this dataset is also provided in the package as example_rep) reference_set = system.file("extdata/ref_gapped.fasta", package = "ogrdbstats") inferred_set = system.file("extdata/novel_gapped.fasta", package = "ogrdbstats") repertoire = system.file("extdata/ogrdbstats_example_repertoire.tsv", package = "ogrdbstats") example_data = read_input_files(reference_set, inferred_set, 'Homosapiens', repertoire, 'IGHV', NA, 'V', 'H', FALSE)
Write the genotype file required by OGRDB
write_genotype_file(filename, segment, chain_type, genotype)
write_genotype_file(filename, segment, chain_type, genotype)
filename |
name of file to create (csv) |
segment |
one of V, D, J |
chain_type |
one of H, L |
genotype |
genotype data frame |
None
genotype_file = tempfile("ogrdb_genotype") write_genotype_file(genotype_file, 'V', 'H', example_rep$genotype) file.remove(genotype_file)
genotype_file = tempfile("ogrdb_genotype") write_genotype_file(genotype_file, 'V', 'H', example_rep$genotype) file.remove(genotype_file)
Create the OGRDB style plot file
write_plot_file( filename, input_sequences, cdr3_dist_grobs, end_composition_grobs, cons_composition_grobs, whole_composition_grobs, triplet_composition_grobs, barplot_grobs, a_allele_plot, haplo_grobs, message, format )
write_plot_file( filename, input_sequences, cdr3_dist_grobs, end_composition_grobs, cons_composition_grobs, whole_composition_grobs, triplet_composition_grobs, barplot_grobs, a_allele_plot, haplo_grobs, message, format )
filename |
name of file to create (pdf) |
input_sequences |
the input_sequences data frame |
cdr3_dist_grobs |
cdr3 length distribution grobs created by make_novel_base_grob |
end_composition_grobs |
end composition grobs created by make_novel_base_grobs |
cons_composition_grobs |
consensus composition grobs created by make_novel_base_grobs |
whole_composition_grobs |
whole composition grobs created by make_novel_base_grobs |
triplet_composition_grobs |
triplet composition grobs created by make_novel_base_grobs |
barplot_grobs |
barplot grobs created by make_barplot_grons |
a_allele_plot |
a_allele_plot grob created by make_haplo_grobs |
haplo_grobs |
haplo_grobs created by make_haplo_grobs |
message |
text message to display at end of report |
format |
Format of report ('pdf', 'html' or 'none') |
None
plot_file = tempfile(pattern = 'ogrdb_plots') base_grobs = make_novel_base_grobs( example_rep$inferred_seqs, example_rep$input_sequences, 'V', FALSE ) barplot_grobs = make_barplot_grobs( example_rep$input_sequences, example_rep$genotype_db, example_rep$inferred_seqs, example_rep$genotype, 'V', example_rep$calculated_NC ) haplo_grobs = make_haplo_grobs('V', example_rep$haplo_details) write_plot_file( plot_file, example_rep$input_sequences, base_grobs$cdr3_dist, base_grobs$end, base_grobs$conc, base_grobs$whole, base_grobs$triplet, barplot_grobs, haplo_grobs$aplot, haplo_grobs$haplo, "Notes on this analysis", 'none' ) file.remove(plot_file)
plot_file = tempfile(pattern = 'ogrdb_plots') base_grobs = make_novel_base_grobs( example_rep$inferred_seqs, example_rep$input_sequences, 'V', FALSE ) barplot_grobs = make_barplot_grobs( example_rep$input_sequences, example_rep$genotype_db, example_rep$inferred_seqs, example_rep$genotype, 'V', example_rep$calculated_NC ) haplo_grobs = make_haplo_grobs('V', example_rep$haplo_details) write_plot_file( plot_file, example_rep$input_sequences, base_grobs$cdr3_dist, base_grobs$end, base_grobs$conc, base_grobs$whole, base_grobs$triplet, barplot_grobs, haplo_grobs$aplot, haplo_grobs$haplo, "Notes on this analysis", 'none' ) file.remove(plot_file)