Outputs
The SOPRANO CLI caches data into the folder <job_cache>/<job_name>, where
the job cache and name are determined from the argument flags -o | --output
and -n | --name respectively.
A SOPRANO pipeline cache has the following structure:
job_cache
│   figure.pdf
│   pipeline.params
│   samples_df.csv
│   samples_df.meta
│   statistics.json
│   kde_config.json
│
└───data
│   │   data.log
│   │   data.results.tsv
│   │   intermittent.data.tar.gz
│
└───sample_0000
│   │   sample_0000.results.log
│   │   sample_0000.results.tsv
│   │   intermittent.data.tar.gz
│
└───sample_0001
│   │   sample_0001.results.log
│   │   sample_0001.results.tsv
│   │   intermittent.data.tar.gz
Each data or sample component of the analysis has its own subdirectory,
containing its own log file. While a job is in progress, the intermittent
data from each pipeline step is cached therein. Once the sample pipeline job
has completed, the intermittent cache data is compressed into a *.tar.gz
archive, leaving only the log file, the results file and the archive.
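As an illustration, the layout above can be inspected programmatically. The sketch below is not part of SOPRANO: the helper name summarise_cache is hypothetical, and it assumes that a component counts as finished once its intermittent.data.tar.gz archive exists.

```python
from pathlib import Path


def summarise_cache(job_dir):
    """Map each data/sample subdirectory name to True if its archive exists."""
    status = {}
    for sub in sorted(Path(job_dir).iterdir()):
        if sub.is_dir():
            # Assumption: the archive is only written once the component completes.
            status[sub.name] = (sub / "intermittent.data.tar.gz").exists()
    return status
```

Calling summarise_cache on a job folder returns a dictionary such as {"data": True, "sample_0000": False}, which gives a quick view of which components are still in progress.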
There are three classifications of outputs that we will describe in turn:
- Universal outputs - These are always generated by a SOPRANO pipeline run, and cannot be changed after they are created.
- Pipeline outputs - These are generated for each SOPRANO pipeline run. Once created, these will not be overwritten or deleted. More pipeline runs can be generated by launching more samples.
- KDE outputs - These will be overwritten each time SOPRANO is run. They are relatively quick to compute on-the-fly, and are determined by the KDE parameters and number of samples defined.
Universal outputs
pipeline.params
This file describes the pipeline job's parameters and is written to the root of the job folder. The function of this file is two-fold:
- Users can refer to this file to contextualize results;
- SOPRANO can use this file to ensure that only pipeline definitions with identical inputs can be cached in the same location.
Note: Whilst these definitions ensure that conflicting data cannot be
written to the same place, they do not forbid differences in n_samples.
This means that users can iteratively scale up their analysis as and when
needed.
An example pipeline.params file is displayed below.
{
"bed_path": "/path/to/SOPRANO/data/example_immunopeptidomes/TCGA-05-4396.Expressed.IEDBpeps.SB.epitope.bed",
"exclude_drivers": "True",
"fasta": "/path/to/SOPRANO/ensembl_downloads/homo_sapiens/110_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa",
"input_path": "/path/to/SOPRANO/data/example_annotations/TCGA-05-4396-01A-21D-1855-08.annotated",
"job_cache": "/tmp/test_333",
"protein_transcript_length": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcript_protein.length",
"random_regions": "None",
"seed": "333",
"sizes": "/path/to/SOPRANO/ensembl_downloads/homo_sapiens/110_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.chrom",
"transcript_fasta": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcriptID.fasta",
"transcript_length": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcript.length",
"use_ssb192": "False"
}
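To illustrate the second role of this file, the sketch below compares a cached pipeline.params against a proposed set of parameters. It is only a minimal sketch: the function name params_match is hypothetical, and how SOPRANO itself performs this check is not shown here. It relies on the observation that every value in the example file is stored as a string.

```python
import json
from pathlib import Path


def params_match(params_file, new_params):
    """Return True if the cached parameters equal the proposed ones."""
    cached = json.loads(Path(params_file).read_text())
    # Values in the example file are strings, so compare string forms.
    return cached == {key: str(value) for key, value in new_params.items()}
```

A caller could refuse to write into an existing job folder whenever params_match returns False.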
Pipeline outputs
*.log
This is the log file for the individual SOPRANO pipeline runs.
*.results.tsv
Each data/sample will produce a TSV file in the following format:
Coverage | ON_dnds | ON_Low_CI | ON_High_CI | ON_Mutations | OFF_dNdS | OFF_Low_CI | OFF_High_CI | OFF_Mutations | Pvalue | ON_na | ON_NA | ON_ns | ON_NS | OFF_na | OFF_NA | OFF_ns | OFF_NS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Exonic_Only | | | | | | | | | | | | | | | | | |
Exonic_Intronic | | | | | | | | | | | | | | | | | |
Only the first row (Exonic_Only) of the table will be generated if no
intronic mutations are found in the input. When they are present, the
SOPRANO algorithm uses intronic mutations to improve the background counts
of silent mutations, and the second row (Exonic_Intronic) is also written.
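Because the second row may be absent, code that consumes a results file should not assume that both rows exist. The sketch below reads a results file into a dictionary keyed by the Coverage column. It assumes the file is tab-delimited with a header row matching the table above; the function name read_results is hypothetical.

```python
import csv


def read_results(tsv_path):
    """Return {coverage_label: row_dict} for a *.results.tsv file."""
    with open(tsv_path, newline="") as handle:
        # Assumption: tab-delimited, with the header row shown in the table.
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Coverage"]: row for row in reader}
```

With this shape, read_results(path).get("Exonic_Intronic") evaluates to None when only the exonic row was written.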
The empirical counts of mutation numbers are computed on the fly by the SOPRANO pipeline. Statistical quantities placed in this table are analytic estimates based on the empirical mutation counts using the Katz method.
- ON_dnds - dN/dS of the target region provided in the bed file
- ON_Low_CI - lower value for the 95% CI of the target
- ON_High_CI - upper value for the 95% CI of the target
- ON_Mutations - number of mutations observed inside the target region
- OFF_dNdS - dN/dS of the OFF-target region
- OFF_Low_CI - lower value for the 95% CI of the OFF-target
- OFF_High_CI - upper value for the 95% CI of the OFF-target
- OFF_Mutations - number of mutations observed outside the target region
- Pvalue - P-value estimated from the comparison of the confidence intervals from ON and OFF dN/dS values
- ON_na - Observed number of nonsilent mutations ON target
- ON_NA - Number of nonsilent sites (corrected) ON target
- ON_ns - Observed number of silent mutations ON target
- ON_NS - Number of silent sites (corrected) ON target
- OFF_na - Observed number of nonsilent mutations OFF target
- OFF_NA - Number of nonsilent sites (corrected) OFF target
- OFF_ns - Observed number of silent mutations OFF target
- OFF_NS - Number of silent sites (corrected) OFF target
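The statistics in the table are described as Katz-method estimates derived from the mutation counts. As a rough illustration only, the sketch below applies the Katz log method for a ratio of two proportions to dN/dS written as (na/NA)/(ns/NS). Both that definition and the function name katz_ci are assumptions made for this example; SOPRANO's own implementation may differ in detail.

```python
import math


def katz_ci(na, NA, ns, NS, z=1.96):
    """Return (ratio, low, high) for (na/NA)/(ns/NS) with a Katz log interval."""
    ratio = (na / NA) / (ns / NS)
    # Katz standard error of the log ratio of two proportions.
    se = math.sqrt(1 / na - 1 / NA + 1 / ns - 1 / NS)
    low = math.exp(math.log(ratio) - z * se)
    high = math.exp(math.log(ratio) + z * se)
    return ratio, low, high
```

The default z of 1.96 corresponds to a 95% interval. The function requires non-zero mutation counts, since the standard error is undefined when na or ns is zero.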
intermittent.data.tar.gz
Compressed tar archive containing the intermittent data produced by the SOPRANO pipeline run for the data/sample.
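The archive can be inspected without unpacking it. The sketch below uses Python's standard tarfile module to list the member names; the function name list_archive is hypothetical, and the names of the files stored inside the archive are not specified here.

```python
import tarfile


def list_archive(archive_path):
    """Return the member names stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```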
KDE outputs
When the number of samples --n_samples is specified and non-zero, SOPRANO
performs a kernel density estimate based on the sample files. This is the
basis of a numerical downstream analysis that produces an additional
set of files.
figure.pdf
A figure containing histogram data of the samples, overlaid with the corresponding kernel densities, with vertical markers for the data dN/dS values.
samples_df.csv
A vertical concatenation of the sample outputs used to estimate the
kernel density. It pulls from the sample_XXXX.results.tsv files.
samples_df.meta
Contains the metadata recording which sample results were used to estimate the KDE.
kde_config.json
The parameters used to perform the KDE. This provides an explicit record of the parameter space search used to fit the Gaussian kernel, and of the parameters used to integrate the KDE to obtain p-values.
statistics.json
Contains the statistical inferences from the samples and kernel density
estimates. The general structure is given below. The exonic_intronic key
and its values will only exist if the corresponding results are found in
the data.
The samples key contains summary statistics from the samples, namely the
mean and standard deviation of their dN/dS values.
The data key contains the dN/dS value for the non-randomized input run
through SOPRANO. The p_value_left and p_value_right values are computed
by integrating the kernel density from the left and right tails respectively,
up to the data dN/dS value.
{
"ON_dNdS": {
"exonic_only": {
"samples": {
"mean_value": ...,
"std_dev": ...
},
"data": {
"value": ...,
"p_value_left": ...,
"p_value_right": ...
}
},
"exonic_intronic": {
"samples": {
"mean_value": ...,
"std_dev": ...
},
"data": {
"value": ...,
"p_value_left": ...,
"p_value_right": ...
}
}
},
"OFF_dNdS": {
"exonic_only": {
"samples": {
"mean_value": ...,
"std_dev": ...
},
"data": {
"value": ...,
"p_value_left": ...,
"p_value_right": ...
}
},
"exonic_intronic": {
"samples": {
"mean_value": ...,
"std_dev": ...
},
"data": {
"value": ...,
"p_value_left": ...,
"p_value_right": ...
}
}
}
}
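To make the tail integration concrete, the sketch below computes left and right tail probabilities for a Gaussian kernel density estimate. It is a minimal illustration rather than SOPRANO's implementation: the function name kde_tail_p_values is hypothetical, and the bandwidth is passed in directly, whereas SOPRANO selects its kernel parameters through the search recorded in kde_config.json.

```python
import math


def kde_tail_p_values(samples, value, bandwidth):
    """Integrate a Gaussian KDE of `samples` up to `value` from each tail."""

    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    # The integral of a Gaussian KDE from -inf to `value` is the mean of the
    # normal CDFs of the kernels centred on each sample.
    left = sum(normal_cdf((value - s) / bandwidth) for s in samples) / len(samples)
    return left, 1.0 - left
```

Because each Gaussian kernel has a closed-form cumulative distribution, no numerical quadrature is needed in this simplified setting.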