Outputs

The SOPRANO CLI caches data into the folder <job_cache>/<job_name>, where the job cache and name are determined from the argument flags -o | --output and -n | --name respectively.

A SOPRANO pipeline cache has the following structure:

n_samples=2 file tree

job_cache
│   figure.pdf
│   pipeline.params
│   samples_df.csv
│   samples_df.meta
│   statistics.json
│   kde_config.json
│
└───data
│   │   data.log
│   │   data.results.tsv    
│   │   intermittent.data.tar.gz    
│
└───sample_0000
│   │   sample_0000.results.log
│   │   sample_0000.results.tsv
│   │   intermittent.data.tar.gz
│
└───sample_0001
│   │   sample_0001.results.log
│   │   sample_0001.results.tsv
│   │   intermittent.data.tar.gz

Each data or sample component of the analysis has its own subdirectory, containing its own log file. While the job is in progress, the intermediate data from each pipeline step is cached in that subdirectory. Once the sample pipeline job has completed, the intermediate cache data is compressed into a *.tar.gz archive, leaving only the log file, the results file and the archive.
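The layout above makes it possible to check which components have finished by inspecting the cache alone. The following is a minimal sketch, not part of SOPRANO itself: the function name completed_components is hypothetical, and it assumes that a component is finished once both its results file and its intermittent.data.tar.gz archive are present.

```python
from pathlib import Path


def completed_components(job_dir):
    """Return the names of subdirectories that look like finished runs.

    Assumption: a run is finished when <name>.results.tsv and
    intermittent.data.tar.gz both exist in its subdirectory.
    """
    done = []
    for sub in sorted(Path(job_dir).iterdir()):
        if not sub.is_dir():
            continue
        has_results = (sub / f"{sub.name}.results.tsv").is_file()
        has_archive = (sub / "intermittent.data.tar.gz").is_file()
        if has_results and has_archive:
            done.append(sub.name)
    return done
```

Calling completed_components on the job folder returns names such as data or sample_0000 for every subdirectory that satisfies both conditions.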

There are three classifications of outputs, which we describe in turn:

Universal outputs - These are always generated by a SOPRANO pipeline run, and cannot be changed after they are created.
Pipeline outputs - These are generated for each SOPRANO pipeline run. Once created, these will not be overwritten or deleted. More pipeline runs can be generated via launching more samples.
KDE outputs - These will be overwritten each time SOPRANO is run. They are relatively quick to compute on-the-fly, and are determined by the KDE parameters and number of samples defined.

Universal outputs

`pipeline.params`

This file describes the pipeline job's parameters and is written to the root of the job cache folder. The function of this file is two-fold:

Users can refer to this file to contextualize results;
SOPRANO can use this file to ensure that only pipeline definitions with identical inputs can be cached in the same location.

Note: Whilst these definitions ensure that conflicting data cannot be written to the same place, they do not forbid differences in n_samples. This means that users can iteratively scale up their analysis as and when needed.

An example pipeline.params file is displayed below.

pipeline.params

{
    "bed_path": "/path/to/SOPRANO/data/example_immunopeptidomes/TCGA-05-4396.Expressed.IEDBpeps.SB.epitope.bed",
    "exclude_drivers": "True",
    "fasta": "/path/to/SOPRANO/ensembl_downloads/homo_sapiens/110_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa",
    "input_path": "/path/to/SOPRANO/data/example_annotations/TCGA-05-4396-01A-21D-1855-08.annotated",
    "job_cache": "/tmp/test_333",
    "protein_transcript_length": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcript_protein.length",
    "random_regions": "None",
    "seed": "333",
    "sizes": "/path/to/SOPRANO/ensembl_downloads/homo_sapiens/110_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.chrom",
    "transcript_fasta": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcriptID.fasta",
    "transcript_length": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcript.length",
    "use_ssb192": "False"
}
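The cache-compatibility check described above can be illustrated by comparing two parameter sets key by key. This is only a sketch of the idea: the helper names load_params and conflicting_keys are hypothetical, and SOPRANO's own check may differ. Values are compared as strings because the example file stores every value as a string.

```python
import json

_MISSING = object()  # sentinel so an absent key never equals the string "None"


def load_params(path):
    """Read a pipeline.params file (JSON) into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def conflicting_keys(cached, requested):
    """Return the sorted keys whose values differ between two parameter sets."""
    conflicts = []
    for key in sorted(set(cached) | set(requested)):
        old = cached.get(key, _MISSING)
        new = requested.get(key, _MISSING)
        if old is _MISSING or new is _MISSING or str(old) != str(new):
            conflicts.append(key)
    return conflicts
```

An empty list means the two definitions agree, so results could share a cache location under the assumptions above.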

Pipeline outputs

`*.log`

This is the log file for the individual SOPRANO pipeline runs.

`*.results.tsv`

Each data/sample will produce a TSV file in the following format:

Coverage	ON_dnds	ON_Low_CI	ON_High_CI	ON_Mutations	OFF_dNdS	OFF_Low_CI	OFF_High_CI	OFF_Mutations	Pvalue	ON_na	ON_NA	ON_ns	ON_NS	OFF_na	OFF_NA	OFF_ns	OFF_NS
Exonic_Only
Exonic_Intronic

The second row (Exonic_Intronic) is only generated if the intronic mutation rate is non-zero; otherwise the table contains only the first row (Exonic_Only). The SOPRANO algorithm uses intronic mutations to improve the background counts of silent mutations.
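Because the Exonic_Intronic row may be absent, code that reads a results file should not assume two rows. A minimal sketch using only the standard library is given here; the function name read_results is hypothetical, and all values are returned as strings exactly as they appear in the file.

```python
import csv


def read_results(lines):
    """Parse a *.results.tsv file into a dict keyed by the Coverage column.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["Coverage"]: row for row in reader}
```

With an open file handle, rows = read_results(handle) followed by rows.get("Exonic_Intronic") returns None when the second row was not generated.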

The empirical mutation counts are computed on the fly by the SOPRANO pipeline. The statistical quantities in this table are analytic estimates derived from those counts using the Katz method.

ON_dnds - dN/dS of the target region provided in the bed file
ON_Low_CI - lower bound of the 95% CI of the target dN/dS
ON_High_CI - upper bound of the 95% CI of the target dN/dS
ON_Mutations - number of mutations observed inside the target region
OFF_dNdS - dN/dS of the OFF-target region, i.e. outside the target region provided in the bed file
OFF_Low_CI - lower bound of the 95% CI of the OFF-target dN/dS
OFF_High_CI - upper bound of the 95% CI of the OFF-target dN/dS
OFF_Mutations - number of mutations observed outside the target region
Pvalue - P-value estimated from the comparison of the confidence intervals of the ON and OFF dN/dS values
ON_na - Observed number of nonsilent mutations ON target
ON_NA - Number of nonsilent sites (corrected) ON target
ON_ns - Observed number of silent mutations ON target
ON_NS - Number of silent sites (corrected) ON target
OFF_na - Observed number of nonsilent mutations OFF target
OFF_NA - Number of nonsilent sites (corrected) OFF target
OFF_ns - Observed number of silent mutations OFF target
OFF_NS - Number of silent sites (corrected) OFF target
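To make the relationship between the counts and the confidence intervals concrete, the sketch below applies the standard Katz log method for the ratio of two proportions to the na, NA, ns and NS columns. It is an illustration under assumptions: the source only names the Katz method, so the exact form used inside SOPRANO (including any corrections) may differ, and the function name katz_interval is hypothetical.

```python
import math


def katz_interval(na, NA, ns, NS, z=1.96):
    """Return (dN/dS, lower, upper) using the Katz log method.

    dN/dS is taken as (na / NA) / (ns / NS); the interval is built on the
    log scale with standard error sqrt(1/na - 1/NA + 1/ns - 1/NS).
    z=1.96 corresponds to an approximate 95% interval.
    """
    if na <= 0 or ns <= 0:
        raise ValueError("mutation counts must be positive for the log method")
    ratio = (na / NA) / (ns / NS)
    se_log = math.sqrt(1 / na - 1 / NA + 1 / ns - 1 / NS)
    half_width = z * se_log
    return ratio, ratio * math.exp(-half_width), ratio * math.exp(half_width)
```

The interval is symmetric on the log scale, so the product of the lower and upper bounds equals the squared ratio.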

`intermittent.data.tar.gz`

Compressed tar archive containing the intermediate data produced by the SOPRANO pipeline run for the data/sample.

KDE outputs

When the number of samples --n_samples is specified and non-zero, SOPRANO performs a kernel density estimate based on the sample files. This is the basis of a numerical downstream analysis that produces an additional set of files.

`figure.pdf`

A figure containing histograms of the samples, overlaid with the corresponding kernel densities, with vertical markers for the data dN/dS values.

`samples_df.csv`

A vertical concatenation of the sample outputs used to estimate the kernel density. It pulls from the sample_XXXX.results.tsv files.

`samples_df.meta`

Contains metadata recording which sample results were used to estimate the KDE.

`kde_config.json`

The parameters used to perform the KDE. This is an explicit record of the parameter space searched to fit the Gaussian kernel, and of the parameters used to integrate the KDE to obtain p-values.

`statistics.json`

Contains the statistical inferences from the samples and kernel density estimates. The general structure is given below. The exonic_intronic key and its associated values will only exist if they are found in the data.

The samples key contains summary statistics from the samples, namely their mean dN/dS values and standard deviation.

The data key contains the dN/dS value for the non-randomized input run through SOPRANO. The p_value_left and p_value_right values are computed by integrating the kernel density from left and right tails respectively, up until the data dN/dS prediction.
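The tail integration can be illustrated without any numerical quadrature: for a Gaussian kernel, the integral of the density up to a point is the average of the normal CDFs centred on each sample. The sketch below assumes a fixed bandwidth passed in by the caller; the function names are hypothetical, and SOPRANO's own fitting and integration settings are those recorded in kde_config.json.

```python
import math


def kde_cdf(x, samples, bandwidth):
    """Integral of a Gaussian KDE from minus infinity up to x."""
    total = 0.0
    for s in samples:
        z = (x - s) / bandwidth
        total += 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return total / len(samples)


def tail_p_values(value, samples, bandwidth):
    """Return (p_value_left, p_value_right) for an observed value."""
    left = kde_cdf(value, samples, bandwidth)
    return left, 1.0 - left
```

The two tails always sum to one, so a data value far to the right of the sample distribution gives a small p_value_right and a p_value_left close to one.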

statistics.json

{
  "ON_dNdS": {
    "exonic_only": {
      "samples": {
        "mean_value": ...,
        "std_dev": ...
      },
      "data": {
        "value": ...,
        "p_value_left": ...,
        "p_value_right": ...
      }
    },
    "exonic_intronic": {
      "samples": {
        "mean_value": ...,
        "std_dev": ...
      },
      "data": {
        "value": ...,
        "p_value_left": ...,
        "p_value_right": ...
      }
    }
  },
  "OFF_dNdS": {
    "exonic_only": {
      "samples": {
        "mean_value": ...,
        "std_dev": ...
      },
      "data": {
        "value": ...,
        "p_value_left": ...,
        "p_value_right": ...
      }
    },
    "exonic_intronic": {
      "samples": {
        "mean_value": ...,
        "std_dev": ...
      },
      "data": {
        "value": ...,
        "p_value_left": ...,
        "p_value_right": ...
      }
    }
  }
}
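Since the exonic_intronic entries are optional, a reader of statistics.json should iterate over whatever keys are present rather than index them directly. The following is a small sketch along those lines; the function name summarise is hypothetical, and it relies only on the key names shown in the structure above.

```python
import json


def summarise(stats):
    """Flatten a statistics.json structure into a list of tuples.

    Each tuple is (region, coverage, value, p_value_left, p_value_right).
    Only the coverage keys actually present in the file are visited.
    """
    rows = []
    for region, coverages in stats.items():
        for coverage, entry in coverages.items():
            data = entry["data"]
            rows.append(
                (
                    region,
                    coverage,
                    data["value"],
                    data["p_value_left"],
                    data["p_value_right"],
                )
            )
    return rows


def summarise_file(path):
    """Load statistics.json from disk and flatten it with summarise()."""
    with open(path, encoding="utf-8") as handle:
        return summarise(json.load(handle))
```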