Outputs

The SOPRANO CLI caches data into the folder <job_cache>/<job_name>, where the job cache and name are determined from the argument flags -o | --output and -n | --name respectively.

A SOPRANO pipeline cache has the following structure:

n_samples=2 file tree
job_cache
│   figure.pdf
│   pipeline.params
│   samples_df.csv
│   samples_df.meta
│   statistics.json
│   kde_config.json
└───data
│   │   data.log
│   │   data.results.tsv    
│   │   intermittent.data.tar.gz    
└───sample_0000
│   │   sample_0000.results.log
│   │   sample_0000.results.tsv
│   │   intermittent.data.tar.gz
└───sample_0001
│   │   sample_0001.results.log
│   │   sample_0001.results.tsv
│   │   intermittent.data.tar.gz

Each data or sample component of the analysis has its own subdirectory containing its own log file. While a job is in progress, the intermittent data from each pipeline step is cached there. Once the sample pipeline job has completed, the intermittent cache data is compressed into a *.tar.gz archive, leaving only the log and results files alongside it.
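The layout described above can be inspected programmatically. The following is a minimal sketch, not part of SOPRANO itself: the function name component_status and the rule "results file plus archive means complete" are assumptions inferred from the description above.

```python
from pathlib import Path

# Archive name as it appears in the file tree above.
ARCHIVE_NAME = "intermittent.data.tar.gz"


def component_status(job_dir: Path) -> dict[str, str]:
    """Classify each data/sample subdirectory of a job folder.

    Assumption: a component is treated as complete when both its
    <name>.results.tsv file and the compressed archive are present.
    """
    status = {}
    for sub in sorted(p for p in job_dir.iterdir() if p.is_dir()):
        has_results = (sub / f"{sub.name}.results.tsv").is_file()
        has_archive = (sub / ARCHIVE_NAME).is_file()
        if has_results and has_archive:
            status[sub.name] = "complete"
        elif any(sub.iterdir()):
            # Some files exist, but the job has not been archived yet.
            status[sub.name] = "in progress"
        else:
            status[sub.name] = "not started"
    return status
```

Calling component_status(Path("<job_cache>/<job_name>")) would return a mapping such as {"data": "complete", "sample_0000": "in progress"}, depending on the state of the cache.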

There are three classifications of outputs that we will describe in turn:

  1. Universal outputs - These are always generated by a SOPRANO pipeline run, and cannot be changed after they are created.
  2. Pipeline outputs - These are generated for each SOPRANO pipeline run. Once created, these will not be overwritten or deleted. More pipeline runs can be generated via launching more samples.
  3. KDE outputs - These will be overwritten each time SOPRANO is run. They are relatively quick to compute on-the-fly, and are determined by the KDE parameters and number of samples defined.

Universal outputs

pipeline.params

This file describes the pipeline job's parameters and is written to the root of the job folder. It serves two purposes:

  1. Users can refer to this file to contextualize results;
  2. SOPRANO can use this file to ensure that only pipeline definitions with identical inputs can be cached in the same location.

Note: Whilst these definitions ensure that conflicting data cannot be written to the same place, they do not forbid a difference in n_samples. This means that users can iteratively scale up their analysis as and when needed.

An example pipeline.params file is displayed below.

pipeline.params
{
    "bed_path": "/path/to/SOPRANO/data/example_immunopeptidomes/TCGA-05-4396.Expressed.IEDBpeps.SB.epitope.bed",
    "exclude_drivers": "True",
    "fasta": "/path/to/SOPRANO/ensembl_downloads/homo_sapiens/110_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa",
    "input_path": "/path/to/SOPRANO/data/example_annotations/TCGA-05-4396-01A-21D-1855-08.annotated",
    "job_cache": "/tmp/test_333",
    "protein_transcript_length": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcript_protein.length",
    "random_regions": "None",
    "seed": "333",
    "sizes": "/path/to/SOPRANO/ensembl_downloads/homo_sapiens/110_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.chrom",
    "transcript_fasta": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcriptID.fasta",
    "transcript_length": "/path/to/SOPRANO/data/aux_soprano/ensemble_transcript.length",
    "use_ssb192": "False"
}
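To illustrate how a parameters file like this can guard a cache location, here is a minimal sketch that compares a cached pipeline.params against a requested set of parameters. The function params_conflicts and the string-based comparison are assumptions made for illustration; SOPRANO's own check may work differently.

```python
import json
from pathlib import Path

# Placeholder used when a key is absent on one side (hypothetical).
_MISSING = "<missing>"


def params_conflicts(params_file: Path, requested: dict) -> dict:
    """Return {key: (cached, requested)} for every differing parameter.

    Values are compared as strings because the example file above stores
    every value as a JSON string (e.g. "True", "333", "None").
    """
    cached = json.loads(params_file.read_text())
    keys = set(cached) | set(requested)
    return {
        key: (cached.get(key, _MISSING), requested.get(key, _MISSING))
        for key in sorted(keys)
        if str(cached.get(key, _MISSING)) != str(requested.get(key, _MISSING))
    }
```

An empty result means the requested run is compatible with the cached definition. Since n_samples does not appear in the file, differing sample counts would not register as a conflict in this sketch.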

Pipeline outputs

*.log

This is the log file for the individual SOPRANO pipeline runs.

*.results.tsv

Each data/sample will produce a TSV file in the following format:

Coverage ON_dnds ON_Low_CI ON_High_CI ON_Mutations OFF_dNdS OFF_Low_CI OFF_High_CI OFF_Mutations Pvalue ON_na ON_NA ON_ns ON_NS OFF_na OFF_NA OFF_ns OFF_NS
Exonic_Only
Exonic_Intronic

Only the first row (Exonic_Only) is generated unless there is a non-zero rate of intronic mutations, in which case the second row (Exonic_Intronic) is also produced. The SOPRANO algorithm uses intronic mutations to improve the background counts of silent mutations.
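Because a results file may contain one or two rows, it can be convenient to key the rows by their Coverage label. The sketch below assumes the file is tab-delimited with the header row shown above. The function name read_results is hypothetical.

```python
import csv
from pathlib import Path


def read_results(tsv_path: Path) -> dict[str, dict[str, str]]:
    """Map each Coverage label to its row of raw (string) values.

    Assumption: the first column is named "Coverage" and the file is
    tab-delimited, as in the header shown above.
    """
    with tsv_path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Coverage"]: row for row in reader}
```

With this mapping, read_results(path).get("Exonic_Intronic") evaluates to None when the second row is absent.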

The empirical counts of mutation numbers are computed on the fly by the SOPRANO pipeline. The statistical quantities in this table are analytic estimates based on the empirical mutation counts, obtained using the Katz method.

  • ON_dnds dN/dS of the target region provided in the bed file

  • ON_Low_CI lower value for the 95% CI of the target

  • ON_High_CI upper value for the 95% CI of the target

  • ON_Mutations number of mutations observed inside the target region

  • OFF_dNdS dN/dS of the OFF-target region, i.e. outside the target region provided in the bed file

  • OFF_Low_CI lower value for the 95% CI of the OFF-target

  • OFF_High_CI upper value for the 95% CI of the OFF-target

  • OFF_Mutations number of mutations observed outside the target region

  • Pvalue P-value estimated from the comparison of the confidence intervals from ON and OFF dN/dS values

  • ON_na Observed number of nonsilent mutations ON target

  • ON_NA Number of nonsilent sites (corrected) ON target

  • ON_ns Observed number of silent mutations ON target

  • ON_NS Number of silent sites (corrected) ON target

  • OFF_na Observed number of nonsilent mutations OFF target

  • OFF_NA Number of nonsilent sites (corrected) OFF target

  • OFF_ns Observed number of silent mutations OFF target

  • OFF_NS Number of silent sites (corrected) OFF target
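To make the relationship between the counts and the reported intervals concrete, the sketch below applies the textbook Katz log method for a ratio of two proportions to the quantities listed above. It is an illustration under assumptions: the counts are treated as simple binomial-style proportions na/NA and ns/NS, and SOPRANO's own implementation may differ in detail. The function name katz_interval is hypothetical.

```python
import math


def katz_interval(na, NA, ns, NS, z=1.96):
    """Return (dN/dS, lower, upper) using the Katz log method.

    na, ns: observed nonsilent and silent mutation counts (must be > 0).
    NA, NS: corresponding (corrected) numbers of sites.
    z:      normal quantile; 1.96 gives an approximate 95% interval.
    """
    ratio = (na / NA) / (ns / NS)
    # Standard error of log(ratio) for a ratio of two proportions.
    std_err = math.sqrt(1 / na - 1 / NA + 1 / ns - 1 / NS)
    lower = ratio * math.exp(-z * std_err)
    upper = ratio * math.exp(z * std_err)
    return ratio, lower, upper
```

The interval is symmetric on the log scale, so lower * upper equals ratio squared. The method is undefined when either observed count is zero.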

intermittent.data.tar.gz

Compressed tar archive containing the intermittent data produced by the SOPRANO pipeline run for the data/sample.


KDE outputs

When the number of samples --n_samples is specified and non-zero, SOPRANO performs a kernel density estimate based on the sample files. This is the basis of a numerical downstream analysis that produces an additional set of files.

figure.pdf

A figure containing histogram data of the samples, overlaid with the corresponding kernel densities, with vertical markers for the data dN/dS values.

samples_df.csv

A vertical concatenation of the sample outputs used to estimate the kernel density. It pulls from the sample_XXXX.results.tsv files.
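A vertical concatenation of this kind can be sketched with the standard library alone. The example below is not SOPRANO's implementation: the function name concat_sample_results and the extra "sample" column recording the source directory are assumptions, and the real samples_df.csv layout may differ.

```python
import csv
from pathlib import Path


def concat_sample_results(job_dir: Path, out_csv: Path) -> int:
    """Stack all sample_XXXX.results.tsv rows into one CSV file.

    Returns the number of data rows written. Assumes every sample file
    shares the same tab-delimited header.
    """
    n_rows = 0
    with out_csv.open("w", newline="") as out_handle:
        writer = None
        for tsv in sorted(job_dir.glob("sample_*/sample_*.results.tsv")):
            with tsv.open(newline="") as handle:
                reader = csv.DictReader(handle, delimiter="\t")
                if reader.fieldnames is None:
                    continue  # skip empty files
                if writer is None:
                    # Hypothetical extra column naming the source sample.
                    fields = ["sample"] + list(reader.fieldnames)
                    writer = csv.DictWriter(out_handle, fieldnames=fields)
                    writer.writeheader()
                for row in reader:
                    writer.writerow({"sample": tsv.parent.name, **row})
                    n_rows += 1
    return n_rows
```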

samples_df.meta

Contains the metadata describing which sample results were used to estimate the KDE.

kde_config.json

The parameters used to perform the KDE. This provides an explicit record of the parameter space searched to fit the Gaussian kernel, and of the parameters used to integrate the KDE to obtain p-values.

statistics.json

Contains the statistical inferences from the samples and kernel density estimates. The general structure is given below. The exonic_intronic key and its values will only exist if the corresponding results are found in the data.

The samples key contains summary statistics from the samples, namely their mean dN/dS values and standard deviation.

The data key contains the dN/dS value for the non-randomized input run through SOPRANO. The p_value_left and p_value_right values are computed by integrating the kernel density from the left and right tails respectively, up to the data dN/dS value.

statistics.json
{
  "ON_dNdS": {
    "exonic_only": {
      "samples": {
        "mean_value": ...,
        "std_dev": ...
      },
      "data": {
        "value": ...,
        "p_value_left": ...,
        "p_value_right": ...
      }
    },
    "exonic_intronic": {
      "samples": {
        "mean_value": ...,
        "std_dev": ...
      },
      "data": {
        "value": ...,
        "p_value_left": ...,
        "p_value_right": ...
      }
    }
  },
  "OFF_dNdS": {
    "exonic_only": {
      "samples": {
        "mean_value": ...,
        "std_dev": ...
      },
      "data": {
        "value": ...,
        "p_value_left": ...,
        "p_value_right": ...
      }
    },
    "exonic_intronic": {
      "samples": {
        "mean_value": ...,
        "std_dev": ...
      },
      "data": {
        "value": ...,
        "p_value_left": ...,
        "p_value_right": ...
      }
    }
  }
}
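The tail integration behind p_value_left and p_value_right can be illustrated for a Gaussian kernel, where the integral has a closed form: the left tail mass is the average of the normal CDF evaluated at (data value - sample) / bandwidth. This is a minimal sketch with a fixed, user-supplied bandwidth. The function name kde_tail_p_values and its signature are assumptions; SOPRANO itself selects kernel parameters via the search recorded in kde_config.json and may integrate numerically.

```python
import math


def kde_tail_p_values(samples, data_value, bandwidth):
    """Left and right tail masses of a Gaussian KDE, split at data_value.

    samples:    sample dN/dS values used to build the density.
    data_value: dN/dS value of the non-randomized data.
    bandwidth:  kernel width (assumed fixed here for illustration).
    """

    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    # Each kernel contributes its own CDF; the KDE is their average.
    left = sum(
        normal_cdf((data_value - sample) / bandwidth) for sample in samples
    ) / len(samples)
    return left, 1.0 - left
```

In this sketch the two values always sum to one, since together they cover the whole density.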