Config parameter file
- Configuration files tell TEFLoN2 which tools, input files, and parameters to use.
TEFLoN2 has two configuration files:
Workflow configuration
Cluster configuration
Configuration of TEFLoN2
TEFLoN2 uses Snakemake to run its analyses, so you must first provide your parameters in a .yaml file (see the config.yaml file for an example). The parameters are:
# all paths can be relative or absolute
DEPENDANCES:
PYTHON3: "/path/to/python3" #[required] path to python 3 executable
BWA: "path/to/bwa" #[required] path to bwa executable
SAMTOOLS: "path/to/samtools" #[required] path to samtools executable
REPEATMASKER: "path/to/repeatmasker" #[required if prep_custom] path to RepeatMasker executable
DATA_INPUT:
WORKING_DIRECTORY: "data_input" # [required] path to the directory tree containing your input data. Default folder is data_input
GENOME: "name_reference_file.fasta" #[required] name of reference file in data_input/reference
ANNOTATION: "name_annotation_file.bed" #[required if prep_annotation] name of annotation file in data_input/library
LIBRARY: "name_TE_library.fasta" #[required if prep_custom][optional if prep_annotation] name of TE library file for your organism in data_input/library
HIERARCHY: "name_TE_hierarchy.txt" #[required if prep_annotation] name of TE hierarchy file in data_input/library
PARAMS:
GENERAL:
WORKING_DIRECTORY: "" #[optional] path to working directory. Default: in the current directory
PREFIX: "name_run" #[required] name of run
MEMORY: 30000 #[required] memory in MB to be used by the rules. If a rule fails, increase this number
MEMORY_SUPP: 20000 #[required] extra memory in MB added to MEMORY when a rule that failed for lack of memory is rerun through the --retries option of Snakemake
COMPRESS:
THREADS: 10 #[required] maximum number of threads the rule can use
SAMTOOLS:
THREADS: 10 #[required] maximum number of threads the rule can use
PREP_CUSTOM:
SPECIES: "" #Specify the species or clade of the input sequence. The species name must be a valid NCBI Taxonomy Database species name and be contained in the RepeatMasker repeat database. Some examples are: human, mouse, rattus, "ciona savignyi", arabidopsis, mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu, danio, "ciona intestinalis", drosophila, anopheles, worm, diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize
CUTOFF: "" #[optional] SW cutoff score for RepeatMasker, default=250
FRAG: "" #[optional] Maximum sequence length masked without fragmenting for RepeatMasker, default=60000
ENGINE: "" #[optional][crossmatch|wublast|abblast|ncbi|rmblast|hmmer] Use an alternate search engine to the default for RepeatMasker.
MIN_LENGTH: "" #[optional] minimum length for RepeatMasker-predicted TE to be reported in the final annotation, default=200
SPLIT_DIST: "" #[optional] minimum length for RepeatMasker-predicted TE to be reported in the final annotation, default=200
DIVERGENCE: "" #[optional] only those repeats < x percent diverged from the consensus seq will be included in final annotation, default=20
THREADS: 16 #[required] maximum number of threads the rule can use
DISCOVER:
LEVEL_HIERARCHY1: "family" #[required] level of the hierarchy file to guide initial TE search. #NOTE: It is recommended that you use the lowest level in the hierarchy file (i.e. "family" for data without a user-curated hierarchy)
LEVEL_HIERARCHY2: "class" #[required] level of the hierarchy to group similar TEs. #NOTE: This must be either the same level of the hierarchy used in -l1 or a higher level (clustering at higher levels will reduce the number of TE instances found, but improve accuracy for discriminating TE identity)
QUALITY: "20" #[required] map quality threshold #NOTE: Mapped reads with map qualities lower than this number will be discarded
EXCLUDE: "" #[optional] newline separated file containing the name of any TE families to exclude from analysis #NOTE: Use same names as in column one of the hierarchy file
STANDARD_DEVIATION: "" #[optional] insert size standard deviation #NOTE: Used to manually override the insert size StdDev identified by samtools stat (check this number in the generated stats.txt file to ensure it seems more or less correct based on knowledge of sequencing library!)
COVERAGE_OVERRIDE: "" #[optional] coverage override #Note: Used to manually override the coverage estimate if you get the error: "Warning: coverage could not be estimated"
THREADS: 16 #[required] maximum number of threads the rule can use
COLLAPSE:
THRESHOLD_SAMPLE: "1" #[required] TEs must be supported by >= n reads in at least one sample
THRESHOLD_ALL: "1" #[required] TEs must be supported by >= n reads summed across all samples
COVERAGE_OVERRIDE: "" #[optional] coverage override #Note: Used to manually override the coverage estimate if you get the error: "Warning: coverage could not be estimated"
QUALITY: "20" #[required] map quality threshold
THREADS: 16 #[required] maximum number of threads the rule can use
COUNT:
QUALITY: "20" #[required] map quality threshold
THREADS: 20 #[required] maximum number of threads the rule can use
GENOTYPE:
THRESHOLD_LOWER: "1" #[optional] sites genotyped as -9 if adjusted read counts lower than this threshold, default=1
THRESHOLD_HIGHER: "100" #[optional] sites genotyped as -9 if adjusted read counts higher than this threshold, default=mean_coverage + 2*STDEV
DATA_TYPE: "pooled" #[required] must be either haploid, diploid, or pooled
SAMPLE:
THRESHOLD_ABSENCE: "" #[optional] lower threshold used to define whether insertions are present, polymorphic, heterozygous, absent or no data. Value between 0 and 1 (default=0.05)
THRESHOLD_PRESENCE: "" #[optional] upper threshold used to define whether insertions are present, polymorphic, heterozygous, absent or no data. Value between 0 and 1 (default=0.95)
POPULATION:
FILE: "" #[optional] path to population file
THRESHOLD_ABSENCE: "" #[optional] lower threshold used to define whether insertions are present, polymorphic or absent at population level. Value between 0 and 1 (default=0.05)
THRESHOLD_PRESENCE: "" #[optional] upper threshold used to define whether insertions are present, polymorphic or absent at population level. Value between 0 and 1 (default=0.95)
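The THRESHOLD_ABSENCE and THRESHOLD_PRESENCE parameters can be read as two cut-points on the estimated frequency of an insertion. The following is a minimal sketch of that reading, not TEFLoN2's actual code: the function name `classify_insertion`, the inclusive comparisons, the single "polymorphic" label for the intermediate class, and the use of `None` for missing data are assumptions made for illustration. Only the default values 0.05 and 0.95 come from the parameter descriptions.

```python
def classify_insertion(frequency, absence=0.05, presence=0.95):
    """Classify a TE insertion from its estimated frequency (0 to 1).

    Hypothetical helper: `frequency` is None when no estimate is available.
    """
    if frequency is None:
        return "no data"
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if not 0.0 <= absence <= presence <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= absence <= presence <= 1")
    # Assumption: values at or below the lower threshold count as absent.
    if frequency <= absence:
        return "absent"
    # Assumption: values at or above the upper threshold count as present.
    if frequency >= presence:
        return "present"
    # Everything in between is treated as polymorphic in this sketch.
    return "polymorphic"
```

Under these assumptions, narrowing the gap between the two thresholds moves more insertions into the "absent" or "present" classes and fewer into the intermediate one.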
GENOME: FASTA file containing the reference genome of the species of interest.
LIBRARY: Multi-FASTA file containing the canonical sequences of transposable elements
ANNOTATION: BED file containing the TE annotation in the reference genome.
LEVEL_HIERARCHY1/2, THRESHOLD_SAMPLE/ALL: Required hierarchy levels and read-support thresholds for running TEFLoN2
QUALITY: Map quality threshold for each step
TOOLS: Paths to the tools (the DEPENDANCES section)
THREADS: Number of threads to execute a step
THRESHOLD: Threshold for data analysis
DATA_TYPE: Type of data for different interpretations
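Before launching a run, it can help to check that the configured tool paths and input files resolve as expected. The sketch below is only an illustration and is not part of TEFLoN2: the function names `find_missing_tools` and `expected_input_paths` are hypothetical, and the `reference` and `library` subfolders are taken from the parameter comments in the workflow configuration.

```python
import os
import shutil


def find_missing_tools(dependencies):
    """Return the names of tools whose configured path is not executable.

    `dependencies` maps a tool name (e.g. "BWA") to a path or command name.
    """
    missing = []
    for name, path in dependencies.items():
        # shutil.which accepts either a bare command name or a full path.
        if shutil.which(path) is None:
            missing.append(name)
    return missing


def expected_input_paths(working_directory, genome, library=None,
                         annotation=None, hierarchy=None):
    """Build the input file paths implied by the DATA_INPUT section."""
    paths = {"GENOME": os.path.join(working_directory, "reference", genome)}
    # LIBRARY, ANNOTATION and HIERARCHY are all described as living in
    # the "library" subfolder of the input directory.
    for key, name in (("LIBRARY", library),
                      ("ANNOTATION", annotation),
                      ("HIERARCHY", hierarchy)):
        if name:
            paths[key] = os.path.join(working_directory, "library", name)
    return paths
```

A typical use would be to call `find_missing_tools` on the values of the DEPENDANCES section and to test each path returned by `expected_input_paths` with `os.path.isfile` before starting Snakemake.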
Cluster configuration
As an example, a cluster configuration file is provided, but it is not exhaustive and is specific to the cluster we have used.
__default__:
partition: fast
cpus: 1
output: "logs_slurm/{rule}.{wildcards}.out" ## redirect slurm-JOBID.txt to your directory
error: "logs_slurm/{rule}.{wildcards}.err" ## redirect slurm-JOBID.txt to your directory
mem: 2000
teflon_prep_custom:
cpus: "{threads}" ## => use the `threads` value defined in the rule
mem: "{resources.mem_mb}"
mapping:
cpus: "{threads}"
mem: "{resources.mem_mb}"
samtools_view:
cpus: "{threads}"
mem: "{resources.mem_mb}"
teflon_discover:
cpus: "{threads}"
mem: "{resources.mem_mb}"
teflon_collapse:
cpus: "{threads}"
mem: "{resources.mem_mb}"
teflon_count:
cpus: "{threads}"
mem: "{resources.mem_mb}"
teflon_genotype_individual:
mem: "{resources.mem_mb}"
teflon_genotype_all:
mem: "{resources.mem_mb}"
teflon_genotype_population:
mem: "{resources.mem_mb}"
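To make the role of `__default__` and of placeholders such as `{threads}` and `{resources.mem_mb}` more concrete, here is a rough sketch of how the settings for one job could be resolved. It is a simplified illustration, not Snakemake's or TEFLoN2's code: the function names are hypothetical, and the way MEMORY_SUPP is added on each retry is an assumption based on the parameter description in the workflow configuration.

```python
from types import SimpleNamespace


def memory_for_attempt(memory, memory_supp, attempt):
    """Memory in MB for a given attempt (1 on the first try).

    Assumption: MEMORY_SUPP is added once per retry.
    """
    return memory + memory_supp * (attempt - 1)


def resolve_cluster_settings(cluster_config, rule, threads, mem_mb,
                             wildcards=""):
    """Merge `__default__` with the rule entry and fill in placeholders."""
    # Rule-specific keys override the defaults; other defaults are kept.
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule, {}))
    context = {
        "rule": rule,
        "threads": threads,
        "wildcards": wildcards,
        "resources": SimpleNamespace(mem_mb=mem_mb),
    }
    # Only string values can contain placeholders; numbers pass through.
    return {
        key: value.format(**context) if isinstance(value, str) else value
        for key, value in settings.items()
    }


if __name__ == "__main__":
    example = {
        "__default__": {"partition": "fast", "cpus": 1, "mem": 2000},
        "teflon_discover": {"cpus": "{threads}",
                            "mem": "{resources.mem_mb}"},
    }
    mem = memory_for_attempt(30000, 20000, attempt=2)
    print(resolve_cluster_settings(example, "teflon_discover", 16, mem))
```

In this reading, a rule without its own entry falls back entirely to `__default__`, while the three genotype rules, which set only `mem`, keep the default `cpus: 1`.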