===================== Config parameter file ===================== Configuration files are used to tell TEFLoN2 what to use. TEFLoN2 has two configuration files: * Workflow configuration * Cluster configuration Configuration of TEFLoN2 ------------------------ TEFLoN2 uses Snakemake to perform its analyses. You have then first to provide your parameters in a .yaml file (see an example in the config.yaml file). Parameters are : .. code-block:: yaml #all path can be relatif or absolute DEPENDANCES: PYTHON3: "/path/to/python3" #[required] path to python 3 executable BWA: "path/to/bwa" #[required] path to bwa executable SAMTOOLS: "path/to/samtools" #[required] path to samtools executable REPEATMASKER: "path/to/repeatmasker" #[requiered if prep_custom] path to repeatmasker executable DATA_INPUT: WORKING_DIRECTORY: "data_input" # [required] path to arbresecne of your data input. Default folder is data_input GENOME: "name_reference_file.fasta" #[required] name of reference file in data_input/reference ANNOTATION: "name_annotation_file.bed" #[required if prep_annotation] name of annotation file in data_input/library LIBRARY: "name_TE_library.fasta" #[required if prep_custom][optional if prep_annotation] name of TE library file for your organism in data_input/library HIERARCHY: "name_TE_hirarchiy.txt" #[requiered if prep_annotation] name of TE hirarchy file in data_input/library PARAMS: GENERAL: WORKING_DIRECTORY: "" #[optional] path to working directory. Default: in the current directory PREFIX: "name_run" #[required] name of run MEMORY: 30000 #[required] Memory in mb to be used by the rules. If a rule does not work increase this number MEMORY_SUPP: 20000 #[required] If you use the --retries option of snakemake, if a rule fails to use memory, add to the memory this extra. COMPRESS: THREADS: 10 #[required] maximum number of threads the rule can use SAMTOOLS: THREADS: 10 #[required] maximum number of threads the rule can use PREP_CUSTOM: SPECIES: "" #Specify the species or clade of the input sequence. The species name must be a valid NCBI Taxonomy Database species name and be containedin the RepeatMasker repeat database. Some examples are: human,mouse,rattus,"ciona savignyi",arabidopsis,mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,danio, "ciona intestinalis" drosophila, anopheles, worm, diatoaea,artiodactyl, arabidopsis, rice, wheat, and maize CUTOFF: "" #[optional] SW cutoff score for RepeatMasker, default=250 FRAG: "" #[optional] Maximum sequence length masked without fragmenting for RepepatMasker (default 60000) ENGINE: "" #[optional][crossmatch|wublast|abblast|ncbi|rmblast|hmmer] Use an alternate search engine to the default for RepeatMasker. MIN_LENGTH: "" #[optional] minimum length for RepeatMasker-predicted TE to be reported in the final annotation, default=200 SPLIT_DIST: "" #[optional] minimum length for RepeatMasker-predicted TE to be reported in the final annotation, default=200 DIVERGENCE: "" #[optional] only those repeats < x percent diverged from the consensus seq will be included in final annotation, default=20 THREADS: 16 #[required] maximum number of threads the rule can use DISCOVER: LEVEL_HIERARCHY1: "family" #[required] level of the hierarchy file to guide initial TE search. #NOTE: It is recommended that you use the lowest level in the hierarchy file (i.e. "family" for data without a user-curated hierarchy) LEVEL_HIERARCHY2: "class" #[required] level of the hierarchy to group similar TEs. #NOTE: This must be either the same level of the hierarchy used in -l1 or a higher level (clustering at higher levels will reduce the number of TE instances found, but improve accuracy for discriminating TE identity) QUALITY: "20" #[required] map quality threshold #NOTE: Mapped reads with map qualities lower than this number will be discarded EXCLUDE: "" #[optional] newline separated file containing the name of any TE families to exclude from analysis #NOTE: Use same names as in column one of the hierarchy file STANDARD_DEVIATION: "" #[optional] insert size standard deviation #NOTE: Used to manually override the insert size StdDev identified by samtools stat (check this number in the generated stats.txt file to ensure it seems more or less correct based on knowledge of sequencing library!) COVERAGE_OVERRIDE: "" #[optional] coverage override #Note: Used to manually override the coverage estimate if you get the error: "Warning: coverage could not be estimated" THREADS: 16 #[required] maximum number of threads the rule can use COLLAPSE: THRESHOLD_SAMPLE: "1" #[required] TEs must be supported by >= n reads in at least one sample THRESHOLD_ALL: "1" #[required] TEs must be supported by >= n reads summed across all samples COVERAGE_OVERRIDE: "" #[optional] coverage override #Note: Used to manually override the coverage estimate if you get the error: "Warning: coverage could not be estimated" QUALITY: "20" #[required] map quality threshold THREADS: 16 #[required] maximum number of threads the rule can use COUNT: QUALITY: "20" #[required] map quality threshold THREADS: 20 #[required] maximum number of threads the rule can use GENOTYPE: THRESHOLD_LOWER: "1" #[optinal] sites genotyped as -9 if adjusted read counts lower than this threshold, default=1 THRESHOLD_HIGHER: "100" #[optinal] sites genotyped as -9 if adjusted read counts higher than this threshold, default=mean_coverage + 2*STDEV DATA_TYPE: "pooled" #[required] must be either haploid, diploid, or pooled SAMPLE: THRESHOLD_ABSENCE: "" #[optinal] lower threshold used to define whether insertions are present, polymorphic, heterozygous, absent or no data. Value between 0 and 1 (default=0.05) THRESHOLD_PRESENCE: "" #[optinal] hight threshold used to define whether insertions are present, polymorphic, heterozygous, absent or no data. Value between 0 and 1 (default=0.95) POPULATION: FILE: "" #[optional] path to population file THRESHOLD_ABSENCE: "" #[optinal] lower threshold used to define whether insertions are present, polymorphic or absent at population level. Value between 0 and 1 (default=0.05) THRESHOLD_PRESENCE: "" #[optinal] Hight threshold used to define whether insertions are present, polymorphic or absent at population level.Value between 0 and 1 (default=0.95) * GENOME: Fasta file containing the reference genome of the species of interest. * LIBRARY: A Multifasta file containing the canonical sequence of transposable elements * ANNOTATION: BED file containing the TE annotation in reference genome. * LEVEL_HIERARCHY1/2, THEASHOLD_SAMPLE/ALL: Required thresholds to operate TEFLoN2 * QUALITY: Map quality threshold for each steap * TOOLS: Path to the tools * THREADS: Number of threads to execute a step * THRESHOLD : Threshold for data analysis * DATA_TYPE: Type of data for different interpretations Config Cluster -------------- As an example, a cluster configuration file is provided, but it is not exhaustive and is specific to the cluster we have used. .. code-block:: yaml __default__: partition: fast cpus: 1 output: "logs_slurm/{rule}.{wildcards}.out" ## redirect slurm-JOBID.txt to your directory error: "logs_slurm/{rule}.{wildcards}.err" ## redirect slurm-JOBID.txt to your directory mem: 2000 teflon_prep_custom: cpus: "{threads}" ## => use `threads` define in rule mem: "{resources.mem_mb}" mapping: cpus: "{threads}" mem: "{resources.mem_mb}" samtools_view: cpus: "{threads}" mem: "{resources.mem_mb}" teflon_discover: cpus: "{threads}" mem: "{resources.mem_mb}" teflon_collapse: cpus: "{threads}" mem: "{resources.mem_mb}" teflon_count: cpus: "{threads}" mem: "{resources.mem_mb}" teflon_genotype_individual: mem: "{resources.mem_mb}" teflon_genotype_all: mem: "{resources.mem_mb}" teflon_genotype_population: mem: "{resources.mem_mb}"