#############################################################
### Metadata file for the dataset "eneotenicus-genome_v1" ###
#############################################################
Author: Ondrej Luksan
Date: 11/06/2025


1. Dataset composition
	The dataset consists of five files packaged in a single .zip archive. The files are as follows:
	* eneotenicus-genome_v1.fa - draft genome assembly in FASTA format
	* eneotenicus-genome_v1.gtf - gene annotation in GTF format
	* eneotenicus-genome_v1.cds - annotated protein-coding sequences in FASTA format
	* eneotenicus-genome_v1.aa - annotated protein sequences in FASTA format
	* eneotenicus-genome_v1.md - metadata describing how the dataset was generated
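
	The file list above can be checked against a downloaded archive with a short script. The following is a minimal sketch in Python, not part of the dataset itself; the function names are hypothetical and the archive path must be supplied by the user, since the name of the .zip file is not stated here.

```python
import zipfile

# The five files listed in section 1 of this metadata file.
EXPECTED = {
    f"eneotenicus-genome_v1.{ext}" for ext in ("fa", "gtf", "cds", "aa", "md")
}


def missing_files(names):
    """Return the expected files that do not appear in `names`.

    Only base names are compared, so files stored inside a sub-folder
    of the archive are still recognised.
    """
    present = {name.rsplit("/", 1)[-1] for name in names}
    return sorted(EXPECTED - present)


def check_archive(zip_path):
    """List expected files missing from the .zip archive at `zip_path`.

    `zip_path` is supplied by the user (the archive name is an assumption).
    """
    with zipfile.ZipFile(zip_path) as archive:
        return missing_files(archive.namelist())
```

	An empty list returned by check_archive means that all five files are present.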

2. Material and sequencing
	High-molecular-weight DNA was isolated by organic extraction from 4 male and 8 female nymphs of Embiratermes neotenicus (whole bodies without the digestive tube). The nymphs were collected in French Guiana in 2022 at a tropical forest locality along the road to Petit Saut (N5°04.250′ W52°58.770′–N5°04.650′ W53°01.360′). High-throughput sequencing was performed on one Oxford Nanopore PromethION flow cell and yielded 5,924,978 reads in total, with a median read length of 13,664 bp, corresponding to ~82 Gb.
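
	Read count, median read length, and total yield of the kind reported above can be recomputed from a FASTQ file. The sketch below is illustrative only (the reported values were taken from the tools named in section 3, not from this code). It assumes plain four-line FASTQ records with unwrapped sequences, and the function names are hypothetical.

```python
import statistics


def fastq_read_lengths(lines):
    """Yield the read length of each record in a four-line FASTQ stream.

    Assumes every record occupies exactly four lines (header, sequence,
    '+', qualities), i.e. sequences are not wrapped over several lines.
    """
    for index, line in enumerate(lines):
        if index % 4 == 1:  # the second line of each record is the sequence
            yield len(line.strip())


def summarize_reads(lengths):
    """Return read count, median read length, and total number of bases."""
    lengths = list(lengths)
    return {
        "reads": len(lengths),
        "median_length": statistics.median(lengths),
        "total_bases": sum(lengths),
    }
```

	For a file on disk, an open text handle can be passed directly, for example summarize_reads(fastq_read_lengths(open(path))).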
	
3. Data processing and genome assembly
	The quality of the Oxford Nanopore sequencing data was evaluated using FastQC v0.12.1 and NanoPlot v1.30.1, and residual sequencing adapters were removed using Porechop v0.2.4. The draft genome was assembled using Flye v2.9.4 with 5 subsequent polishing iterations. High-quality Illumina RNA sequencing data were mapped to the newly assembled draft genome with the Segemehl aligner v0.3.4 and used for an additional round of assembly polishing with Pilon v1.24. In the final step, all contigs shorter than 1 kb were removed, and the quality of the draft genome assembly was evaluated with QUAST v5.0.2, BUSCO v5.4.4 using the insecta_odb10 HMM library, and Merqury using Meryl v1.3.
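
	The final length filter can be illustrated with a few lines of Python. This is a minimal sketch, not the command actually used for the dataset; it assumes that "1 kb" means 1,000 bp and that contigs of exactly 1,000 bp are kept. The function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_short_contigs(records, min_length=1000):
    """Keep only contigs with at least `min_length` bases (assumed 1 kb = 1,000 bp)."""
    for header, sequence in records:
        if len(sequence) >= min_length:
            yield header, sequence
```

	Applied to the released eneotenicus-genome_v1.fa, such a filter should leave the set of contigs unchanged, because short contigs were already removed.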
	The draft genome contains 99.4% complete BUSCOs (97.4% single-copy and 2.0% duplicated), with 0.4% fragmented and 0.2% missing, out of a total of 1,367 benchmarking universal single-copy orthologs (BUSCOs) of the insecta_odb10 lineage. After inspection of low-complexity regions and genome masking using the EDTA pipeline, we performed an in silico annotation of protein-coding genes. The annotation combined exon discovery based on mapping of caste-specific transcriptomic data with sequence homology predictions using gene sets from published isopteran assemblies of Zootermopsis nevadensis, Cryptotermes secundus, and Reticulitermes speratus, within one round of the BRAKER3 automated pipeline. Another 33 genes were annotated manually. In total, this approach yielded 26,293 gene models covering 92.8% of insect BUSCOs. Protein and protein-coding sequences were obtained from the genomic FASTA based on the GTF annotation using the AGAT toolkit.
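
	As a rough consistency check between the .cds and .aa files, each protein-coding sequence can be translated and compared with the corresponding protein record. The sketch below is not how AGAT works internally; it assumes the standard nuclear genetic code, sequences that start in frame at the first base, and that a terminal stop codon is not written to the protein file. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, strip_stop=True):
    """Translate a coding sequence into a protein string.

    Codons containing ambiguous bases are translated as 'X'. Trailing
    bases that do not form a complete codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )
    if strip_stop and protein.endswith("*"):
        protein = protein[:-1]
    return protein
```

	Records whose translation differs from the .aa entry are not necessarily errors; they may reflect incomplete gene models or a different handling of stop codons.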
