Other programs for "Input a Variant (VCF) file".
There are several variations of the program for registering variant information in the database. Which one to use depends on the DB mode and the number of samples (= accessions).
- 1. For standard datasets (fewer than 1000 samples)
See here for standard datasets, that is, when:
- The DB mode is "Multi-sample VCF(mvcf)"
- And the number of samples (=accessions) is less than 1000
- 2. For large datasets (more than 1000 samples)
Use this program in the following cases:
- The DB mode is "Multi-sample VCF(mvcf)"
- And the number of samples (= accessions) exceeds 1000, and you want to register the data in the database as quickly as possible
- And the host server has sufficient memory (10GB or more, depending on the size of the dataset)
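Since the choice depends on the number of samples, it can help to count them before picking a program. The following is a minimal sketch, not part of TASUKE: it relies only on the VCF convention that sample names follow the nine fixed columns of the "#CHROM" header line. The helper names and the "standard"/"large" labels are hypothetical, and treating exactly 1000 samples as "standard" is an assumption, since the text above does not specify that case.

```python
FIXED_VCF_COLUMNS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def vcf_sample_names(lines):
    """Return the sample names from the '#CHROM' header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\r\n").split("\t")[FIXED_VCF_COLUMNS:]
        if not line.startswith("#"):
            break  # data lines reached without finding the header
    raise ValueError("no #CHROM header line found")


def suggest_variant(n_samples, threshold=1000):
    """Hypothetical helper: 'large' above the threshold, else 'standard'."""
    return "large" if n_samples > threshold else "standard"


# Tiny in-memory header for illustration; a real file would be read line by line.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
names = vcf_sample_names(["##fileformat=VCFv4.2\n", header])
print(len(names), suggest_variant(len(names)))
```

For a real VCF, passing an open file object (or a gzip text stream) as `lines` works the same way, because only the header lines are read.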
This program requires 10GB or more of memory and generates hundreds of GB of temporary files. If there is not enough memory, an OutOfMemory error may occur, causing the MySQL (or MariaDB) process to crash.
The "-k" option only checks the correspondence between VCF samples and DB AccessionIDs; DB registration is not performed. It is strongly recommended to perform this check before DB registration.
Increasing the number of threads (-t) increases the required memory linearly. It is safer to leave this at the default or reduce it.
Command:
$ tasuke_variant_vcf_multi_over1000s.pl -db <database name> -u <user> -p <password> -f <VCF file>
Either required (*1):
-n <IDs> : Comma-separated AccessionID list (corresponds to the order of VCF sample names) (*2)
-m <Path> : Path to "VCFsampleName > AccID" correspondence table file. VCFsampleName[,]AccID[\n]...
(none) : Consider VCF sample name as AccessionID.
Other options:
-h <remote host> : (default: localhost)
-z : Register GT:0/0 variants (not registered by default)
-t <num> : Number of threads (default: 4)
-b : Parent directory in which a subdirectory for temporary files is created (default: /tmp)
-k : TEST mode. Check AccessionIDs, and output commands for each thread, but do not perform DB registration.
Sub-action:
-a : Append variant information to a table that already contains records. (*3) Records with the same position will be registered multiple times.
-r : Delete variant information for the specified AccessionIDs. The AccessionID itself is not deleted. A multi-sample VCF previously used for DB registration must be specified with "-f".
-q : [Use with "-r"] Truncate the common variant table (sys_variant)
(*1) Priority is -n > -m. Only the AccIDs specified here will be registered.
(*2) See description in tasuke_variant_vcf.pl on this Wiki.
(*3) By default, registration is skipped for tables where records already exist.
An example of the SampleName conversion table file (-m) is as follows. Samples not listed in it will not be registered. The order of the lines does not matter.
VcfSample1,DbAccId1
VcfSample2,DbAccId2
VcfSample3,DbAccId3
......
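To preview which samples such a table would cover, the comma-separated lines can be checked against the VCF sample names before running the "-k" test. This is a rough sketch, not the program's own parser: the function names are hypothetical, and tolerating blank lines and surrounding whitespace is an assumption about the file format.

```python
def parse_mapping(text):
    """Parse 'VcfSampleName,AccID' lines into a dict (line order is irrelevant)."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        sample, sep, acc_id = line.partition(",")
        if not sep or not sample.strip() or not acc_id.strip():
            raise ValueError(f"malformed mapping line: {raw!r}")
        mapping[sample.strip()] = acc_id.strip()
    return mapping


def split_samples(vcf_samples, mapping):
    """Return (registered pairs, skipped sample names) in VCF column order."""
    registered = [(s, mapping[s]) for s in vcf_samples if s in mapping]
    skipped = [s for s in vcf_samples if s not in mapping]
    return registered, skipped


table = "VcfSample2,DbAccId2\nVcfSample1,DbAccId1\n"
registered, skipped = split_samples(
    ["VcfSample1", "VcfSample2", "VcfSample3"], parse_mapping(table)
)
print(registered)  # samples that have an AccessionID in the table
print(skipped)     # samples absent from the table, which would not be registered
```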
- 3. Legacy Programs
Use these programs in the following cases:
- The newer versions of the program above do not work properly. The older versions are generally slower but more stable.
- DB mode can be either "Multi-sample VCF(mvcf)" or "Single-sample VCF(svcf)"
Command:
$ tasuke_variant_vcf.pl -db <database name> -u <user> -p <password> -n <ID> -f <VCF file> -t 'samtools' or 'freebayes' or 'gatk' or 'gatkm'
Required:
-db <database name> : Database name for TASUKE
-u <user> : User name
-p <password> : Password for the database
-n <ID> : Destination ID (accession). For "-t gatkm", a comma-separated list of IDs
-f <variant file> : Variant information (.VCF)
-t 'samtools' or 'gatk' or 'gatkm' : Name of the program that generated the VCF file
'gatkm' means a multi-sample VCF file generated by GATK.
Optional:
-z : For "-t gatkm". Register GT:0/0 variants (not registered by default)
-h <remote host> : Name of the remote host to connect to
-r : Delete the variants from the database.
Apart from the above, there is a parallel wrapper script that is faster when registering a large multi-sample VCF (tens to thousands of samples). Details are described later.
When you set "gatkm" as the program name (-t), specify a comma-separated list of AccessionIDs for "-n" (no spaces). The order of the AccessionIDs corresponds to the order of the samples in the VCF file (sample names in the VCF are not used for DB registration). If there are fewer IDs than samples in the VCF, IDs are mapped starting from the first sample and the excess samples are ignored. To skip samples at the beginning or in the middle of the columns, write only commas in their positions, like "-n ,,,ID1,ID2,,ID4".
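Writing the positional "-n" list by hand is error-prone when many samples are skipped. The sketch below builds the value from the VCF sample order and a sample-to-ID table. The function name and the sample names are hypothetical; rejecting IDs that contain commas or whitespace is an assumption based on the "no spaces" rule above.

```python
def build_n_option(vcf_samples, sample_to_id):
    """Build the comma-separated '-n' value in VCF sample order.

    Samples without an ID become empty fields, so they are skipped.
    """
    fields = []
    for sample in vcf_samples:
        acc_id = sample_to_id.get(sample, "")
        if "," in acc_id or any(ch.isspace() for ch in acc_id):
            raise ValueError(f"ID must not contain commas or spaces: {acc_id!r}")
        fields.append(acc_id)
    # Trailing empty fields are unnecessary: excess samples are ignored anyway.
    while fields and fields[-1] == "":
        fields.pop()
    return ",".join(fields)


samples = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
ids = {"s4": "ID1", "s5": "ID2", "s7": "ID4"}
print(build_n_option(samples, ids))  # prints ,,,ID1,ID2,,ID4
```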
A multi-sample VCF ("-t gatkm") file contains GT:0/0 variants, but they are not registered in the DB by default because they increase the data size and reduce performance. Add the '-z' option to register GT:0/0 variants. GT:0/0 variants are displayed on the track in GT color mode.
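To get a rough feel for how much extra data '-z' would add, one can count the GT:0/0 calls in a few VCF data lines. This is only an illustrative sketch under stated assumptions: the function name is hypothetical, phased "0|0" calls are counted as homozygous reference as well, and how TASUKE actually stores these calls is not described here.

```python
def count_hom_ref(vcf_line):
    """Return (hom_ref_calls, total_samples) for one tab-separated VCF data line."""
    columns = vcf_line.rstrip("\r\n").split("\t")
    format_keys = columns[8].split(":")   # FORMAT column, e.g. GT:DP
    sample_fields = columns[9:]
    if "GT" not in format_keys:
        return 0, len(sample_fields)
    gt_index = format_keys.index("GT")
    hom_ref = 0
    for field in sample_fields:
        parts = field.split(":")
        genotype = parts[gt_index] if gt_index < len(parts) else "."
        if genotype in ("0/0", "0|0"):    # assumption: phased calls count too
            hom_ref += 1
    return hom_ref, len(sample_fields)


line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:12\t0/1:9\t1/1:7\t0|0:15\t./.:0\n"
print(count_hom_ref(line))  # prints (2, 5)
```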
When you want to input a VCF file again, first delete the registered variants with the '-r' option:
$ tasuke_variant_vcf.pl -r -db <database name> -u <user> -p <password> -n <ID>
If your VCF file is multi-sample (gatkm) and the number of samples is large, we recommend the following script (legacy_tasuke_variant_vcf_multi.pl). This script simplifies operations and speeds up DB registration by running multiple "tasuke_variant_vcf.pl" processes in parallel.
The "-k" option only checks the correspondence between VCF samples and DB AccessionIDs; DB registration is not performed. It is strongly recommended to perform this check before DB registration.
Command:
$ legacy_tasuke_variant_vcf_multi.pl -db <database name> -u <user> -p <password> -f <VCF file>
Either required (*1):
-n <IDs> : Comma-separated AccessionID list (corresponds to the order of VCF sample names) (*2)
-m <Path> : Path to "VCFsampleName > AccID" correspondence table file. VCFsampleName[,]AccID[\n]...
(none) : Consider VCF sample name as AccessionID.
Other options:
-h <remote host> : (default: localhost)
-z : Register GT:0/0 variants (not registered by default)
-t <num> : Number of threads (default: 4)
-k : TEST mode. Check AccessionIDs, and output commands for each thread, but do not perform DB registration.
Sub-action:
-a : Append variant information to a table that already contains records.(*3) Records with the same position will be registered multiple times.
-r : Delete variant information for the specified AccessionIDs. AccessionID itself is not deleted. A multi-sample VCF previously used for DB registration must be specified with "-f".
(*1) Priority is -n > -m. Only the AccIDs specified here will be registered.
(*2) See description in tasuke_variant_vcf.pl on this Wiki.
(*3) By default, registration is skipped for tables where records already exist.