Collection of datasets containing the TaxaSE bacterial taxonomic annotation pipeline, SILVA insilico datasets and Illumina sequencing data from sugarcane bacterial (16S) including subhabitats from soil, rhizosphere, stem and root
  • Description

    This dataset contains the TaxaSE bacterial taxonomic annotation pipeline, including its source code and associated data files. In silico data generated from the SILVA Release 123 database are also provided, covering both the whole-SILVA and the Removal of Taxa validation approaches, which were used to compare the Shannon entropy-based sequence similarity approach with percentage identity (via USEARCH v7.0.1090 32-bit; see Edgar 2010). Lastly, the raw FASTQ files and processed FASTA files from sugarcane (Saccharum spp.) are included, consisting of samples from the soil, rhizosphere, root and stem sub-habitats, alongside results generated in QIIME 1.9.1 (Caporaso et al. 2010).
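    As a rough illustration of the Shannon entropy measure named above, the following sketch computes the per-column entropy of a small set of aligned sequences. It is not the TaxaSE implementation: the function names and the toy alignment are hypothetical, and the way TaxaSE turns entropy into a similarity score is not shown here.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # H = sum over symbols of -p * log2(p), where p is the symbol frequency.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def per_column_entropy(aligned_seqs):
    """Entropy of every column in a list of equal-length aligned sequences."""
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*seqs) transposes rows (sequences) into columns (positions).
    return [column_entropy(col) for col in zip(*aligned_seqs)]


# Hypothetical toy alignment: two conserved columns, two variable ones.
alignment = ["ACGT", "ACGA", "ACTC", "ACTG"]
entropies = per_column_entropy(alignment)
print(entropies)
```

    Fully conserved columns score zero and columns with more evenly mixed bases score higher, which is the property an entropy-based comparison relies on.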

    The quality of all Illumina R1 and R2 reads was assessed visually using FastQC (Andrews 2016). Reads were then merged using FLASH (Magoč & Salzberg 2011) and converted to FASTA format using QIIME’s “convert_fastaqual_fastq.py” script. Alpha and beta diversity analyses were performed in QIIME, with TaxaSE results converted to a QIIME-compatible format for comparison. In silico data were generated from the SILVA Release 123 database using the MicroSim simulator. Sugarcane leaf, stalk, root and rhizosphere soil samples were collected by Dr. Kelly Hamonts (Hawkesbury Institute for the Environment, Western Sydney University, Australia) in November 2014 from eight sugarcane fields growing three sugarcane varieties (KQ228, MQ239 and Q240) near Ingham, Queensland, Australia.
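    For readers unfamiliar with the FASTQ-to-FASTA step described above, the conversion can be sketched in a few lines of Python. This is a simplified stand-in, not QIIME’s script: the function name is hypothetical, and the sketch assumes strict four-line FASTQ records and discards the quality strings.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA text, dropping qualities."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, _quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # The FASTA header reuses the FASTQ identifier without the leading '@'.
        records.append(f">{header[1:]}\n{sequence}")
    return "\n".join(records) + "\n" if records else ""


# Hypothetical two-read example.
example = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"
fasta = fastq_to_fasta(example)
print(fasta, end="")
```

    Real data would normally be streamed from file rather than held in memory, and merged reads from FLASH would be passed through the same kind of conversion.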

    In each field, three stools were randomly selected and samples were collected from two plants per stool. Samples were snap-frozen in liquid nitrogen in the field, transported to the laboratory on dry ice and stored at -80 °C. Frozen sugarcane tissue samples were ground using a mortar and pestle, and DNA was extracted from the resulting powder using the MoBio PowerPlant DNA extraction kit, following the manufacturer’s instructions. The MoBio PowerSoil DNA extraction kit was used to extract DNA from the soil samples. Bacterial 16S rRNA amplicon sequencing was performed by the NGS facility at Western Sydney University using Illumina MiSeq (2 x 301 bp paired-end) and the 341F/805R primer set.


    • Data publication title Collection of datasets containing the TaxaSE bacterial taxonomic annotation pipeline, SILVA insilico datasets and Illumina sequencing data from sugarcane bacterial (16S) including subhabitats from soil, rhizosphere, stem and root
    • Data type dataset
    • Keywords
      • NGS
      • Illumina
      • Taxonomy
      • Annotation
      • Pipeline
      • Community analysis
      • SILVA
      • Saccharum spp.
    • Funding source
      • Western Sydney University and CRC-CARE
    • Grant number(s)
    • FoR codes
      SEO codes
      Temporal (time) coverage
    • Start date 2013/02/01
    • End date 2017/02/28
    • Time period
       
      Spatial (location,mapping) coverage
    • Locations
      Data Locations

      Type Location Notes
      The Data Manager is: Ali Ijaz
      Access conditions Open
    • Related publications
        Name Ijaz, AZ, Jeffries, T, Quince, C, Hamonts, K & Singh, B 2017, ‘TaxaSE: Exploiting evolutionary conservation within 16S rDNA sequences for enhanced taxonomic annotation’, PeerJ Preprints. DOI: 10.7287/peerj.preprints.2941v1
      • URL https://doi.org/10.7287/peerj.preprints.2941v1
      • Notes Preprint
      • Name Taxonomic and Environmental Annotation of Bacterial 16S rDNA sequences via Shannon Entropy and Database Metadata Terms
      • URL
      • Notes PhD thesis; add when deposited in repository
      • Name Edgar, RC 2010, 'Search and clustering orders of magnitude faster than BLAST', Bioinformatics, vol. 26, no. 19, pp. 2460-2461.
      • URL
      • Notes As mentioned in Description
      • Name Caporaso, JG, Kuczynski, J, Stombaugh, J, Bittinger, K, Bushman, FD, Costello, EK et al. 2010, 'QIIME allows analysis of high-throughput community sequencing data', Nature Methods, vol. 7, no. 5, pp. 335-336.
      • URL
      • Notes As mentioned in Description
      • Name Magoč, T & Salzberg, SL 2011, 'FLASH: fast length adjustment of short reads to improve genome assemblies', Bioinformatics, vol. 27, no. 21, pp. 2957-2963.
      • URL
      • Notes As mentioned in Description
    • Related website
        Name Download HIE data
      • URL
      • Notes * add data url when the paper is published
      • Name HIE | Hawkesbury Institute for the Environment
      • URL
      • Notes
      • Name A quality control tool for high throughput sequence data (Andrews S. 2016)
      • URL
      • Notes
    • Related metadata (including standards, codebooks, vocabularies, thesauri, ontologies)
    • Related data
    • Related services
      Citation Ijaz, Ali; Hamonts, Kelly; Jeffries, Thomas (2017): Collection of datasets containing the TaxaSE bacterial taxonomic annotation pipeline, SILVA insilico datasets and Illumina sequencing data from sugarcane bacterial (16S) including subhabitats from soil, rhizosphere, stem and root. Western Sydney University. {ID_WILL_BE_HERE}