From picardio

Data Mining GenBank for Phylogenetic inference - T. Vision 



 

 
 
Tags:  data mining  datamining  genbank  phyloinformatics  bioinformatics 
Views:  2755
Published:  November 30, 2009
 
Notes:
 
Slide 1:
Prospects for enabling phylogenetically informed comparative biology on the web
Todd Vision (1,2) & Hilmar Lapp (1)
(1) U.S. National Evolutionary Synthesis Center
(2) Dept. of Biology, University of North Carolina at Chapel Hill

Suppose you have the sequence of a protein-coding gene, and are interested in its function. What is the first thing you would do?
• If it were me, I would search for conserved domains that match records in Pfam and other protein domain databases.
• Are these databases complete?
• Are they infallible?
• Are they still useful?

Why are these data useful?
• You needn’t have mastery of the specialist literature before the search
• A match connects you to a vast interconnected world of information
• Why not worry about completeness?
  - A negative result is not expensive
  - Many broadly useful records are already present
• Why not worry about fallibility?
  - The user can weigh the evidence once a match is found
  - Assertions should be exposed to scrutiny
Slide 2:
Some observations
• This infrastructure is designed to disseminate data to non-specialists
• The relevant data may be derived from multiple “studies”, not all of which are published
• Data is hoarded neither by the researcher nor by the domain database
• The search service is as widely disseminated as the data
• Semantic-level machine-to-machine communication facilitates human comprehension

The case of phylogenetic data
• There is a broad audience for phylogenetic data
  - Organismal phylogeny (e.g. Encyclopedia of Life)
  - Gene/protein trees
• Many of the available resources are geared toward specialist researchers & students
• Non-specialists turn to taxonomic classifications when they need organismal phylogenetic information
• Few know where to find gene/protein trees at all

TreeBase
• screenshot

Tree of Life Web Project
Slide 3:
The NCBI taxonomy
• Provides
  - A hierarchy for all species represented by DNA sequences in GenBank
  - Names and IDs for internal nodes
  - An FTP dump
• But does NOT
  - Include unsequenced species
  - Report confidence in topology or monophyly
  - Capture taxonomic nuance (though it has synonyms & common names)

What if the NCBI taxonomy…
• Listed all taxa, including fossils?
• Allowed one to assess where there are conflicting topologies?
• Reported support values for clades?
• Reported divergence time estimates for nodes (e.g. from TimeTree)?
• Reported the provenance of the data?

Node-oriented web services from the Tree of Life Web Project
• Name
• Description
• Authority
• Date
• Other names
• Completeness of children
• Extinction status
• Confidence of position
• Monophyly
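
The hierarchy with node IDs described on this slide can be walked as a simple parent-pointer table. The following is a minimal Python sketch, assuming the `nodes.dmp` layout of the NCBI taxonomy dump (fields separated by `\t|\t`, beginning with tax ID, parent tax ID, and rank, with the root listed as its own parent). The rows, IDs, and function names below are invented for illustration and are not real NCBI records.

```python
def parse_nodes(lines):
    """Map each tax ID to (parent tax ID, rank) from nodes.dmp-style rows."""
    nodes = {}
    for line in lines:
        # Drop the trailing "|" terminator, then split on the field separator.
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        tax_id, parent_id, rank = int(fields[0]), int(fields[1]), fields[2]
        nodes[tax_id] = (parent_id, rank)
    return nodes


def lineage(nodes, tax_id):
    """Return tax IDs from the given node up to the root (its own parent)."""
    path = [tax_id]
    while nodes[tax_id][0] != tax_id:
        tax_id = nodes[tax_id][0]
        path.append(tax_id)
    return path


def common_ancestor(nodes, a, b):
    """Return the deepest node shared by the lineages of a and b."""
    ancestors_of_a = set(lineage(nodes, a))
    for tax_id in lineage(nodes, b):
        if tax_id in ancestors_of_a:
            return tax_id
    return None


# Toy rows with hypothetical IDs (assumption: not taken from the real dump).
TOY_ROWS = [
    "1\t|\t1\t|\tno rank\t|",
    "10\t|\t1\t|\tfamily\t|",
    "20\t|\t10\t|\tgenus\t|",
    "30\t|\t20\t|\tspecies\t|",
    "31\t|\t20\t|\tspecies\t|",
]

nodes = parse_nodes(TOY_ROWS)
print(lineage(nodes, 30))
print(common_ancestor(nodes, 30, 31))
```

Note what the table cannot express: there is no field for support values, conflicting topologies, or divergence times, which is the gap the "What if" questions point at.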
Slide 4:
Further barriers to dissemination of phylogenetic information
• Technical obstacles
  - Technology for storing and querying trees
  - Difficulties with exchange standards
  - Inference of consensus trees and supertrees
  - Taxonomic intelligence
  - Globally unique identifiers
• Social obstacles
  - Reluctance to provide incomplete or fallible information

Outline
• Informatics @ NESCent
• An example of a phylogenetically-informed semantic web application for phenotype data
• Promoting interoperability and closing technical gaps in phyloinformatics through open development

NESCent sponsored science
• Catalysis Meetings (large, one-time events)
  - To foster new collaborations and synthetic research
• Working Groups
  - Smaller, focused, multiple meetings
• Sabbatical Scholars
• Postdoctoral fellows
• Short-term visitor program
  - 2 weeks to 3 months
  - Encourage collaborative projects
• Application info: http://www.nescent.org
Slide 5:
Evolutionary Informatics WG
• Organizers: Arlin Stoltzfus and Rutger Vos
• Selected goals:
  - XML serialization of NEXUS
  - Formal grammar for validation and interconversion of NEXUS & other formats
  - A transition model language for evolutionary models used in statistical inference
  - An ontology for evolutionary comparative data analysis
• http://www.nescent.org/wg_evoinfo

NESCent Informatics
• Support for sponsored science and scientists
  - Facilitating electronic collaboration
  - Software/database development
  - Providing HPC and other IT infrastructure
• Cyberinfrastructure for synthetic science
  - Data sharing
  - Software interoperability
  - Training
  - In partnership with major national and international efforts

GeoPhyloBuilder
“Putting the geography into phylogeography”
David Kidd & Xianhua Liu
• Extension for ArcGIS software that creates a spatiotemporal GIS network model from a tree with georeferenced nodes.
• 3D visualizations are possible through ArcSCENE.
• http://www.nescent.org/informatics/software.php

Phylogenetic cyberinfrastructure to enable comparative biology
• Two traditions in the recording of phenotype data
  - Natural language descriptions and character matrices
  - Statements made using anatomical and trait ontologies, designed to capitalize on the semantic web
• NESCent WG on morphological evolution in fish
  - Organized by Paula Mabee and Monte Westerfield
  - Led to a larger project
• Aim is to integrate
  - Mutant phenotype data for zebrafish
  - Comparative morphology data for the Ostariophysi
Slide 6:
Ontologies
• Defined terms with defined relationships
  - e.g. Gene Ontology, Cell Ontology
[Figure: ontology graph with the terms cell, membrane, cell projection, axolemma, and axon, linked by is_a and part_of relationships]

Describing phenotypes using ontologies
• Entity-Quality system (EQ)
• Entity term from an anatomy ontology
  - zebrafish anatomy, cell ontology, etc.
• Quality term from Phenotype and Trait Ontology (PATO)
• e.g. Entity=dorsal fin, Shape=round

Phenotype and Trait Ontology (PATO)
[Figure: fragment of the PATO hierarchy showing optical quality, chromatic property, color, amplitude, blue, green, dark blue, bright blue, physical quality, and buoyancy]

Evolutionary character matrices
• Common phenotypic data format in evolutionary biology (e.g. NEXUS)
• Characters + character states, similar to EQ

                 dorsal fin shape (character)    character 2
Species one      round (state)
Species two      pointed (state)
Species three    undulate (state)
Slide 7:
Character Matrix vs. EQ
[Diagram: a Character combines an Entity and an Attribute (dorsal fin, shape); a Character State is a Value (round). In EQ, the Entity comes from an anatomy ontology (AO), and the Quality (Attribute plus Value) comes from PATO.]

A scenario
• A geneticist observes a reduction in the number of a particular bone type (e.g. branchiostegal ray) in a zebrafish mutant of her favorite gene.
• She asks: is this bone variable in number among species in nature?
• She could query the evolutionary phenotype database using:
  - Entity = Branchiostegal ray (from TAO)
  - Qualities pertaining to attribute ‘count’ (from PATO)
• She could examine a visualization of the phylogenetic relationships of the taxa with the relevant character changes mapped.
• She would see that most Ostariophysi have 3 rays, but that reduction has occurred multiple times:
  - solenostomids and syngnathids (ghost pipefishes and pipefishes)
  - giganturids
  - saccopharyngoid (gulper and swallower) eels
• By examining additional changes on these same branches, she sees several parallelisms:
  - loss of the swimbladder, pelvic fins, and scales
  - elongation of the mandibular or hyoid arches
  - reduction or loss of the opercle in syngnathids and saccopharyngoids
  - a variety of other bones and soft tissues are lost or greatly modified
• She might hypothesize that these trait correlations are all due to alterations in the expression of the same suite of morphogens.
• She can select appropriate species from these lineages to follow up experimentally.
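
The query in this scenario amounts to filtering EQ annotations by entity and by quality attribute, then grouping taxa by the value they show. The following is a minimal Python sketch of that idea. The record layout, the `query` function, the taxon labels, and the counts are all hypothetical, invented for illustration, and do not reflect the actual database schema or real fish data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EQ:
    """One Entity-Quality annotation for a taxon (hypothetical layout)."""
    taxon: str
    entity: str      # term from an anatomy ontology
    attribute: str   # quality attribute, e.g. "count" or "shape"
    value: str       # quality value, e.g. "3" or "round"


# A character-matrix cell such as (Species one, "dorsal fin shape", "round")
# maps onto EQ as entity="dorsal fin", attribute="shape", value="round".
ANNOTATIONS = [
    EQ("taxon A", "branchiostegal ray", "count", "3"),
    EQ("taxon B", "branchiostegal ray", "count", "3"),
    EQ("taxon C", "branchiostegal ray", "count", "1"),
    EQ("taxon A", "dorsal fin", "shape", "round"),
    EQ("taxon C", "dorsal fin", "shape", "pointed"),
]


def query(annotations, entity, attribute):
    """Group taxa by quality value for one entity/attribute pair."""
    taxa_by_value = defaultdict(list)
    for ann in annotations:
        if ann.entity == entity and ann.attribute == attribute:
            taxa_by_value[ann.value].append(ann.taxon)
    return dict(taxa_by_value)


result = query(ANNOTATIONS, "branchiostegal ray", "count")
is_variable = len(result) > 1  # more than one value means the trait varies
print(result, is_variable)
```

Mapping the grouped taxa onto a tree, as the scenario goes on to describe, would be a separate step that needs the phylogeny itself.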
Slide 8:
What data are needed to enable this scenario?
• Anatomy and trait ontologies
• Phenotypes in EQ syntax for
  - Zebrafish mutants (already exist)
  - Species/clades of Ostariophysi
• Phylogenetic relationships among the Ostariophysi
  - Taxonomy ontology

Some anatomical ontologies
• Amphibia
• C. elegans
• Fish (zebrafish, medaka, teleosts)
• Insects (Drosophila, Mosquito, Hymenoptera)
• Mammals (mouse, human)
• Plants (Arabidopsis, cereals, maize, all plants)

[Diagram of project participants and components, with these labels:
NESCent (Vision, Lapp, Software Developers); Working groups; Curator interface; EQSYTE database; EQSYTE public interface;
USD (Mabee, Data Curator);
U. Oregon (Westerfield); Usability testing; Liaison to ZFIN; Liaison to NCBO;
EQSYTE contents: Zebrafish phenotypic & genetic data; Ostariophysan phenotypic data; Ontologies (taxonomy, TAO, PATO, homology);
Morphology collaborators (Arratia, Coburn, Hilton, Lundberg, Mayden);
NCBO: Applications (Phenote, OBO-Edit); OBO (host of TAO, PATO, taxonomy ontology);
Phenotype Ontologies for Evolutionary Biology Workshops;
Tulane U. (Rios/Ontology Curator);
Ichthyology community (DeepFin, Fishbase); Liaison to CToL]

Preserving published data for future integration efforts
• Sequence alignments (e.g. Treebase)
• Long-term population records (e.g. pedigrees)
• 2D and 3D images
• Collection and locality information
• Behavioral observations
• Numerical tables
• Etc.
• Most of these data are lost upon publication
• These are the stuff of comparative biology
Slide 9:
Dryad: A digital repository for published data in evolutionary biology
NCSU Digital Library Initiative
Journals and societies involved so far
• American Naturalist (ASN)
• Evolution (SSE)
• Journal of Evolutionary Biology (ESEB)
• Integrative and Comparative Biology (SICB)
• Molecular Biology and Evolution (SMBE)
• Molecular Ecology
• Molecular Phylogenetics and Evolution
• Systematic Biology (SSB)

Open development
• Open source refers only to the licensing of the software code
• At NESCent, we have been experimenting with practices in open development
  - Community contributes to a shared code base
  - Higher barrier to entry
  - Can be a substantial payoff in terms of interoperability, functionality, usability, maintenance
  - Surprisingly rare in academia

2006 Phyloinformatics Hackathon
[Figure labels: ATV, NCL, NESCent, HyPhy, PAUP*, CIPRES, GARLI, TreeBase, Bio::CDAT, Biojava, BioSQL, JEBL, Bioruby, BioPerl, Biopython]
Slide 10:
Hackathon mechanics
• Before the meeting
  - Participants and users suggested integrative workflows
• At the meeting
  - Gaps in existing toolkits were identified
  - Subgroups collaborated on high priority targets
  - Followed a “use case” model
  - Subgroups and targets were allowed to be fluid
  - Users were on hand to provide datasets, test code, provide their perspective
  - Dedicated participants tasked with documentation
• All code is open-source and deposited in established repositories

Accomplishments
• Sequence family evolution
  - BioPerl: Support for TribeMCL, QuickTree, ClustalW, Phylip, PAML
  - BioPerl & Biopython: Support for dN/dS-based tests for selection in HyPhy
  - Biojava: Parser for Phylip alignment format
  - BioRuby: Support for T-Coffee, MAFFT, and Phylip
• Reconciling trees
  - BioPerl: Support for NJTree
  - Biopython: Wrapper for Softparsmap
  - BioRuby: Model for phylogenetic trees and networks with graph algorithms
  - BioSQL: Model for phylogenetic trees and networks with optimization methods and topological queries
Slide 11:
• Phylogenetic inference on non-molecular characters
  - BioPerl: Interoperability between Bio::Phylo and BioPerl APIs
  - BioRuby: NEXUS-compliant data model and parser for PAUP and TNT results
• NEXUS compliance
  - Biojava: Interoperability between Biojava and JEBL
  - Biojava & BioRuby: Level II-compliant NEXUS parsers
  - All:
    - Evaluated major APIs
    - Proposed compliance levels
    - Gathered test files exposing common errors
    - Fixed compliance issues in NCL and Bio::NEXUS reference implementations
    - Worked on integrating those into GARLI and BioPerl, respectively
• Phylogenetic footprinting
  - BioPerl: Support for Footprinter, PhastCons, and using ClustalW over a sliding window
• Estimation of divergence times
  - BioPerl: Draft design of r8s wrapper

Next hackathon
• Comparative Phylogenetic Methods in R
• December 10-14, 2007
• Organizers: S. Kembel, H. Lapp, B. O'Meara, S. Price, T. Vision, A. Zanne
• http://hackathon.nescent.org/R_Hackathon_1
• Have an idea for a future event? Submit a whitepaper!

• Student internships in open-source software development
  - Students work with any of a large number of established OS projects
  - Students and mentors work & communicate remotely
• NESCent recruited mentors and oversaw student progress
  - Eleven students worked on projects in visualization, usability, interoperability & implementation of new methods
Slide 12:
NEXML
• Student: Jason Caravas
• Mentor: Rutger Vos
• Flexible serialization of phylogenetic objects
• Perl Bio::Phylo module tools for NEXML parsing and serialization

Command-line BioSQL
• Student: Jamie Estill
• Mentor: Hilmar Lapp
• Commands for
  - Database initialization
  - Bio::TreeIO import
  - Bio::TreeIO export
  - Tree query
  - Tree optimization
  - Tree manipulation

Conservation of phylogenetic diversity
• Student: Klaas Hartmann
• Mentor: Tobias Thierer
• Implementation of algorithm and GUI for optimal allocation of a finite budget to individual species to maximize phylogenetic diversity.
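
As background to the last project on this slide: the phylogenetic diversity (PD) of a set of species is commonly computed as the total branch length of the rooted subtree that connects them. The following is a minimal Python sketch of that measure on a toy tree, with an exhaustive search for the best subset of a given size. The tree, branch lengths, and function names are invented for illustration, and the brute-force search is a stand-in, not the budget-allocation algorithm the student project implemented.

```python
from itertools import combinations

# Toy rooted tree as child -> (parent, branch length); values are made up.
TREE = {
    "A": ("X", 2.0),
    "B": ("X", 2.0),
    "X": ("root", 1.0),
    "C": ("root", 3.0),
}


def phylogenetic_diversity(tree, species):
    """Sum the branch lengths on the paths from each species to the root,
    counting every branch once even when several species share it."""
    visited = set()
    total = 0.0
    for node in species:
        # Climb toward the root until reaching it or an already-counted branch.
        while node in tree and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def best_subset(tree, tips, k):
    """Exhaustively pick the k tips with the largest PD (fine for toy sizes)."""
    return max(
        combinations(sorted(tips), k),
        key=lambda subset: phylogenetic_diversity(tree, subset),
    )


pd_sisters = phylogenetic_diversity(TREE, ["A", "B"])
chosen = best_subset(TREE, ["A", "B", "C"], 2)
print(pd_sisters, chosen, phylogenetic_diversity(TREE, chosen))
```

In the toy tree, the two sister species A and B share the branch leading to X, so keeping both preserves less total branch length than keeping one of them together with the more distant C.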
Slide 13:
Bayesian calibration of divergence times
• Student: Michael Nowak
• Mentor: Derrick Zwickl
• Fossil occurrence data is used to construct informative priors on divergence times for Bayesian analysis in, e.g. BEAST

Phyloinformatics Summer Course
• Teaching advanced programming skills to phylogenetic methods developers
• Focus is on software technologies rather than methodology
• First year
  - 10 days in July 2007
  - Organized by Bill Piel of TreeBASE
  - 8 co-instructors
  - 23 students (11 female) in the first year

Conclusions
• The future of web-enabled comparative biology is beginning to become clearer.
  - For a preview, see genomics!
• The facile exchange of phylogenetic data is what will enable it.
• Expect to be using technologies such as ontologies and web services, which are now largely foreign to phylogenetic researchers.
• Also expect a shift toward open development.
  - This will necessitate new modes of training for academic phyloinformaticists.

Additional acknowledgements
• Hackathon participants
• GSoC mentors and students
• Summer course instructors
• Phenotype evolution project
  - Jim Balhoff, Wasila Dahdul, John Lundberg, Paula Mabee, Peter Midford, Monte Westerfield
• Data depository:
  - Ryan Scherle, Jane Greenberg

   