Discussion: 1 hour
Summary of Course Content
I. Introduction to scientific data management: goals and challenges
II. Scientific data models, transformations
- Generic data exchange formats (XML)
- Specialized data/file formats (netCDF, HDF5, FASTA, Nexus)
- Tree-based data transformations (XPath, XQuery, XSLT)
- XML model and query/transformation languages (XSLT, XQuery)
- Database integration, query rewriting
III. Knowledge representation with ontologies
- From controlled vocabularies and taxonomies to description logic ontologies
- Reasoning with ontologies
IV. Data integration
- Schema-mapping based approaches: Global-as-View (GAV), Local-as-View (LAV); Extensions
- Ontology-based extensions for data integration
V. Scientific Workflows
- Introduction/motivation: capturing in silico experiments as scientific workflows
- Application examples from diverse domains (e.g., bioinformatics, ecoinformatics, particle physics)
- Formal models for scientific workflows: Petri nets, Kahn process networks, Synchronous Dataflow
- Scientific workflow design paradigms: Collection-Oriented Modeling & Design (COMAD), higher-order/functional programming patterns
- Data and workflow provenance models
There are two kinds of projects: implementation projects and research projects. In implementation projects, students work with Java-based open-source systems such as the Kepler workflow system (www.kepler-project.org) to design and implement example workflows, e.g., a bioinformatics workflow that connects several "bio web services". Students thus work with existing software systems, but they typically also implement project-specific extensions to that software.
For research projects, students read one or more research papers from a list of offered research topics (e.g., scientific data integration, ontologies and knowledge representation in scientific data management, scientific workflows). Students then apply the results of these papers to a specific problem (e.g., applying a particular query rewriting algorithm to a given integration scenario and set of queries). In general, the deliverable of a research project is a technical report that summarizes and compares the results of the studied papers and describes their application to the given problem. Depending on the topic, the presented algorithms may have to be implemented and applied to the given problem instance.
A selection of technical papers addressing specific topics will be used. No textbook is required.
Potential Course Overlap
There is no significant overlap with any other course.