GSOC Project Ideas 2015

From GMOD

Revision as of 19:47, 10 February 2015

There are plenty of challenging and interesting project ideas this year. These projects span a broad range of skills, technologies and domains, such as GUIs, database integration and algorithms.

Students are also encouraged to propose their own ideas related to our projects. If you have strong computer skills and an interest in biology or bioinformatics, you should definitely apply! Do not hesitate to propose your own project idea: some of the best applications we see are by students who go this route. As long as it is relevant to one of our projects, we will give it serious consideration. Creativity and self-motivation are great traits for open source programmers.


  • Project Idea Name
    • Goal of the idea: describe the outcome of the project idea
    • Brief description of the idea, including any relevant links, etc.
    • Languages and skills: programming language(s) to be used, plus any other particular computer science skills needed
    • Mentors: name + contact details of the lead mentor, name + contact details of backup mentor

Here is a list of the proposed project ideas for 2015:

  • Project Idea 1: Using an interpreted language to develop bioinformatics workflows
    • Brief explanation: SeqWare is a bioinformatics workflow engine that can be used to chain together the analysis of big data in genomics and bioinformatics. The current workflow language is Java, which is rather verbose.
    • Expected results: Use Groovy to hide the verbose Java workflow language. An interpreted language also enables rapid prototyping of workflows. The goal is to make scripting SeqWare feel more like shell scripting. This is similar to the GATK team’s Queue, but would leverage SeqWare. Prototype: https://github.com/larsgt/stimpy
    • Knowledge prerequisites: Java, Groovy, git
    • Skill level: Medium
    • Mentors: Lars Jorgensen, Morgan Taschuk, Pipeline team


  • Project Idea 2: Write a Foreign Data Wrapper for Postgres and BAM/VCF
    • Brief explanation: SQL is a powerful language that makes querying structured data very straightforward, and genomics produces several types of structured data. Big data from genomics usually comes in two parts: the results, stored in files, and the metadata that describe the results, usually stored in databases. For example, a VCF file might describe a variant in a particular cancer-causing gene, while the metadata describe what the sample was, where it came from, how it was processed, etc. We would like to use SQL to query both results and metadata together.
    • Expected results: Develop a Foreign Data Wrapper for BAM and VCF in order to query alignment and variant information. There is an existing Foreign Data Wrapper for TSV files, which should make VCF and SAM fairly straightforward, since both are tab-delimited text formats. Accessing BAM files would be slightly more involved, because BAM is a compressed binary format. This could provide a good example of making queries against BAM data. Info: http://www.postgresql.org/docs/9.1/static/fdwhandler.html and http://www.depesz.com/2011/03/14/waiting-for-9-1-foreign-data-wrapper/.
    • Knowledge prerequisites: PostgreSQL
    • Skill level: Advanced
    • Mentors: Lars Jorgensen
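
To make the idea concrete, here is a minimal Python sketch of the row mapping such a wrapper would need for VCF: each data line becomes one row with the eight fixed VCF columns. This is only an illustration of the tabular view, not a Foreign Data Wrapper; the function names and the sample records are made up for the example.

```python
# Sketch only: maps VCF data lines to row dictionaries, the kind of
# tabular view a Foreign Data Wrapper would hand to PostgreSQL.
# Function names and the sample records below are hypothetical.

VCF_COLUMNS = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]


def parse_info(info_field):
    """Split an INFO string such as 'DP=14;DB' into a dictionary."""
    if info_field == ".":
        return {}
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        # Flags without '=' (e.g. 'DB') are stored as True.
        info[key] = value if sep else True
    return info


def vcf_rows(lines):
    """Yield one dict per variant record, skipping header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        row = dict(zip(VCF_COLUMNS, fields[:8]))
        row["pos"] = int(row["pos"])
        row["info"] = parse_info(row["info"])
        yield row


# Made-up records, for illustration only.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tC\tT\t50\tPASS\tDP=14;DB\n"
    "2\t200\tvar2\tA\tG\t99\tPASS\tDP=30\n"
)

# Rough equivalent of: SELECT chrom, pos FROM variants WHERE chrom = '1'
hits = [
    (row["chrom"], row["pos"])
    for row in vcf_rows(SAMPLE_VCF.splitlines())
    if row["chrom"] == "1"
]
```

In a real wrapper the filtering would be done by PostgreSQL (or pushed down to the wrapper), and the rows could then be joined against the sample metadata tables.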


  • Project Idea 3: Implement a FUSE interface to BAM/CRAM
    • Brief explanation: Storage of big data is an ongoing problem that will only get worse. As data moves through a processing pipeline in genomics, the output of each step is often a lossless conversion of its input with additional information integrated (e.g. FASTQ is a listing of all reads; BAM is an alignment of those reads to a reference but still contains all of the reads from the FASTQ). However, data from earlier in the pipeline is often kept so that the analysis can be repeated with different tools. This results in a duplication of data on the order of gigabytes to terabytes.
    • Expected results: Enable a tool to see the same BAM file as two FASTQ files, an interleaved FASTQ file or whatever format it needs (with the same information). This should be easy to prototype in Python, as fuse-python and pysam exist.
    • Knowledge prerequisites: C and/or Python, POSIX APIs
    • Skill level: Advanced
    • Mentor: Lars Jorgensen
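
The core of such a view is the conversion from an alignment record back to the original read. The Python sketch below shows that step using only the standard SAM fields and flag bits; the function name is hypothetical, and the FUSE layer and the BAM reading (e.g. via pysam) are deliberately left out.

```python
# Sketch only: turns SAM-style alignment fields back into FASTQ text.
# A FUSE layer would serve this text in response to read() calls;
# that part, and reading the BAM file itself, is omitted here.

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

FLAG_REVERSE = 0x10  # read is stored reverse-complemented
FLAG_READ1 = 0x40    # first read of a pair
FLAG_READ2 = 0x80    # second read of a pair


def sam_to_fastq(qname, flag, seq, qual):
    """Return (mate, fastq_record) for one alignment record.

    mate is 1 or 2 for paired reads and 0 for unpaired reads.
    """
    if flag & FLAG_REVERSE:
        # Undo the aligner's reverse-complement to recover the original read.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & FLAG_READ1:
        mate = 1
    elif flag & FLAG_READ2:
        mate = 2
    else:
        mate = 0
    suffix = "/%d" % mate if mate else ""
    return mate, "@%s%s\n%s\n+\n%s\n" % (qname, suffix, seq, qual)


# A reverse-strand first-in-pair record with made-up sequence and qualities.
mate, record = sam_to_fastq("read1", 0x1 | 0x10 | 0x40, "AACG", "IIH#")
```

A full implementation would also have to skip secondary and supplementary alignments (flags 0x100 and 0x800) so that each read appears only once, and route mate 1 and mate 2 to separate virtual files when two FASTQs are requested.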


  • Project Idea 4: Use Galaxy to run SeqWare workflows and process data
    • Brief explanation: SeqWare is a bioinformatics workflow engine that can be used to chain together the analysis of big data in genomics and bioinformatics. SeqWare is currently driven on the command line by skilled users. However, it would be incredibly useful to leverage SeqWare’s robustness and stability for individual non-expert users. Galaxy is a user-friendly mechanism for analysing data that can be used for this task.
    • Expected results: There are two potential sub-projects. 1) Adding SeqWare metadata and files as a data source in Galaxy, to enable Galaxy users to use SeqWare data, and 2) Launching and monitoring SeqWare workflows with Galaxy.
    • Knowledge prerequisites: Galaxy, Java, web services, PostgreSQL
    • Skill level: Medium
    • Mentor: Morgan Taschuk


  • Project Idea 5: Barcode scanner using phone or tablet to drive LIMS
    • Brief explanation: In a typical genomics lab, the Laboratory Information Management System (LIMS) is required to keep track of a lot of people, equipment and samples as they interact. A typical LIMS requires a desktop computer and a lot of drop-down menus to fulfill this task, which takes the technician away from the bench and introduces the potential for error. Large sequencing labs use barcodes instead, but barcode readers are prohibitively expensive for smaller labs. Cameras on phones are getting quite good, so it should be fairly easy to drive the barcode reading from a mobile device. This would be a low-cost way for smaller labs to use barcoding in their lab workflows. Barcode reading library: https://github.com/zxing/zxing.
    • Expected results: A mobile LIMS application that stores a particular lab workflow and prompts the user to scan barcodes when they reach a particular step in the workflow. It would also be able to send information back to the central LIMS servers.
    • Knowledge prerequisites: iOS or Android development, web services, interface design
    • Mentors: Lars Jorgensen, Timothy Beck and Tony DeBat


  • Project Idea 6: IPython notebook on top of our infrastructure
    • Brief explanation: IPython notebook is a powerful tool. It enables reproducible science, as people can share their work. It would be interesting to see how IPython notebook and SeqWare could interact. It would also be useful for OICR’s users if they could query our metadata, and metadata from other sources, using Python or R.
    • Expected results: A Python library that can be used to query SeqWare’s metadata through its RESTful web service.
    • Knowledge prerequisites: Python, web services
    • Skill level: Basic
    • Mentors: Timothy Beck, Lawrence Heisler, Yogi Sundaravadanam
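
As a rough starting point, a library like this could be a thin client that builds query URLs and decodes JSON responses. The Python sketch below assumes a JSON-over-HTTP service; the class name, base URL, resource name ("samples") and query parameter are all placeholders invented for illustration and do not describe SeqWare's actual web service.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


class MetadataClient:
    """Hypothetical client; base URL and resource names are placeholders."""

    def __init__(self, base_url, fetch=None):
        self.base_url = base_url.rstrip("/") + "/"
        # fetch is injectable so the client can be tried without a server.
        self._fetch = fetch or self._http_get

    @staticmethod
    def _http_get(url):
        """Fetch a URL over HTTP and return the body as text."""
        with urlopen(url) as response:
            return response.read().decode("utf-8")

    def build_url(self, resource, **params):
        """Join the resource onto the base URL and append query parameters."""
        url = urljoin(self.base_url, resource)
        if params:
            url += "?" + urlencode(sorted(params.items()))
        return url

    def get(self, resource, **params):
        """Fetch a resource and decode the JSON response."""
        return json.loads(self._fetch(self.build_url(resource, **params)))


def fake_fetch(url):
    """Stand-in for the network call, returning a made-up JSON payload."""
    return json.dumps([{"name": "sample_A", "requested": url}])


client = MetadataClient("http://localhost:8080/webservice", fetch=fake_fetch)
samples = client.get("samples", project="demo")
```

Inside a notebook, the decoded records could then be loaded into whatever tabular structure the user prefers for further analysis.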