2019 Summer Internship Projects

Projects for the 2019 Summer Internship Program are described below. Applicants will be matched to projects based on skills and interests.

GDC Dictionary GraphViz

Description: Interactive graph visualization of the GDC dictionary. The GDC dictionary defines multiple components of the GDC Data Model and the relationships between them, allowing researchers to query and access data useful for their research. Currently the GDC dictionary is provided as static content via an HTML page. We would like to enable interactive graph visualization to allow a user to: 1) Select any graph components and view detailed information on the components/subgraph. 2) Download the submission templates from a selected node/path. 3) As a preferred advanced feature, search the dictionary and interact with the search results shown in the view and on the graph. A sample visualization application can be found here: https://nci-crdc-demo.datacommons.io/DD
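As a rough illustration of the data behind such a view, the sketch below turns a dictionary into Graphviz DOT text and collects the nodes reachable from a selected node. It assumes a simplified, hypothetical dictionary layout (a mapping from node name to a list of links with "name" and "target_type" keys); the real GDC dictionary schema is richer, and the function names are illustrative only.

```python
# Hypothetical, simplified dictionary: node name -> outgoing links.
SAMPLE = {
    "program": {"links": []},
    "project": {"links": [{"name": "programs", "target_type": "program"}]},
    "case": {"links": [{"name": "projects", "target_type": "project"}]},
}


def dictionary_to_dot(dictionary):
    """Render the dictionary as Graphviz DOT text (one edge per link)."""
    lines = ["digraph dictionary {"]
    for node in sorted(dictionary):
        lines.append(f'  "{node}";')
    for node in sorted(dictionary):
        for link in dictionary[node].get("links", []):
            target, label = link["target_type"], link["name"]
            lines.append(f'  "{node}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


def reachable(dictionary, start):
    """Return the set of nodes reachable from start by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for link in dictionary.get(node, {}).get("links", []):
            stack.append(link["target_type"])
    return seen


if __name__ == "__main__":
    print(dictionary_to_dot(SAMPLE))
    print(sorted(reachable(SAMPLE, "case")))
```

The DOT text can be rendered with the Graphviz tools, and the reachable set is one possible basis for the "selected node/path" used when choosing which submission templates to download.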

Required Skills: JavaScript

Preferred Skills: Python, GraphViz, GitHub

Mentors: @kulgan & @cpreid2

Category: Fun/Peripheral

Difficulty: Easy

Interactive Genomic Graph Data Submission

Description: Create an interactive web UI that helps users visualize relationships between nodes and perform data submission. This will drive open science and allow researchers to share their data more easily with the scientific research community across the world. You will create a prototype web UI, using test data from GDC, in which a user can perform CRUD operations. The UI will use the GDC dictionary to display a data graph per project based on what the user has access to. Users will be able to visualize all nodes they have access to, create and update nodes on the fly, and see the changes directly. This will make our data submission process much less error-prone and tedious.

Required Skills: General programming skills

Preferred Skills: Web development experience

Mentors: @khan08 & @kulgan

Category: Risky/Exploratory

Difficulty: Medium

Develop machine learning notebooks over biomedical data

Description: Gen3 powers many different data commons and pilot projects, which house a variety of data including imaging data and annotations. Researchers would like to use machine learning tools such as TensorFlow, PyTorch, and Keras over these datasets. This internship will focus on building example notebooks that show these researchers how to integrate TensorFlow and similar tools with data stored in Gen3.

Required Skills: Python

Preferred Skills: GraphQL, Machine learning

Mentors: @giangbui & @wangfan860

Category: Fun/Peripheral

Difficulty: Medium

Risk level prediction of neurological and psychiatric diseases

Description: Feature extraction for risk level prediction of neurological and psychiatric diseases and disorders using weighted fuzzy rules and neural networks. Researchers will use this to improve our understanding of neurological and psychiatric disorders such as Alzheimer’s disease, Parkinson’s disease, schizophrenia, bipolar disorder, and autism. The effects of these disorders can be seen in MRI data by extracting image-derived phenotypes (IDPs). The existing algorithms are very time-consuming and can extract only some of the IDPs. We aim to build a pipeline that extracts 3,144 IDPs automatically and runs in parallel. For each patient, the pipeline will then build a feature vector consisting of the 3,144 IDPs, 11,734,353 SNPs, and clinical variables (e.g. age, gender, MBI, family history of mental disease, and history of alcohol use), labeled with a specific mental disease. These vectors will be used to train a risk level prediction model with a suitable machine learning method such as weighted fuzzy rules and neural networks.
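The shape of such a pipeline can be sketched roughly as follows: IDP extraction runs concurrently over the scans, and each patient's IDPs, SNP genotypes, and clinical variables are concatenated into one flat vector. Everything here is an assumption for illustration: extract_idps is a placeholder rather than a real imaging routine, the clinical field names are hypothetical, and a thread pool stands in for whatever parallel backend the project would actually use (CPU-bound extraction would more likely use processes or a cluster scheduler).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical clinical fields, in the fixed order used inside the vector.
CLINICAL_ORDER = ["age", "family_history", "alcohol_use"]


def extract_idps(scan_path):
    """Placeholder extractor: a real one would run the MRI pipeline."""
    return [float(len(scan_path)), 0.0, 1.0]


def build_feature_vector(idps, snps, clinical):
    """Concatenate IDPs, SNP genotypes, and clinical variables."""
    clinical_values = [float(clinical[key]) for key in CLINICAL_ORDER]
    return list(idps) + [float(g) for g in snps] + clinical_values


def build_dataset(patients, max_workers=4):
    """Extract IDPs for all patients concurrently, then assemble vectors."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        all_idps = list(pool.map(extract_idps, [p["scan"] for p in patients]))
    return [
        build_feature_vector(idps, p["snps"], p["clinical"])
        for idps, p in zip(all_idps, patients)
    ]


if __name__ == "__main__":
    demo = [{
        "scan": "a.nii",
        "snps": [0, 1, 2],
        "clinical": {"age": 70, "family_history": 1, "alcohol_use": 0},
    }]
    print(build_dataset(demo))
```

Keeping the clinical variables in a fixed order means every patient's vector has the same layout, which is what a downstream classifier would need.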

Required Skills: Python, shell scripting, MATLAB

Preferred Skills: GraphQL, Neuroscience, Genomics, Machine learning

Mentors: @kuangxy3 & @zhenyuz

Category: Infrastructure/Automation

Difficulty: Medium

NLP searching across variables

Description: Gen3 open-source data commons are part of a larger data ecosystem that drives open science by allowing anyone to develop a scientific application using research data from Gen3 APIs. Currently, Gen3 uses a data dictionary and data model schema, so every submission must first be harmonized to that schema. However, researchers would prefer to submit their data as they have it, with the harmonization done at query time.

Required Skills: General programming skills

Preferred Skills: Python, Go, Graph Databases

Mentors: @philloooo & @abgeorge7

Category: Experimental Research

Difficulty: Medium

Gen3 devops automation

Description: Gen3 uses Terraform, Kubernetes, and shell scripting to automate running data commons. As stewards of petabytes of patients’ biomedical research data, we have a responsibility to safeguard the security and integrity of our platform. Additional improvements to the automation process will help ensure that Gen3 runs reliably and securely. Example projects include changing Gen3 to use AWS Secrets Manager. We currently need a mechanism to audit changes to secrets or automatically rotate secrets periodically. This project will extend our automation tools to maintain the master copy of secrets in AWS Secrets Manager (or Vault or similar) so that changes to secrets can be audited, and also (bonus) set up a mechanism to automatically rotate secrets like database passwords and AWS access keys.
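Independent of which secrets store is chosen, the bookkeeping behind auditing and periodic rotation might look something like the sketch below. It is a minimal illustration using only the Python standard library: the function names, the secret names, and the 90-day window are assumptions, and it does not call AWS Secrets Manager, Vault, or any Gen3 tooling.

```python
from datetime import datetime, timedelta, timezone


def secrets_due_for_rotation(last_rotated, max_age_days, now=None):
    """Return the names of secrets last rotated max_age_days or more ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        name for name, rotated_at in last_rotated.items()
        if rotated_at <= cutoff
    )


def audit_entry(name, actor, action, when=None):
    """Build one audit record describing a change to a secret."""
    when = when or datetime.now(timezone.utc)
    return {
        "secret": name,
        "actor": actor,
        "action": action,
        "at": when.isoformat(),
    }


if __name__ == "__main__":
    now = datetime(2019, 6, 1, tzinfo=timezone.utc)
    last_rotated = {
        "db-password": datetime(2019, 1, 1, tzinfo=timezone.utc),
        "aws-access-key": datetime(2019, 5, 20, tzinfo=timezone.utc),
    }
    for name in secrets_due_for_rotation(last_rotated, 90, now=now):
        print(audit_entry(name, "rotation-job", "rotate", when=now))
```

In a real deployment the last-rotated timestamps and the audit trail would come from the secrets store itself rather than from an in-memory mapping.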

Required Skills: General programming skills

Preferred Skills: Experience with Kubernetes, AWS, Linux

Mentors: @fauzigo & @diw

Category: Core development

Difficulty: Medium

Migrate to Python 3

Description: This project will enable the research infrastructure that offers petabytes of biomedical data to researchers to keep pace with modern technology. To do this, you will patch code that currently works only on Python 2.7 so that it also works on Python 3.
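A few of the patterns that typically come up in this kind of patching are sketched below. The helper names are made up for illustration and are not taken from the Gen3 codebase; each function is written so that it behaves the same under Python 2.7 and Python 3.

```python
try:
    text_type = unicode  # Python 2: the built-in text type
except NameError:
    text_type = str      # Python 3: str is already text


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings on either version."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)


def half(n):
    """True division: dividing by a float avoids Python 2 integer division."""
    return n / 2.0


def sorted_items(mapping):
    """items() exists on both versions; iteritems() is Python 2 only."""
    return sorted(mapping.items())


def parse_count(raw, default=0):
    """Use 'except ... as err'; the 'except X, err' form is Python 2 only."""
    try:
        return int(raw)
    except ValueError as err:
        print("could not parse count: {0}".format(err))
        return default


if __name__ == "__main__":
    print(to_text(b"gdc"), half(5), sorted_items({"b": 2, "a": 1}))
```

Calling print with a single argument in parentheses, as above, also works on both versions without further changes.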

Required Skills: General programming skills

Preferred Skills: Python experience

Mentors: @rudyardrichter & @avantol13

Category: Low hanging fruit

Difficulty: Medium

GDC Command Line Tool

Description: Command line tool on top of the public-facing GDC API. The GDC is an open-source data commons that fuels a larger data ecosystem, driving open science by allowing anyone to develop a scientific application using research data from GDC APIs. The idea is to build something similar to Kubernetes' kubectl. We currently have a gdc-client, which is used to download/upload files, but we don't have a client that lets researchers explore and interact with the GDC API easily. As a starting point, we will develop a client on top of a limited set of APIs. This will be a beta version that includes the core APIs (projects, cases, files).
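One possible shape for such a client is a kubectl-style "verb resource" command line that maps onto API URLs. The sketch below only parses arguments and builds the request URL; it makes no network calls. The program name gdcctl, the get verb, and the flag names are hypothetical, and the API root and the size and fields query parameters are assumptions that should be checked against the GDC API documentation.

```python
import argparse
from urllib.parse import urlencode

API_ROOT = "https://api.gdc.cancer.gov"  # assumed public API root
RESOURCES = ("projects", "cases", "files")  # core APIs named above


def build_parser():
    """Create a kubectl-style parser: gdcctl get <resource> [options]."""
    parser = argparse.ArgumentParser(prog="gdcctl")
    verbs = parser.add_subparsers(dest="verb", required=True)
    get = verbs.add_parser("get", help="list records of a resource")
    get.add_argument("resource", choices=RESOURCES)
    get.add_argument("--size", type=int, default=10,
                     help="maximum number of records to request")
    get.add_argument("--fields", default="",
                     help="comma-separated fields to request")
    return parser


def build_url(args):
    """Translate parsed arguments into a request URL."""
    params = {"size": args.size}
    if args.fields:
        params["fields"] = args.fields
    return f"{API_ROOT}/{args.resource}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_url(build_parser().parse_args()))
```

Keeping URL construction separate from argument parsing would let the same function back both the command line tool and any tests, with the HTTP call added as a thin layer on top.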

Required Skills: Python

Preferred Skills: OOD, bash

Mentors: @anmaxvl & @kulgan

Category: Low hanging fruit

Difficulty: Easy

Script to populate bash autocompletion of command line apps from man pages

Description: Generate Bash autocompletion files (/etc/bash_completion.d/*) for commands from the man pages of command line apps. Currently we depend on custom autocompletion functions for each command. Not all commands (mainly command line apps written in other languages such as Python or Node) have an autocompletion file. The project will parse man pages (or help output) to generate autocompletion files for these apps. Command line apps use a few popular packages to generate their help text. You will parse the help text into a hash to be served in autocompletion.
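The parsing step described above could start from something like the following sketch, which reads argparse-style help text into a dictionary of options and emits a one-line completion using Bash's complete -W builtin. The regular expression, the function names, and the sample help text are illustrative assumptions; real man pages and other help generators use different layouts and would need their own handling.

```python
import re

# An option starts with one or two dashes and is not part of a longer word.
OPTION_RE = re.compile(r"(?<![\w-])(--?[A-Za-z0-9][\w-]*)")

SAMPLE_HELP = """\
usage: tool [-h] [--verbose] [-o FILE]

optional arguments:
  -h, --help            show this help message and exit
  --verbose             print more output
  -o FILE, --output FILE
                        write results to FILE
"""


def parse_options(help_text):
    """Map each option flag to the description found on its own line."""
    options = {}
    for line in help_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("-"):
            continue
        # The option spec ends at the first run of two or more spaces.
        parts = re.split(r"\s{2,}", stripped, maxsplit=1)
        description = parts[1] if len(parts) > 1 else ""
        for option in OPTION_RE.findall(parts[0]):
            options.setdefault(option, description)
    return options


def completion_script(command, options):
    """Emit a Bash line that completes the given options for command."""
    words = " ".join(options)
    return f"complete -W '{words}' {command}\n"


if __name__ == "__main__":
    print(completion_script("tool", parse_options(SAMPLE_HELP)), end="")
```

The generated line could be written to a file under /etc/bash_completion.d/ for the command in question; descriptions that wrap onto a following line, as with the last option in the sample, are left empty in this sketch.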

Required Skills: Bash, Linux

Preferred Skills: Parsers, Python

Mentors: @arizzubair & @kulgan

Category: Low hanging fruit

Difficulty: Medium

Ready to submit your application? Click below!