Ideas Page

This page lists ideas for proposed projects for the Google Summer of Code 2019 program.

GDC Dictionary GraphViz

Description: Interactive graph visualization of the GDC dictionary. The GDC dictionary defines the components of the GDC Data Model and the relationships between them, and it is published through a user-friendly interface. Currently the GDC dictionary is provided as static content via an HTML page. We would like to enable interactive graph visualization that allows a user to: 1) Select any graph component and view detailed information on the selected components or subgraph. 2) Download the submission templates for a selected node or path. 3) Search the dictionary and interact with the search results in both the list view and the graph (a preferred advanced feature). A sample visualization application can be found here: https://qa-niaid.planx-pla.net/DD
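One possible starting point is to turn the dictionary's node-to-node links into a graph description that a renderer can draw. The sketch below assumes a simplified, hypothetical dictionary shape (each node has a `links` list whose entries carry `target_type` and `label` keys) and emits Graphviz DOT text. The real GDC dictionary schema may differ, and the function name is illustrative only.

```python
def dictionary_to_dot(dictionary: dict) -> str:
    """Render a {node_id: {"links": [...]}} mapping as Graphviz DOT text.

    The input shape is an assumption made for this sketch, not the
    actual GDC dictionary format.
    """
    lines = ["digraph dictionary {"]
    # Declare every node first so that isolated nodes still appear.
    for node_id in sorted(dictionary):
        lines.append(f'  "{node_id}";')
    # Then add one labeled edge per link.
    for node_id in sorted(dictionary):
        for link in dictionary[node_id].get("links", []):
            target = link["target_type"]
            label = link["label"]
            lines.append(f'  "{node_id}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be passed to any DOT renderer, and a front end could attach click handlers to the rendered nodes to show details or offer template downloads.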

Required Skills: JavaScript

Preferred Skills: Python, GraphViz, GitHub

Mentors: @kulgan & @cpreid2

Category: Fun/Peripheral

Difficulty: Easy

Interactive Genomic Graph Data Submission

Description: Create an interactive web UI that helps users visualize relationships between nodes and submit data. You will build a prototype web UI, using test data from the GDC, in which a user can perform CRUD operations. The UI will use the GDC dictionary to display a data graph for each project, based on what the user has access to. Users will be able to visualize all nodes they have access to, create or update nodes on the fly, and see the changes immediately. This will make our data submission process much less error-prone and less tedious.

Required Skills: General programming skills

Preferred Skills: Web development experience

Mentors: @khan08 & @kulgan

Category: Risky/Exploratory

Difficulty: Medium

Advanced Graph Interactions in Gen3 Data Commons

Description: Add more advanced graph interactions to the dictionary viewer in Windmill, Gen3's data portal. Windmill's dictionary viewer uses a graph to show the structure of the data dictionary. It currently supports basic functions such as graph viewing, searching, and node path finding. Building on these intuitive data dictionary visualizations, we want to integrate more sophisticated graph-based interactions for functions such as updating the dictionary graph, manipulating nodes and links, and even file mapping.

Required Skills: Web development skills, UI/UX Design

Preferred Skills: JavaScript, D3.js, Visualization, UI/UX design

Mentors: @qingyashu & @uchicagovivi

Category: Fun/Peripheral

Difficulty: Medium

Risk level prediction of neurological and psychiatric diseases

Description: Feature extraction for risk level prediction of neurological and psychiatric diseases and disorders using weighted fuzzy rules and neural networks. The effects of neurological and psychiatric disorders such as Alzheimer’s disease, Parkinson’s disease, schizophrenia, bipolar disorder and autism can be seen in MRI data by extracting image-derived phenotypes (IDPs). Existing algorithms are very time-consuming and can extract only some of the IDPs. We aim to build a pipeline that extracts 3,144 IDPs automatically and runs in parallel. For each patient, the project will then build a feature vector containing the 3,144 IDPs, 11,734,353 SNPs and clinical variables (e.g. age, gender, MBI, family history of mental disease and history of alcohol use), labeled with a specific mental disease. This feature vector will be used to train a risk level prediction model with a suitable machine learning method such as weighted fuzzy rules and neural networks.

Required Skills: Python, shell scripting, MATLAB

Preferred Skills: GraphQL, Neuroscience, Genomics, Machine learning

Mentors: @kuangxy3 & @zhenyuz

Category: Infrastructure/Automation

Difficulty: Medium

Python Dependency Graph

Description: Determine the dependencies among our internal Python code. The basic idea is to see where our internal Python libraries and applications depend on each other, because we need to know where to start upgrading from Python 2 to Python 3. We want some sort of visualization that shows what depends on what, which gives us a better idea of where our choke points are and what is holding up our dependency upgrades.
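As a rough illustration of the idea, the sketch below collects top-level imports with Python's standard `ast` module, keeps only those that point at other internal packages, and lists the packages so that each one comes after the packages it depends on. The function names and the `sources` mapping are hypothetical. It also assumes the code parses under Python 3; sources that use Python 2-only syntax would need a different parser.

```python
import ast


def top_level_imports(source: str) -> set:
    """Return the top-level package names imported by Python source text."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which stay inside one package.
            names.add(node.module.split(".")[0])
    return names


def dependency_graph(sources: dict) -> dict:
    """Map each internal package to the other internal packages it imports.

    `sources` maps a package name to its source text (an assumed input shape).
    """
    internal = set(sources)
    return {
        package: (top_level_imports(source) & internal) - {package}
        for package, source in sources.items()
    }


def upgrade_order(graph: dict) -> list:
    """List packages so that each one follows the packages it depends on.

    Packages in a dependency cycle are emitted in an arbitrary relative order.
    """
    ordered, seen = [], set()

    def visit(package):
        if package in seen:
            return
        seen.add(package)
        for dep in sorted(graph.get(package, ())):
            visit(dep)
        ordered.append(package)

    for package in sorted(graph):
        visit(package)
    return ordered
```

Packages near the front of the resulting list have no internal dependencies and are natural places to start an upgrade. The same graph could be fed to a JavaScript visualization.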

Required Skills: Python, JavaScript (visualization), HTML, CSS

Preferred Skills: Ability to improvise

Mentors: @jesuspguofc

Category: Fun/Peripheral

Difficulty: Hard

Cloud Backup Validator

Description: Validate files being backed up into AWS S3 buckets. The Genomic Data Commons hosts a very large volume of genomics data and backs it up into public clouds. As part of the backup process, the files must be validated after being uploaded into the cloud. Validation requires computing the MD5 hash of the uploaded file and checking it against the original file’s MD5 hash. Once a file has been checked, its validation record must be updated to reflect success or error. Some of the files are large (up to 1 TB), so validating them can be challenging, and performance must be taken into account when developing the solution. This will be an open-source project written in Python (or Go) that can be deployed in AWS and run efficiently. It should be configurable to access a given list of S3 buckets and to generate validation output and summary reports.
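The core check can be sketched with the standard library alone. The example below computes an MD5 digest by reading a binary stream in fixed-size chunks, so memory use stays constant regardless of file size, and then compares the digest with the recorded value. The chunk size, function names, and result format are assumptions made for illustration. In a real deployment the stream would be the body of an S3 object rather than a local file-like object.

```python
import hashlib
from typing import BinaryIO

# 8 MiB per read; an arbitrary value chosen for this sketch.
CHUNK_SIZE = 8 * 1024 * 1024


def streaming_md5(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    # iter() with a sentinel keeps calling read() until it returns b"" (end of stream).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def validate(stream: BinaryIO, expected_md5: str) -> dict:
    """Compare the computed digest with the recorded one and report the result."""
    actual = streaming_md5(stream)
    status = "success" if actual == expected_md5.lower() else "error"
    return {"status": status, "expected": expected_md5, "actual": actual}
```

Because MD5 must process the bytes in order, a single large file cannot be split across workers for hashing. Throughput would more likely come from validating many files concurrently and from running the validator close to the bucket.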

Required Skills: AWS, Python

Preferred Skills: GitHub, Shell scripting, Linux, Go

Mentors: @singergr & @profoak

Category: Infrastructure/Automation

Difficulty: Medium

Gen3 SDKs

Description: Scripting against Gen3’s API is painful. An SDK will make it easier for both internal and third-party developers. Develop client SDKs that simplify Gen3 scripting in Python, Go, and Node.js.

Required Skills: General programming skills

Preferred Skills: Tell good jokes

Mentors: @thanh-nguyen-dang & @abgeorge7

Category: Low hanging fruit

Difficulty: Medium

Integrate Gen3 automation with AWS Secrets Manager

Description: Track Gen3 secrets in AWS Secrets Manager. We currently lack a mechanism to audit changes to secrets or to rotate secrets automatically on a schedule. This project will extend our automation tools to maintain the master copy of secrets in AWS Secrets Manager (or Vault or similar) so that changes to secrets can be audited, and also (bonus) set up a mechanism to automatically rotate secrets such as database passwords and AWS access keys.

Required Skills: General programming skills

Preferred Skills: Experience with Kubernetes, AWS, Linux

Mentors: @fauzigo & @diw

Category: Core development

Difficulty: Medium

API response change detection using SVM

Description: Using a support-vector machine (SVM), perform supervised learning on labeled API responses to find regression defects in an API. The outcome is a Python module that can generate an ML model to classify API responses. You will record the responses of API endpoints in a structured YAML file using vcrpy and perform supervised learning on the responses using a linear classifier such as an SVM. You will then use the resulting model to detect regression defects in a new version of the API.

Required Skills: Machine Learning, Python, REST API

Preferred Skills: Regression Analysis, Classification, Data Science, Supervised Learning, Support-vector machine

Mentors: @arizzubair & @kulgan

Category: Infrastructure/Automation

Difficulty: Medium

Migrate to Python 3

Description: Patch code that only works on Python 2.7 so that it also works on Python 3.

Required Skills: General programming skills

Preferred Skills: Python experience

Mentors: @rudyardrichter & @avantol13

Category: Low hanging fruit

Difficulty: Medium

GDC Command Line Tool

Description: A command line tool on top of the public-facing GDC API. The idea is to build something similar to Kubernetes' kubectl. We currently have a gdc-client, which is used to download and upload files, but we don't have a client that lets users explore and interact with the GDC API in an easy way. As a starting point we will develop a client on top of a limited set of APIs. This will be a beta version that covers the core APIs (projects, cases, files).
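A kubectl-style interface for the three core resources might look like the sketch below, which only parses arguments and builds the request URL without sending anything. The tool name `gdcctl`, the `get` verb, the `--size` option, and the base URL are all assumptions made for illustration and should be checked against the GDC API documentation.

```python
import argparse
from urllib.parse import urlencode

API_BASE = "https://api.gdc.cancer.gov"  # assumed base URL of the public GDC API
RESOURCES = ("projects", "cases", "files")  # core APIs named in the description


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for commands such as `gdcctl get projects --size 5`."""
    parser = argparse.ArgumentParser(prog="gdcctl")  # hypothetical tool name
    verbs = parser.add_subparsers(dest="verb", required=True)
    get = verbs.add_parser("get", help="list a resource or fetch one item")
    get.add_argument("resource", choices=RESOURCES)
    get.add_argument("id", nargs="?", help="optional identifier of a single item")
    get.add_argument("--size", type=int, default=10, help="maximum results to list")
    return parser


def build_url(args: argparse.Namespace) -> str:
    """Translate parsed arguments into the URL the client would request."""
    if args.id:
        return f"{API_BASE}/{args.resource}/{args.id}"
    return f"{API_BASE}/{args.resource}?{urlencode({'size': args.size})}"
```

Separating URL construction from the HTTP call keeps the command layer easy to test. A real client would add the request itself, output formatting, and further verbs.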

Required Skills: Python

Preferred Skills: Object-oriented design (OOD), Bash

Mentors: @anmaxvl & @kulgan

Category: Low hanging fruit

Difficulty: Easy

Script to populate bash autocompletion of command line apps from man pages

Description: Generate bash autocompletion files (/etc/bash_completion.d/*) for command line apps from their man pages. Currently we depend on custom autocompletion functions for each command. Not all commands have an autocompletion file, mainly command line apps written in other languages such as Python or Node.js. The script will parse man pages (or help output) to generate an autocompletion file for each command line app. Command line apps use a few popular packages to generate their help text. You will parse the help text into a hash that is then used for autocompletion.
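The parsing step could start as simply as the sketch below. It scans help-text lines that begin with a dash, collects the flags into a dictionary (the "hash"), and emits a one-line completion script based on bash's `complete -W` builtin. The regular expression, function names, and the assumption that option lines start with a dash are illustrative. Real help formats vary, and options that take arguments or subcommands would need more work.

```python
import re

# Matches long flags (--name) and single-letter short flags (-x) that are not
# part of a larger word such as "non-essential".
OPTION_RE = re.compile(r"(?<![\w-])(--[A-Za-z][\w-]*|-[A-Za-z])\b")


def parse_options(help_text: str) -> dict:
    """Map each flag found in the help text to the help line it appears on."""
    options = {}
    for line in help_text.splitlines():
        stripped = line.strip()
        # Only lines that start with a dash are treated as option entries.
        if not stripped.startswith("-"):
            continue
        for flag in OPTION_RE.findall(stripped):
            options.setdefault(flag, stripped)
    return options


def completion_script(command: str, options: dict) -> str:
    """Return a bash snippet that completes the command's flags."""
    words = " ".join(sorted(options))
    return f'complete -W "{words}" {command}\n'
```

The returned text could be written to a file under /etc/bash_completion.d/ for the given command.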

Required Skills: Bash, Linux

Preferred Skills: Parsers, Python

Mentors: @arizzubair & @kulgan

Category: Low hanging fruit

Difficulty: Medium

Utilize Bit for Oncojs

Description: A React component registry for Oncojs. Currently all data visualizations we use from the oncojs GitHub organization are installed through npm. Bit, a component registry, offers a newer way to reuse components (be they visual components or HOCs). The goal is to simplify the reuse of our existing components. The best way to test them is to increase their use across various projects.

Required Skills: JavaScript, React, HTML, Node.js

Preferred Skills: Bit, containers, OpenStack

Mentors: @jbarno

Category: Infrastructure/Automation

Difficulty: Medium

GDC Dev Docs

Description: Documentation for developers can be built from existing code comments. We have been using Sphinx for the documentation that powers the readthedocs.*.org sites in a few places within the GDC. We would like to host these docs in a public space to encourage open-source contributions. We will start by working out a consistent methodology for building the documentation added to any relevant README.md or rst files. Ultimately we want a place to host our documentation as it evolves.

Required Skills: Python, git

Preferred Skills: Text encoding, documentation systems

Mentors: @jbarno

Category: Fun/Peripheral

Difficulty: Medium