2020 Data Science Institute Student Capstone Projects

May 21, 2020

Did you know that all of Columbia’s M.S. in data science students complete capstone projects to solve business and operational problems before graduation?

Students divide into teams and use advanced artificial intelligence and machine learning techniques to tackle challenges suggested by the Data Science Institute’s industry affiliates. The companies provide mentors for the teams, and the students learn additional statistical, computational, and engineering techniques from Sining Chen, an adjunct professor of industrial engineering and operations research who teaches the course.

Spring 2020 capstone participants worked mostly online due to the coronavirus pandemic. Each team presented its project and findings virtually to its industry mentors, the instructor, and classmates on May 8.

This semester’s student deliverables included an AI-assisted medical image processing app, a web-based client intelligence tool, a study of geospatial trajectory clustering algorithms, a novel approach for mapping news stories to Wikipedia, and a study of reader reactions to news articles of varying trustworthiness.

Microsoft: Matching News to Wikipedia Pages
Mentor: David Rothschild, Economist, Microsoft Research
Team Members: Ivan Ugalde, Megala Sundar Kannan, Patrick Stanton, Sean Xu

Microsoft created a pipeline that scrapes daily news and finds the most relevant events. This team scraped Wikipedia and used natural language processing techniques to create a classification model that helps match daily events and news to specific, relevant Wikipedia pages.

General Electric: Clustering of Spatio-Temporal Trajectories for Asset Tracking
Mentor: Tapan Shah, Lead Scientist, GE Research
Team Members: Tabitha Karuna Sugumar, Fatima Koli, Jacqueline Araya, Kun Tao

With the increasing use of asset tracking, large databases of spatio-temporal trajectories (STT) are used for traffic planning, inventory optimization, and understanding movements. This team developed and evaluated novel STT clustering methods with promising results. The key novelty arises from the use of different similarity measures (dynamic time warping (DTW), Fréchet distance, edit distance, and longest common subsequence (LCSS)) and representations (e.g., autoencoders) for STT.
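To give a sense of what one of these similarity measures does, here is a minimal sketch of dynamic time warping between two trajectories. It is an illustration only, not the team's implementation: the function name `dtw_distance`, the representation of a trajectory as a list of (x, y) points, and the use of Euclidean distance between points are all assumptions made for this example.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two trajectories.

    Each trajectory is assumed to be a list of (x, y) points; the
    point-to-point cost is assumed to be Euclidean distance.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])
            # Extend the best of: advance in a, advance in b, or both.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Hypothetical example trajectories (not from the project's data).
route = [(0, 0), (1, 0), (2, 0)]
route_with_pause = [(0, 0), (0, 0), (1, 0), (2, 0)]  # same path, one repeated fix
parallel_route = [(0, 1), (1, 1), (2, 1)]            # same shape, shifted by 1

print(dtw_distance(route, route_with_pause))  # 0.0
print(dtw_distance(route, parallel_route))    # 3.0
```

Because DTW lets one point align with several points on the other trajectory, an asset that pauses along the same route still scores as identical, while a route shifted in space does not. A pairwise distance matrix built this way could then be passed to a clustering algorithm that accepts precomputed distances.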

Johnson & Johnson: AI-Assisted Estimation of Cup to Optic Disc Ratio from Human Retina Images in the Google Cloud Platform
Mentor: Joshua A. Young, Clinical Professor of Ophthalmology, New York University School of Medicine and Consultant, Johnson & Johnson
Team Members: Janet Catherine Prumachuk, Christine Hiu-Man Lee, Rohan Bareja, Wadood Chaudhary

The optic cup to optic disc ratio is an important measurement in glaucoma detection. Qualitative methods of estimating it result in poor reproducibility, while recently developed AI models have not been widely deployed. This team developed and deployed an AI pipeline on the Google Cloud Platform for practical use in a clinical setting.
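As a rough illustration of the measurement itself, the sketch below computes a vertical cup to disc ratio from two binary segmentation masks. This is not the team's pipeline: it assumes that a model has already produced cup and disc masks as lists of rows of 0s and 1s, and it assumes the vertical-diameter convention for the ratio. The function names are hypothetical.

```python
def vertical_extent(mask):
    """Height in pixels of the region marked with 1s in a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    return rows[-1] - rows[0] + 1 if rows else 0


def vertical_cup_to_disc_ratio(cup_mask, disc_mask):
    """Vertical cup diameter divided by vertical disc diameter.

    Both masks are assumed to come from the same image, so their
    row indices refer to the same pixel grid.
    """
    disc_height = vertical_extent(disc_mask)
    if disc_height == 0:
        raise ValueError("disc mask is empty")
    return vertical_extent(cup_mask) / disc_height


# Tiny hypothetical masks: the disc spans 4 rows, the cup spans 2.
disc = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
cup = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(vertical_cup_to_disc_ratio(cup, disc))  # 0.5
```

Computing the ratio from pixel masks rather than by eye is what makes the measurement repeatable: the same image always yields the same number.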

JPMorgan Chase & Co.: Know Your Client: A Client Intelligence Tool
Mentor: Naftali Cohen, Research Lead, AI Research, JPMorgan Chase & Co.
Team Members: Jesse Patrick Cahill, Thomas Causero, James Anthony DeAntonis, Ryan Owen McNally

The “Know Your Client” requirement broadly mandates that financial institutions run adequate background checks on potential clients. This team used public data, anomaly detection, and natural language processing to create a dashboard and a pip-installable package that help executives learn the full news history of a potential client.

Bloomberg: Assessing the Trustworthiness of News Articles
Mentors: Daniel Preotiuc, Senior Software Engineer Team Lead, Bloomberg LP; Sarah Ita Levitan, Postdoctoral Research Scientist, Computer Science, Columbia University
Team Members: Rohit Dalal, Muf Tayebaly, Phani Valasa, Harish Babu Visweswaran

This team built text-based predictive methods to assess the trustworthiness of news articles and to study the relationship between story trustworthiness and the reactions that the stories elicit through social media.