CodeAvail

21 Interesting Data Science Capstone Project Ideas [2024]

Data science, encompassing the analysis and interpretation of data, stands as a cornerstone of modern innovation. 

Capstone projects in data science education play a pivotal role, offering students hands-on experience to apply theoretical concepts in practical settings. 

These projects serve as a culmination of their learning journey, providing invaluable opportunities for skill development and problem-solving. 

Our blog is dedicated to guiding prospective students through the selection process of data science capstone project ideas. It offers curated ideas and insights to help them embark on a fulfilling educational experience. 

Join us as we navigate the dynamic world of data science, empowering students to thrive in this exciting field.

Data Science Capstone Project: A Comprehensive Overview

Data science capstone projects are an essential component of data science education, providing students with the opportunity to apply their knowledge and skills to real-world problems. 

Capstone projects challenge students to acquire and analyze data to solve real-world problems. These projects are designed to test students’ skills in data visualization, probability, inference and modeling, data wrangling, data organization, regression, and machine learning. 

In addition, capstone projects are conducted with industry, government, and academic partners, and most projects are sponsored by an organization. 

The projects are drawn from real-world problems, and students work in teams consisting of two to four students and a faculty advisor. 

The goal of the capstone project is to create a usable, public data product that students can show to potential employers as evidence of their skills. 

Best Data Science Capstone Project Ideas – According to Skill Level

Data science capstone projects are a great way to showcase your skills and apply what you’ve learned in a real-world context. Here are some project ideas categorized by skill level:

Beginner-Level Data Science Capstone Project Ideas

1. Exploratory Data Analysis (EDA) on a Dataset

Start by analyzing a dataset of your choice and exploring its characteristics, trends, and relationships. Practice using basic statistical techniques and visualization tools to gain insights and present your findings clearly and understandably.

2. Predictive Modeling with Linear Regression

Build a simple linear regression model to predict a target variable based on one or more input features. Learn about model evaluation techniques such as mean squared error and R-squared, and interpret the results to make meaningful predictions.
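
As a minimal sketch of this workflow, the following fits a one-feature least-squares line in plain Python and reports mean squared error and R-squared. The study-hours data and the function names are made up for illustration; a real project would more likely rely on a library such as scikit-learn.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_squared_error(ys, preds):
    """Average squared gap between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def r_squared(ys, preds):
    """Share of the variance in ys that the predictions explain."""
    y_bar = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
preds = [slope * h + intercept for h in hours]
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"MSE={mean_squared_error(scores, preds):.2f} "
      f"R^2={r_squared(scores, preds):.3f}")
```

For honest evaluation, the metrics should be computed on data held out from fitting; this sketch scores the training points only to keep the example short.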

3. Classification with Decision Trees

Use decision tree algorithms to classify data into distinct categories. Learn how to preprocess data, train a decision tree model, and evaluate its performance using metrics like accuracy, precision, and recall. Apply your model to practical scenarios like predicting customer churn or classifying spam emails.
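
The evaluation metrics named above can be sketched in a few lines. The labels below are invented (1 = churned, 0 = stayed) and stand in for the output of any classifier, a decision tree included; the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Precision answers "of the customers flagged as churners, how many really churned?", while recall answers "of the real churners, how many were flagged?". On imbalanced data the two can differ sharply even when accuracy looks high.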

4. Clustering with K-Means

Explore unsupervised learning by applying the K-Means algorithm to group similar data points together. Practice feature scaling and model evaluation to identify meaningful clusters within your dataset. Apply your clustering model to segment customers or analyze patterns in market data.
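
To make the algorithm concrete, here is a small sketch of the K-Means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). The points and starting centroids are invented, the features are assumed to be on comparable scales already, and a fixed iteration count replaces a proper convergence check.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; returns final centroids and point labels."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical customers described by two already-scaled features.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Results depend on the starting centroids, which is why library implementations typically run several random initializations and keep the best one.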

5. Sentiment Analysis on Text Data

Dive into natural language processing (NLP) by analyzing text data to determine sentiment polarity (positive, negative, or neutral). 

Learn about tokenization, text preprocessing, and sentiment analysis techniques using libraries like NLTK or spaCy. Apply your skills to analyze product reviews or social media comments.

6. Time Series Forecasting

Predict future trends or values based on historical time series data. Learn about time series decomposition, trend analysis, and seasonal patterns using methods like ARIMA or exponential smoothing. Apply your forecasting skills to predict stock prices, weather patterns, or sales trends.
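
Of the methods mentioned, simple exponential smoothing is the easiest to write by hand. The sketch below is illustrative only: the sales figures are invented and the smoothing factor `alpha` is chosen arbitrarily rather than tuned.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    Each smoothed value blends the newest observation with the previous
    smoothed value; a higher alpha reacts faster to recent changes.
    """
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical monthly sales.
sales = [10, 12, 14, 16]
smoothed = exponential_smoothing(sales, alpha=0.5)

# With this basic variant, the last smoothed value is the one-step forecast.
forecast = smoothed[-1]
print(smoothed, forecast)
```

Because this variant has no trend or seasonal component, its forecast lags behind a steadily rising series like the one above; Holt's and Holt-Winters' extensions add those components.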

7. Image Classification with Convolutional Neural Networks (CNNs)

Explore deep learning concepts by building a basic CNN model to classify images into different categories. 

Learn about convolutional layers, pooling, and fully connected layers, and experiment with different architectures to improve model performance. Apply your CNN model to tasks like recognizing handwritten digits or classifying images of animals.

Intermediate-Level Data Science Capstone Project Ideas

8. Customer Segmentation and Market Basket Analysis

Utilize advanced clustering techniques to segment customers based on their purchasing behavior. Conduct market basket analysis to identify frequent item associations and recommend personalized product suggestions. 

Implement techniques like the Apriori algorithm or association rules mining to uncover valuable insights for targeted marketing strategies.
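
The core quantities behind association rules (support, confidence, and lift) can be sketched by counting item pairs directly. This is not the full Apriori algorithm, which extends the same idea to larger itemsets while pruning infrequent candidates; the baskets and the function name are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.4):
    """Map (lhs, rhs) -> (support, confidence, lift) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions
        for pair in combinations(sorted(set(t)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]       # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # vs. baseline P(rhs)
            rules[(lhs, rhs)] = (support, confidence, lift)
    return rules

# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "diapers", "beer"],
]
rules = pair_rules(baskets)
print(rules[("beer", "diapers")])
```

A lift above 1 suggests the two items appear together more often than they would if purchased independently, which is the signal a recommendation would be built on.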

9. Time Series Anomaly Detection

Apply anomaly detection algorithms to identify unusual patterns or outliers in time series data. Utilize techniques such as moving average, Z-score, or autoencoders to detect anomalies in various domains, including finance, IoT sensors, or network traffic. 

Develop robust anomaly detection models to enhance data security and predictive maintenance.
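
A rolling Z-score detector is one of the simplest ways to try this. The sketch below compares each reading with the mean and standard deviation of the preceding window; the sensor values, window size, and threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Flag points that lie far from the mean of the preceding window.

    Returns a list of (index, value, z_score) tuples.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined
        z = (series[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, series[i], z))
    return anomalies

# Hypothetical sensor readings with one obvious spike.
readings = [10, 11, 10, 12, 11, 10, 11, 30, 11, 10, 12]
anomalies = rolling_zscore_anomalies(readings)
print([(i, value) for i, value, _ in anomalies])
```

Note that once the spike enters the window it inflates the standard deviation, so follow-up anomalies are harder to detect for the next few points; robust statistics such as the median and MAD reduce that effect.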

10. Recommendation System Development

Build a recommendation engine to suggest personalized items or content to users based on their preferences and behavior. Implement collaborative filtering, content-based filtering, or hybrid recommendation approaches to improve user engagement and satisfaction. 

Evaluate the performance of your recommendation system using metrics like precision, recall, and mean average precision.
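
The ranking metrics mentioned here can be computed by hand for a single user, as in this sketch. The item IDs are invented; mean average precision is simply the average of `average_precision` over all users.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall over the top-k recommended items."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k, hits / len(relevant)

def average_precision(recommended, relevant):
    """Average of precision@rank taken at each rank holding a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranked recommendations and the items the user actually liked.
recommended = ["A", "B", "C", "D", "E"]
relevant = {"A", "C", "F"}

p_at_3, r_at_3 = precision_recall_at_k(recommended, relevant, k=3)
ap = average_precision(recommended, relevant)
print(p_at_3, r_at_3, ap)
```

Unlike plain precision, average precision rewards placing relevant items near the top of the list, which usually matters more to users than the total number of hits.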

11. Natural Language Processing for Topic Modeling

Dive deeper into NLP by exploring topic modeling techniques to extract meaningful topics from text data. 

Implement algorithms like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) to identify hidden themes or subjects within large text corpora. Apply topic modeling to analyze customer feedback, news articles, or academic papers.

12. Fraud Detection in Financial Transactions

Develop a fraud detection system using machine learning algorithms to identify suspicious activities in financial transactions. Utilize supervised learning techniques such as logistic regression, random forests, or gradient boosting to classify transactions as fraudulent or legitimate. 

Employ feature engineering and model evaluation to improve fraud detection accuracy and minimize false positives.

13. Predictive Maintenance for Industrial Equipment

Implement predictive maintenance techniques to anticipate equipment failures and prevent costly downtime. 

Analyze sensor data from machinery using machine learning algorithms like support vector machines or recurrent neural networks to predict when maintenance is required. Optimize maintenance schedules to minimize downtime and maximize operational efficiency.

14. Healthcare Data Analysis and Disease Prediction

Utilize healthcare datasets to analyze patient demographics, medical history, and diagnostic tests to predict the likelihood of disease occurrence or progression. 

Apply machine learning algorithms such as logistic regression, decision trees, or support vector machines to develop predictive models for diseases like diabetes, cancer, or heart disease. Evaluate model performance using metrics like sensitivity, specificity, and area under the ROC curve.

Advanced Level Data Science Capstone Project Ideas

15. Deep Learning for Image Generation

Explore generative adversarial networks (GANs) or variational autoencoders (VAEs) to generate realistic images from scratch. Experiment with architectures like DCGAN or StyleGAN to create high-resolution images of faces, landscapes, or artwork. 

Evaluate image quality and diversity using perceptual metrics and human judgment.

16. Reinforcement Learning for Game Playing

Implement reinforcement learning algorithms like deep Q-learning or policy gradients to train agents to play complex games like Atari or board games. 

Experiment with exploration-exploitation strategies and reward-shaping techniques to improve agent performance and achieve superhuman levels of gameplay.

17. Anomaly Detection in Streaming Data

Develop real-time anomaly detection systems to identify abnormal behavior in data streams such as network traffic, sensor readings, or financial transactions. 

Utilize online learning algorithms like streaming k-means or Isolation Forest to detect anomalies and trigger timely alerts for intervention.
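
Streaming k-means and Isolation Forest are too involved for a short listing, so the sketch below swaps in a simpler online technique to show the streaming pattern itself: a running mean and variance maintained with Welford's algorithm, with each new value scored before it is absorbed. The class name, threshold, warm-up length, and readings are all assumptions.

```python
import math

class StreamingZScoreDetector:
    """Online anomaly detector using Welford's running mean and variance."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold
        self.warmup = warmup   # observations to collect before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations

    def update(self, x):
        """Score x against the history so far, then fold it into the stats."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford update: constant memory, one pass over the stream.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

# Hypothetical stream of readings.
detector = StreamingZScoreDetector()
flags = [detector.update(x) for x in [10, 11, 10, 12, 11, 10, 50, 11]]
print(flags)
```

This version folds anomalies into the running statistics, which keeps the code short but desensitizes the detector after a spike; skipping the update for flagged points, or decaying old observations, are common refinements.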

18. Multi-Modal Sentiment Analysis

Extend sentiment analysis to incorporate multiple modalities such as text, images, and audio to capture rich emotional expressions. 

Utilize deep learning architectures like multimodal transformers or fusion models to analyze sentiment across different modalities and improve understanding of complex human emotions.

19. Graph Neural Networks for Social Network Analysis

Apply graph neural networks (GNNs) to model and analyze complex relational data in social networks. Use techniques like graph convolutional networks (GCNs) or graph attention networks (GATs) to learn node embeddings and predict node properties such as community membership or user influence.

20. Time Series Forecasting with Deep Learning

Explore advanced deep learning architectures like long short-term memory (LSTM) networks or transformer-based models for time series forecasting. 

Utilize attention mechanisms and multi-horizon forecasting to capture long-term dependencies and improve prediction accuracy in dynamic and volatile environments.

21. Adversarial Robustness in Machine Learning

Investigate techniques to improve the robustness of machine learning models against adversarial attacks. 

Explore methods like adversarial training, defensive distillation, or certified robustness to mitigate vulnerabilities and ensure model reliability under adversarial perturbations, particularly in critical applications like autonomous vehicles or healthcare.

These project ideas cater to various skill levels in data science, ranging from beginners to experts. Choose a project that aligns with your interests and skill level, and don’t hesitate to experiment and learn along the way!

Factors to Consider When Choosing a Data Science Capstone Project

Choosing the right data science capstone project is crucial for your learning experience and effectively showcasing your skills. Here are some factors to consider when selecting a data science capstone project:

Personal Interest

Select a project that aligns with your passions and career goals to stay motivated and engaged throughout the process.

Data Availability

Ensure access to relevant and sufficient data to complete the project and draw meaningful insights effectively.

Complexity Level

Consider your current skill level and choose a project that challenges you without overwhelming you, allowing for growth and learning.

Real-World Impact

Aim for projects with practical applications or societal relevance to showcase your ability to solve tangible problems.

Resource Requirements

Evaluate the availability of resources such as time, computing power, and software tools needed to execute the project successfully.

Mentorship and Support

Seek projects with opportunities for guidance and feedback from mentors or peers to enhance your learning experience.

Novelty and Innovation

Look for projects that push boundaries and try new techniques or approaches to demonstrate creativity and originality in your work.

Tips for Successfully Completing a Data Science Capstone Project

Successfully completing a data science capstone project requires careful planning, effective execution, and strong communication skills. Here are some tips to help you navigate through the process:

  • Plan and Prioritize: Break down the project into manageable tasks and create a timeline to stay organized and focused.
  • Understand the Problem: Clearly define the project objectives, requirements, and expected outcomes before beginning the analysis.
  • Explore and Experiment: Experiment with different methodologies, algorithms, and techniques to find the most suitable approach.
  • Document and Iterate: Document your process, results, and insights thoroughly, and iterate on your analyses based on feedback and new findings.
  • Collaborate and Seek Feedback: Collaborate with peers, mentors, and stakeholders, actively seeking feedback to improve your work and decision-making.
  • Practice Communication: Communicate your findings effectively through clear visualizations, reports, and presentations tailored to your audience’s understanding.
  • Reflect and Learn: Reflect on your challenges, successes, and lessons learned throughout the project to inform your future endeavors and continuous improvement.

By following these tips, you can successfully navigate the data science capstone project and demonstrate your skills and expertise in the field.

Wrapping Up

Data science capstone projects are invaluable in bridging the gap between theory and practice, offering students a chance to apply their knowledge in real-world scenarios.

They are a cornerstone of data science education, fostering critical thinking, problem-solving, and practical skills development. 

As you embark on your journey, don’t hesitate to explore diverse and challenging project ideas. Embrace the opportunity to push boundaries, innovate, and make meaningful contributions to the field. 

Share your insights, challenges, and successes with others, and invite fellow enthusiasts to exchange ideas and experiences. 

1. What is the purpose of a data science capstone project?

A data science capstone project serves as a culmination of a student’s learning experience, allowing them to apply their knowledge and skills to solve real-world problems in the field of data science. It provides hands-on experience and showcases their ability to analyze data, derive insights, and communicate findings effectively.

2. What are some examples of data science capstone projects?

Data science capstone projects can cover a wide range of topics and domains, including predictive modeling, natural language processing, image classification, recommendation systems, and more. Examples may include analyzing customer behavior, predicting stock prices, sentiment analysis on social media data, or detecting anomalies in financial transactions.

3. How long does it typically take to complete a data science capstone project?

The duration of a data science capstone project can vary depending on factors such as project complexity, available resources, and individual pace. Generally, it may take several weeks to several months to complete a project, including tasks such as data collection, preprocessing, analysis, modeling, and presentation of findings.

10 Unique Data Science Capstone Project Ideas

A capstone project is a culminating assignment that allows students to demonstrate the skills and knowledge they’ve acquired throughout their degree program. For data science students, it’s a chance to tackle a substantial real-world data problem.

If you’re short on time, here’s a quick answer to your question: Some great data science capstone ideas include analyzing health trends, building a predictive movie recommendation system, optimizing traffic patterns, forecasting cryptocurrency prices, and more.

In this comprehensive guide, we will explore 10 unique capstone project ideas for data science students. We’ll outline potential data sources, analysis methods, and practical applications for each idea.

Whether you want to work with social media datasets, geospatial data, or anything in between, you’re sure to find an interesting capstone topic.

Project Idea #1: Analyzing Health Trends

When it comes to data science capstone projects, analyzing health trends is an intriguing idea that can have a significant impact on public health. By leveraging data from various sources, data scientists can uncover valuable insights that can help improve healthcare outcomes and inform policy decisions.

Data Sources

There are several data sources that can be used to analyze health trends. One of the most common sources is electronic health records (EHRs), which contain a wealth of information about patient demographics, medical history, and treatment outcomes.

Other sources include health surveys, wearable devices, social media, and even environmental data.

Analysis Approaches

When analyzing health trends, data scientists can employ a variety of analysis approaches. Descriptive analysis can provide a snapshot of current health trends, such as the prevalence of certain diseases or the distribution of risk factors.

Predictive analysis can be used to forecast future health outcomes, such as predicting disease outbreaks or identifying individuals at high risk for certain conditions. Machine learning algorithms can be trained to identify patterns and make accurate predictions based on large datasets.

Applications

The applications of analyzing health trends are vast and far-reaching. By understanding patterns and trends in health data, policymakers can make informed decisions about resource allocation and public health initiatives.

Healthcare providers can use these insights to develop personalized treatment plans and interventions. Researchers can uncover new insights into disease progression and identify potential targets for intervention.

Ultimately, analyzing health trends has the potential to improve overall population health and reduce healthcare costs.

Project Idea #2: Movie Recommendation System

When developing a movie recommendation system, there are several data sources that can be used to gather information about movies and user preferences. One popular data source is the MovieLens dataset, which contains a large collection of movie ratings provided by users.

Another source is IMDb, a widely used website that provides comprehensive information about movies, including user ratings and reviews. Streaming platforms like Netflix and Amazon Prime also collect user ratings and viewing history; where such data is accessible, it can be valuable for building an accurate recommendation system.

There are several analysis approaches that can be employed to build a movie recommendation system. One common approach is collaborative filtering, which uses user ratings and preferences to identify patterns and make recommendations based on similar users’ preferences.

Another approach is content-based filtering, which analyzes the characteristics of movies (such as genre, director, and actors) to recommend similar movies to users. Hybrid approaches that combine both collaborative and content-based filtering techniques are also popular, as they can provide more accurate and diverse recommendations.
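
A minimal sketch of user-based collaborative filtering is shown below: it measures how similar two users are with cosine similarity, then predicts a rating for each unseen movie as a similarity-weighted average of other users' ratings. The users, movies, ratings, and function names are invented; a real project would load something like the MovieLens ratings instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_n=2):
    """Rank the user's unseen movies by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    predicted = {m: scores[m] / weights[m] for m in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana": {"Alien": 5, "Heat": 4},
    "ben": {"Alien": 5, "Heat": 4, "Up": 2, "Blade Runner": 5},
    "cy": {"Alien": 1, "Up": 5, "Blade Runner": 2},
}
recommendations = recommend(ratings, "ana")
print(recommendations)
```

Because "ana" rates movies much like "ben" does, his opinions dominate the weighted average. A content-based or hybrid variant would bring in movie attributes such as genre or director alongside these ratings.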

A movie recommendation system has numerous applications in the entertainment industry. One application is to enhance the user experience on streaming platforms by providing personalized movie recommendations based on individual preferences.

This can help users discover new movies they might enjoy and improve overall satisfaction with the platform. Additionally, movie recommendation systems can be used by movie production companies to analyze user preferences and trends, aiding in the decision-making process for creating new movies.

Finally, movie recommendation systems can also be utilized by movie critics and reviewers to identify movies that are likely to be well-received by audiences.

For more information on movie recommendation systems, you can visit https://www.kaggle.com/rounakbanik/movie-recommender-systems or https://www.researchgate.net/publication/221364567_A_new_movie_recommendation_system_for_large-scale_data.

Project Idea #3: Optimizing Traffic Patterns

When it comes to optimizing traffic patterns, there are several data sources that can be utilized. One of the most prominent sources is real-time traffic data collected from various sources such as GPS devices, traffic cameras, and mobile applications.

This data provides valuable insights into the current traffic conditions, including congestion, accidents, and road closures. Additionally, historical traffic data can also be used to identify recurring patterns and trends in traffic flow.

Other data sources that can be used include weather data, which can help in understanding how weather conditions impact traffic patterns, and social media data, which can provide information about events or incidents that may affect traffic.

Optimizing traffic patterns requires the use of advanced data analysis techniques. One approach is to use machine learning algorithms to predict traffic patterns based on historical and real-time data.

These algorithms can analyze various factors such as time of day, day of the week, weather conditions, and events to predict traffic congestion and suggest alternative routes.

Another approach is to use network analysis to identify bottlenecks and areas of congestion in the road network. By analyzing the flow of traffic and identifying areas where traffic slows down or comes to a halt, transportation authorities can make informed decisions on how to optimize traffic flow.
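
To illustrate how a road network can be analyzed for alternative routes, the sketch below models intersections as graph nodes with travel times in minutes on the edges and finds the fastest route using Dijkstra's algorithm. The network, the travel times, and the congestion scenario are invented.

```python
import heapq

def fastest_route(graph, start, goal):
    """Dijkstra's algorithm; returns (total_minutes, path) or None."""
    queue = [(0, start, [start])]
    best = {start: 0}
    while queue:
        minutes, node, path = heapq.heappop(queue)
        if node == goal:
            return minutes, path
        if minutes > best.get(node, float("inf")):
            continue  # stale entry: a faster way to this node was found
        for neighbor, cost in graph.get(node, {}).items():
            new_minutes = minutes + cost
            if new_minutes < best.get(neighbor, float("inf")):
                best[neighbor] = new_minutes
                heapq.heappush(queue, (new_minutes, neighbor, path + [neighbor]))
    return None

# Hypothetical road network: travel time in minutes between intersections.
free_flow = {
    "A": {"B": 5, "C": 2},
    "B": {"D": 4},
    "C": {"B": 1, "D": 8},
}
# Same network with congestion on the C -> B segment.
congested = {**free_flow, "C": {"B": 10, "D": 8}}

normal = fastest_route(free_flow, "A", "D")
rerouted = fastest_route(congested, "A", "D")
print(normal)
print(rerouted)
```

In a capstone project the edge weights would come from predicted travel times rather than fixed numbers, so the routing step sits naturally on top of the forecasting model.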

The optimization of traffic patterns has numerous applications and benefits. One of the main benefits is the reduction of traffic congestion, which can lead to significant time and fuel savings for commuters.

By optimizing traffic patterns, transportation authorities can also improve road safety by reducing the likelihood of accidents caused by congestion.

Additionally, optimizing traffic patterns can have positive environmental impacts by reducing greenhouse gas emissions. By minimizing the time spent idling in traffic, vehicles can operate more efficiently and emit fewer pollutants.

Furthermore, optimizing traffic patterns can have economic benefits by improving the flow of goods and services. Efficient traffic patterns can reduce delivery times and increase productivity for businesses.

Project Idea #4: Forecasting Cryptocurrency Prices

With the growing popularity of cryptocurrencies like Bitcoin and Ethereum, forecasting their prices has become an exciting and challenging task for data scientists. This project idea involves using historical data to predict future price movements and trends in the cryptocurrency market.

When working on this project, data scientists can gather cryptocurrency price data from various sources such as cryptocurrency exchanges, financial websites, or APIs. Websites like CoinMarketCap (https://coinmarketcap.com/) provide comprehensive data on various cryptocurrencies, including historical price data.

Additionally, platforms like CryptoCompare (https://www.cryptocompare.com/) offer real-time and historical data for different cryptocurrencies.

To forecast cryptocurrency prices, data scientists can employ various analysis approaches. Some common techniques include:

  • Time Series Analysis: This approach involves analyzing historical price data to identify patterns, trends, and seasonality in cryptocurrency prices. Techniques like moving averages, autoregressive integrated moving average (ARIMA), or exponential smoothing can be used to make predictions.
  • Machine Learning: Machine learning algorithms, such as random forests, support vector machines, or neural networks, can be trained on historical cryptocurrency data to predict future price movements. These algorithms can take multiple variables into account, such as trading volume, market sentiment, or external factors, when making predictions.
  • Sentiment Analysis: This approach involves analyzing social media sentiment and news articles related to cryptocurrencies to gauge market sentiment. By considering the collective sentiment, data scientists can predict how positive or negative sentiment can impact cryptocurrency prices.
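
As a sketch of the time series approach, the code below produces a moving-average forecast and backtests it by walking forward through the history, predicting each point only from the data that precedes it. The prices are invented, and the window length is an arbitrary assumption.

```python
def moving_average_forecast(prices, window):
    """Forecast the next value as the mean of the last `window` prices."""
    return sum(prices[-window:]) / window

def backtest_mae(prices, window):
    """Walk-forward mean absolute error of the moving-average forecast."""
    errors = [
        abs(prices[i] - moving_average_forecast(prices[:i], window))
        for i in range(window, len(prices))
    ]
    return sum(errors) / len(errors)

# Hypothetical daily closing prices.
prices = [100, 102, 101, 105, 107, 110]

next_price = moving_average_forecast(prices, window=3)
mae = backtest_mae(prices, window=3)
print(f"forecast={next_price:.2f} backtest MAE={mae:.2f}")
```

The walk-forward backtest matters more than the forecasting method itself: evaluating on data the model has already seen tends to overstate accuracy, especially in a market as volatile as this one.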

Forecasting cryptocurrency prices can have several practical applications:

  • Investment Decision Making: Accurate price forecasts can help investors make informed decisions when buying or selling cryptocurrencies. By considering the predicted price movements, investors can optimize their investment strategies and potentially maximize their returns.
  • Trading Strategies: Traders can use price forecasts to develop trading strategies, such as trend following or mean reversion. Predicted price movements can inform trading decisions, though the volatility of the cryptocurrency market means profits are never guaranteed.
  • Risk Management: Cryptocurrency price forecasts can help individuals and organizations manage their risk exposure. By understanding potential price fluctuations, risk management strategies can be implemented to mitigate losses.

Project Idea #5: Predicting Flight Delays

One interesting and practical data science capstone project idea is to create a model that can predict flight delays. Flight delays can cause a lot of inconvenience for passengers and can have a significant impact on travel plans.

By developing a predictive model, airlines and travelers can be better prepared for potential delays and take appropriate actions.

To create a flight delay prediction model, you would need to gather relevant data from various sources. Some potential data sources include:

  • Flight data from airlines or aviation organizations
  • Weather data from meteorological agencies
  • Historical flight delay data from airports

By combining these different data sources, you can build a comprehensive dataset that captures the factors contributing to flight delays.

Once you have collected the necessary data, you can employ different analysis approaches to predict flight delays. Some common approaches include:

  • Machine learning algorithms such as decision trees, random forests, or neural networks
  • Time series analysis to identify patterns and trends in flight delay data
  • Feature engineering to extract relevant features from the dataset

By applying these analysis techniques, you can develop a model that can accurately predict flight delays based on the available data.

The applications of a flight delay prediction model are numerous. Airlines can use the model to optimize their operations, improve scheduling, and minimize disruptions caused by delays. Travelers can benefit from the model by being alerted in advance about potential delays and making necessary adjustments to their travel plans.

Additionally, airports can use the model to improve resource allocation and manage passenger flow during periods of high delay probability. Overall, a flight delay prediction model can significantly enhance efficiency and customer satisfaction in the aviation industry.

Project Idea #6: Fighting Fake News

With the rise of social media and the easy access to information, the spread of fake news has become a significant concern. Data science can play a crucial role in combating this issue by developing innovative solutions.

Here are some aspects to consider when working on a project that aims to fight fake news.

When it comes to fighting fake news, having reliable data sources is essential. There are several trustworthy platforms that provide access to credible news articles and fact-checking databases. Websites like Snopes and FactCheck.org are good starting points for obtaining accurate information.

Additionally, social media platforms such as Twitter and Facebook can be valuable sources for analyzing the spread of misinformation.

One approach to analyzing fake news is by utilizing natural language processing (NLP) techniques. NLP can help identify patterns and linguistic cues that indicate the presence of misleading information.

Sentiment analysis can also be employed to determine the emotional tone of news articles or social media posts, which can be an indicator of potential bias or misinformation.

Another approach is network analysis, which focuses on understanding how information spreads through social networks. By analyzing the connections between users and the content they share, it becomes possible to identify patterns of misinformation dissemination.

Network analysis can also help in identifying influential sources and detecting coordinated efforts to spread fake news.

The applications of a project aiming to fight fake news are numerous. One possible application is the development of a browser extension or a mobile application that provides users with real-time fact-checking information.

This tool could flag potentially misleading articles or social media posts and provide users with accurate information to help them make informed decisions.

Another application could be the creation of an algorithm that automatically identifies fake news articles and separates them from reliable sources. This algorithm could be integrated into news aggregation platforms to help users distinguish between credible and non-credible information.

Project Idea #7: Analyzing Social Media Sentiment

Social media platforms have become a treasure trove of valuable data for businesses and researchers alike. When analyzing social media sentiment, there are several data sources that can be tapped into. The most popular ones include:

  • Twitter: With its vast user base and real-time nature, Twitter is often the go-to platform for sentiment analysis. Researchers can gather tweets containing specific keywords or hashtags to analyze the sentiment of a particular topic.
  • Facebook: Facebook offers rich data for sentiment analysis, including posts, comments, and reactions. Analyzing the sentiment of Facebook posts can provide valuable insights into user opinions and preferences.
  • Instagram: Instagram’s visual nature makes it an interesting platform for sentiment analysis. By analyzing the comments and captions on Instagram posts, researchers can gain insights into the sentiment associated with different images or topics.
  • Reddit: Reddit is a popular platform for discussions on various topics. By analyzing the sentiment of comments and posts on specific subreddits, researchers can gain insights into the sentiment of different communities.

These are just a few examples of the data sources that can be used for analyzing social media sentiment. Depending on the research goals, other platforms such as LinkedIn, YouTube, and TikTok can also be explored.

When it comes to analyzing social media sentiment, there are various approaches that can be employed. Some commonly used analysis techniques include:

  • Lexicon-based analysis: This approach involves using predefined sentiment lexicons to assign sentiment scores to words or phrases in social media posts. By aggregating these scores, researchers can determine the overall sentiment of a post or a collection of posts.
  • Machine learning: Machine learning algorithms can be trained to classify social media posts into positive, negative, or neutral sentiment categories. These algorithms learn from labeled data and can make predictions on new, unlabeled data.
  • Deep learning: Deep learning techniques, such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), can be used to capture the complex patterns and dependencies in social media data. These models can learn to extract sentiment information from textual or visual content.
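
The lexicon-based approach is simple enough to sketch directly. The tiny lexicon below is invented for illustration; real projects typically use an established lexicon with thousands of scored entries, and this sketch does not handle negation ("not good") or sarcasm.

```python
import re

# Hypothetical mini-lexicon: word -> sentiment score.
LEXICON = {
    "love": 2, "great": 2, "good": 1,
    "bad": -1, "terrible": -2, "hate": -2,
}

def sentiment_score(text, lexicon=LEXICON):
    """Sum the lexicon scores of every word in the post."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

def classify(text):
    """Map the aggregate score to a sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

posts = [
    "I love this phone, great battery!",
    "Terrible service, I hate waiting.",
    "The package arrived on Tuesday.",
]
labels = [classify(post) for post in posts]
print(labels)
```

A lexicon needs no labeled training data, which makes it a useful baseline to compare against the machine learning and deep learning approaches.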

It is important to note that the choice of analysis approach depends on the specific research objectives, available resources, and the nature of the social media data being analyzed.
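As a rough illustration of the lexicon-based approach described above, the sketch below scores each post by summing per-word weights and maps the total to a label. The lexicon, its weights, and the sample posts are invented for this example; a real project would likely start from an established sentiment lexicon and handle negation, emoji, and slang.

```python
import re

# Tiny illustrative lexicon; these words and weights are assumptions,
# not taken from any published sentiment lexicon.
LEXICON = {"love": 2, "great": 2, "good": 1, "bad": -1, "terrible": -2, "hate": -2}


def score_post(text):
    """Sum the lexicon weights of the words in one post."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def label_post(text):
    """Map the aggregate score to a positive/negative/neutral label."""
    score = score_post(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical posts used only to exercise the functions.
posts = [
    "I love this phone, the camera is great",
    "Terrible battery, I hate it",
    "It arrived on Tuesday",
]
labels = [label_post(p) for p in posts]
```

Aggregating `score_post` over all posts that mention a keyword or hashtag gives a simple topic-level sentiment estimate.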

Analyzing social media sentiment has a wide range of applications across different industries. Here are a few examples:

  • Brand reputation management: By analyzing social media sentiment, businesses can monitor and manage their brand reputation. They can identify potential issues, respond to customer feedback, and take proactive measures to maintain a positive image.
  • Market research: Social media sentiment analysis can provide valuable insights into consumer opinions and preferences. Businesses can use this information to understand market trends, identify customer needs, and develop targeted marketing strategies.
  • Customer feedback analysis: Social media sentiment analysis can help businesses understand customer satisfaction levels and identify areas for improvement. By analyzing sentiment in customer feedback, companies can make data-driven decisions to enhance their products or services.
  • Public opinion analysis: Researchers can analyze social media sentiment to study public opinion on various topics, such as political events, social issues, or product launches. This information can be used to understand public sentiment, predict trends, and inform decision-making.

These are just a few examples of how analyzing social media sentiment can be applied in real-world scenarios. The insights gained from sentiment analysis can help businesses and researchers make informed decisions, improve customer experience, and drive innovation.

Project Idea #8: Improving Online Ad Targeting

Improving online ad targeting involves analyzing various data sources to gain insights into users’ preferences and behaviors. These data sources may include:

  • Website analytics: Gathering data from websites to understand user engagement, page views, and click-through rates.
  • Demographic data: Utilizing information such as age, gender, location, and income to create targeted ad campaigns.
  • Social media data: Extracting data from platforms like Facebook, Twitter, and Instagram to understand users’ interests and online behavior.
  • Search engine data: Analyzing search queries and user behavior on search engines to identify intent and preferences.

By combining and analyzing these diverse data sources, data scientists can gain a comprehensive understanding of users and their ad preferences.

To improve online ad targeting, data scientists can employ various analysis approaches:

  • Segmentation analysis: Dividing users into distinct groups based on shared characteristics and preferences.
  • Collaborative filtering: Recommending ads based on users with similar preferences and behaviors.
  • Predictive modeling: Developing algorithms to predict users’ likelihood of engaging with specific ads.
  • Machine learning: Utilizing algorithms that can continuously learn from user interactions to optimize ad targeting.

These analysis approaches help data scientists uncover patterns and insights that can enhance the effectiveness of online ad campaigns.
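To make the collaborative filtering idea above more concrete, here is a minimal user-based sketch. It assumes click counts per ad category are already available; the user names, categories, and counts are made up, and production systems typically use matrix factorization or learned embeddings rather than this direct similarity loop.

```python
import math

# Hypothetical click counts: user -> {ad_category: clicks}. All data is made up.
clicks = {
    "ana": {"shoes": 3, "bags": 1},
    "ben": {"shoes": 2, "bags": 1, "watches": 4},
    "cy": {"laptops": 5, "phones": 2},
}


def cosine(a, b):
    """Cosine similarity between two sparse click vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, data):
    """Rank unseen ad categories by similarity-weighted clicks of other users."""
    scores = {}
    for other, vec in data.items():
        if other == user:
            continue
        sim = cosine(data[user], vec)
        if sim <= 0:
            continue  # ignore users with no overlap in behaviour
        for ad, count in vec.items():
            if ad not in data[user]:
                scores[ad] = scores.get(ad, 0.0) + sim * count
    return sorted(scores, key=scores.get, reverse=True)


top_ads = recommend("ana", clicks)
```

Here "ana" shares clicks with "ben" but not with "cy", so only categories that "ben" engaged with are candidates.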

Improved online ad targeting has numerous applications:

  • Increased ad revenue: By delivering more relevant ads to users, advertisers can expect higher click-through rates and conversions.
  • Better user experience: Users are more likely to engage with ads that align with their interests, leading to a more positive browsing experience.
  • Reduced ad fatigue: By targeting ads more effectively, users are less likely to feel overwhelmed by irrelevant or repetitive advertisements.
  • Maximized ad budget: Advertisers can optimize their budget by focusing on the most promising target audiences.

Project Idea #9: Enhancing Customer Segmentation

Enhancing customer segmentation involves gathering relevant data from various sources to gain insights into customer behavior, preferences, and demographics. Some common data sources include:

  • Customer transaction data
  • Customer surveys and feedback
  • Social media data
  • Website analytics
  • Customer support interactions

By combining data from these sources, businesses can create a comprehensive profile of their customers and identify patterns and trends that will help in improving their segmentation strategies.

There are several analysis approaches that can be used to enhance customer segmentation:

  • Clustering: Using clustering algorithms to group customers based on similar characteristics or behaviors.
  • Classification: Building predictive models to assign customers to different segments based on their attributes.
  • Association Rule Mining: Identifying relationships and patterns in customer data to uncover hidden insights.
  • Sentiment Analysis: Analyzing customer feedback and social media data to understand customer sentiment and preferences.

These analysis approaches can be used individually or in combination to enhance customer segmentation and create more targeted marketing strategies.
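The clustering approach can be sketched with a plain k-means loop. The customer features and values below are hypothetical, and the centroids are seeded with the first k points only to keep the example repeatable. In practice, features should be scaled first and a library implementation with better initialization is usually preferable.

```python
import math

# Hypothetical customers as (annual_spend, visits_per_month); values are made up.
customers = [(100, 1), (120, 2), (110, 1), (900, 10), (950, 12), (870, 9)]


def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points for repeatability."""
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


segments, centers = kmeans(customers, k=2)
```

With this toy data the low-spend and high-spend customers end up in separate segments, which is the kind of grouping a segmentation project would then profile and name.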

Enhancing customer segmentation can have numerous applications across industries:

  • Personalized marketing campaigns: By understanding customer preferences and behaviors, businesses can tailor their marketing messages to individual customers, increasing the likelihood of engagement and conversion.
  • Product recommendations: By segmenting customers based on their purchase history and preferences, businesses can provide personalized product recommendations, leading to higher customer satisfaction and sales.
  • Customer retention: By identifying at-risk customers and understanding their needs, businesses can implement targeted retention strategies to reduce churn and improve customer loyalty.
  • Market segmentation: By identifying distinct customer segments, businesses can develop tailored product offerings and marketing strategies for each segment, maximizing the effectiveness of their marketing efforts.

Project Idea #10: Building a Chatbot

A chatbot is a computer program that uses artificial intelligence to simulate human conversation. It can interact with users in natural language through text or voice. Building a chatbot can be an exciting and challenging data science capstone project.

It requires a combination of natural language processing, machine learning, and programming skills.

When building a chatbot, data sources play a crucial role in training and improving its performance. There are various data sources that can be used:

  • Chat logs: Analyzing existing chat logs can help in understanding common user queries, responses, and patterns. This data can be used to train the chatbot on how to respond to different types of questions and scenarios.
  • Knowledge bases: Integrating a knowledge base can provide the chatbot with a wide range of information and facts. This can be useful in answering specific questions or providing detailed explanations on certain topics.
  • APIs: Utilizing APIs from different platforms can enhance the chatbot’s capabilities. For example, integrating a weather API can allow the chatbot to provide real-time weather information based on user queries.

There are several analysis approaches that can be used to build an efficient and effective chatbot:

  • Natural Language Processing (NLP): NLP techniques enable the chatbot to understand and interpret user queries. This involves tasks such as tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis.
  • Intent recognition: Identifying the intent behind user queries is crucial for providing accurate responses. Machine learning algorithms can be trained to classify user intents based on the input text.
  • Contextual understanding: Chatbots need to understand the context of the conversation to provide relevant and meaningful responses. Techniques such as sequence-to-sequence models or attention mechanisms can be used to capture contextual information.
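The intent recognition step above can be illustrated with a small multinomial Naive Bayes classifier over bag-of-words counts. The training utterances and intent names are invented for this sketch, and Naive Bayes is just one simple choice of classifier; a real chatbot would need far more data and likely a stronger model.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training utterances; the intents and phrases are made up.
TRAINING = [
    ("what is the weather today", "weather"),
    ("will it rain tomorrow", "weather"),
    ("book a table for two", "booking"),
    ("reserve a table tonight", "booking"),
]


def train(examples):
    """Count words per intent for a multinomial Naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    for text, intent in examples:
        intent_counts[intent] += 1
        word_counts[intent].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, intent_counts, vocab


def predict(text, model):
    """Return the intent with the highest log-probability (add-one smoothing)."""
    word_counts, intent_counts, vocab = model
    total = sum(intent_counts.values())
    best_intent, best_score = None, -math.inf
    for intent, n in intent_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[intent].values()) + len(vocab)
        for word in text.split():
            score += math.log((word_counts[intent][word] + 1) / denom)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


model = train(TRAINING)
intent = predict("reserve a table for tomorrow", model)
```

Once an intent is predicted, the chatbot can route the query to a matching handler, for example a weather API call or a booking flow.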

Chatbots have a wide range of applications in various industries:

  • Customer support: Chatbots can be used to handle customer queries and provide instant support. They can assist with common troubleshooting issues, answer frequently asked questions, and escalate complex queries to human agents when necessary.
  • E-commerce: Chatbots can enhance the shopping experience by assisting users in finding products, providing recommendations, and answering product-related queries.
  • Healthcare: Chatbots can be deployed in healthcare settings to provide preliminary medical advice, answer general health-related questions, and assist with appointment scheduling.

Building a chatbot as a data science capstone project not only showcases your technical skills but also allows you to explore the exciting field of artificial intelligence and natural language processing.

It can be a great opportunity to create a practical and useful tool that can benefit users in various domains.

Completing an in-depth capstone project is the perfect way for data science students to demonstrate their technical skills and business acumen. This guide outlined 10 unique project ideas spanning industries like healthcare, transportation, finance, and more.

By identifying the ideal data sources, analysis techniques, and practical applications for their chosen project, students can produce an impressive capstone that solves real-world problems and showcases their abilities.


sample capstone project for data science

Environmental Monitoring, remote sensing, cyber-physical systems, Engineers for Exploration

E4E Microfaune Project

  • Group members: Jinsong Yang, Qiaochen Sun

Abstract: Human activities such as wildfires and hunting have become the largest factor with serious negative effects on biodiversity. To understand how anthropogenic activities affect wildlife populations, field biologists use automated image classification driven by neural networks to extract biodiversity information from images. For small animals such as insects or birds, however, cameras do not work well: it is extremely hard for them to capture the movement and activity of animals this small. To address this problem, passive acoustic monitoring (PAM) has become one of the most popular methods. Sounds collected through PAM can be used to train machine learning models that track fluctuations in the biodiversity of these small animals. The goal of the overall program is to measure the biodiversity of these small animals (most of them birds), and it is divided into many smaller parts. Jinsong and I focus on an intermediate step. The goal of our project is to generate subsets of audio recordings that have a higher probability of containing vocalizations of interest, which saves our labeling volunteers time and energy. This reduces the time and resources required to gather enough training data for species-level classifiers. Our approach is the same as in AID_NeurIPS_2021; only the data differs between the two GitHub repositories. For this repository, we use the Peru data instead of the Coastal_Reserve data.

  • Group members: Harsha Jagarlamudi, Kelly Kong

Eco-Acoustic Event Detection: Classifying temporal presence of birds in recorded bird vocalization audio

  • Group members: Alan Arce, Edmundo Zamora

Abstract: We leverage deep learning methods to classify the temporal presence of birds in recorded bird vocalization audio, using a hybrid CNN-RNN model trained on audio data, in the interest of benefiting wildlife monitoring and preservation.

Pyrenote - User Profile Design & Accessible Data

  • Group members: Dylan Nelson

Abstract: Pyrenote is a project in development by a growing group of student researchers here at UCSD. Its primary purpose is to allow anyone to contribute to research by labeling data in an intuitive and accessible way. It is currently being used to develop a kind of voice recognition for birds. The goal is to build an algorithm that can strongly label data (say where in the clip a bird is calling and which bird is making the call). To do this, a very large dataset needs to be labeled. I worked mostly on the user experience side, allowing users to interact with their labeling in new ways, such as keeping tabs on their progress and reaching goals. Developing a User Profile page was the primary way of presenting this data, and the page was developed iteratively as a whole new page for the site.

Pyrenote Webdeveloper

  • Group members: Wesley Zhen

Abstract: The website, Pyrenote, is helping scientists track bird populations by identifying them using machine learning classifiers on publicly annotated audio recordings. I have implemented three features over the course of two academic quarters aimed at streamlining user experience and improving scalability. The added scalability will be useful for future projects as we start becoming more ambitious with the number of users we bring to the site.

Spread of Misinformation Online

Who Is Spreading Misinformation and Worries on Twitter

  • Group members: Lehan Li, Ruojia Tao

Abstract: The spread of misinformation through social media posts challenges daily information intake and exchange. Especially during the current COVID-19 pandemic, the spread of misinformation about the disease and vaccination threatens individuals' wellbeing and general public health. People's worries also increase with misinformation, such as claims of food and water shortages. This project seeks to investigate the spread of misinformation over social media (Twitter) during the COVID-19 pandemic. Two main directions are investigated. The first is the effect of bot users on the spread of misinformation: what role do bot users play in spreading misinformation, and where are they located in the social network? The second is a sentiment analysis that examines users' attitudes towards misinformation: how does sentiment spread across different places in the social network? We also combine the two directions: what is the relationship between bot users and positive and negative emotions? Since users of online social media form social networks, the project also investigates the effect of the social network on these two topics. Moreover, the project explores how the proportion of bot users and users' attitudes towards misinformation change as the social network becomes more concentrated and tightly connected.

Misinformation on Reddit

  • Group members: Samuel Huang, David Aminifard

Abstract: As social media platforms, notably Reddit, have grown in popularity, their use for rapidly sharing information by category or topic (subreddits) has had massive implications for how people are usually exposed to information and the quality of the information they interact with. While Reddit has its benefits, e.g. providing instant access to nearly real-time, categorized information, it has possibly played a role in worsening divisions and the spread of misinformation. Our results showed that subreddits with the highest proportions of misinformation posts tend to lean more towards politics and news. In addition, we found that despite the frequency of misinformation per subreddit, the average upvote ratio per submission seemed consistently high, which indicated that subreddits tend to be ideologically homogeneous.

The Spread of YouTube Misinformation Through Twitter

  • Group members: Alisha Sehgal, Anamika Gupta

Abstract: In our Capstone Project, we explore the spread of misinformation online. More specifically, we look at the spread of misinformation across Twitter and YouTube because of the large role these two social media platforms play in the dissemination of news and information. Our main objectives are to understand how YouTube videos contribute to spreading misinformation on Twitter, evaluate how effectively YouTube is removing misinformation and if these policies also prevent users from engaging with misinformation. We take a novel approach of analyzing tweets, YouTube video captions, and other metadata using NLP to determine the presence of misinformation and investigate how individuals interact or spread misinformation. Our research focuses on the domain of public health as this is the subject of many conspiracies, varying opinions, and fake news.

Particle Physics

Understanding Higgs Boson Particle Jets with Graph Neural Networks

  • Group members: Charul Sharma, Rui Lu, Bryan Ambriz

Abstract: Extending last quarter's work on a deep sets neural network, a fully connected neural network classifier, an adversarial deep set model, and a designed decorrelated tagger (DDT), this quarter we went further by trying different neural network layers such as GENConv and EdgeConv. These layers play an important role in boosting the performance of our basic GNN model. We also evaluated the model using ROC (receiver operating characteristic) curves and the AUC (area under the curve). Based on our experience with project one and past projects in the particle physics domain, we also decided to add an exploratory data analysis section covering some basic theory, bootstrapping, and a common-sense understanding of our dataset. We have not yet produced optimal results, although the EdgeConv part is finished; in the following weeks, we would like to finish the GENConv part and may try other layers to see whether they can increase the performance of our model.
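For readers unfamiliar with the AUC metric mentioned in this abstract, the sketch below computes it directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores are hypothetical and are not results from the project; the project's own evaluation code is not shown in the source.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical tagger outputs: 1 = signal jet, 0 = background jet.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of examples, so it suits small checks; for large datasets a rank-based implementation from a standard library is the usual choice.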

Predicting a Particle's True Mass

  • Group members: Jayden Lee, Dan Ngo, Isac Lee

Abstract: The Large Hadron Collider (LHC) collides protons traveling near light speed to generate high-energy collisions. These collisions produce new particles and have led to the discovery of new elementary particles (e.g., the Higgs boson). One key piece of information to collect from a collision event is the structure of the particle jet, a collective spray of decaying particles that travel in the same direction. Accurately identifying the type of these jets, QCD or signal, plays a crucial role in the discovery of high-energy elementary particles like the Higgs particle. Several properties determine jet type, with jet mass being one of the strongest indicators in jet type classification. A previous method of jet mass estimation, called “soft drop declustering,” has been one of the most effective at making rough estimates of the jet mass. With this in mind, we aim to apply machine learning to jet mass estimation through various neural network architectures. With data collected and processed by CERN, we implemented a model capable of improving jet mass prediction through jet features.

Mathematical Signal Processing (compression of deep nets, or optimization for data-science/ML)

Graph Neural Networks, Graph Neural Network Based Recommender Systems for Spotify Playlists

  • Group members: Benjamin Becze, Jiayun Wang, Shone Patil

Abstract: With the rise of music streaming services on the internet in the 2010s, many have moved away from radio stations to streaming services like Spotify and Apple Music. This shift offers more specificity and personalization to users’ listening experiences, especially with the ability to create playlists of whatever songs they wish. Oftentimes the songs in a user playlist share a genre or theme, and some streaming services like Spotify offer recommendations to expand a user’s existing playlist based on the songs in it. Using Node2vec and GraphSAGE graph neural network methods, we set out to create a recommender system for songs to add to an existing playlist by drawing information from a vast graph of songs we built from playlist co-occurrences. The result is a personalized song recommender based not only on Spotify’s community of playlist creators, but also the specific features within a song.
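The abstract mentions a song graph built from playlist co-occurrences. One plausible reading of that step is sketched below: two songs are linked whenever they share a playlist, with the edge weight counting shared playlists. The playlists and song names are placeholders, and the authors' actual graph construction and their Node2vec and GraphSAGE models are not shown in the source.

```python
from collections import Counter
from itertools import combinations

# Hypothetical playlists; song names are placeholders.
playlists = [
    ["song_a", "song_b", "song_c"],
    ["song_a", "song_b"],
    ["song_b", "song_d"],
]


def build_cooccurrence(playlists):
    """Edge weight = number of playlists in which two songs appear together."""
    edges = Counter()
    for playlist in playlists:
        for u, v in combinations(sorted(set(playlist)), 2):
            edges[(u, v)] += 1
    return edges


def neighbours(song, edges):
    """Songs linked to `song`, strongest co-occurrence first."""
    linked = {}
    for (u, v), weight in edges.items():
        if u == song:
            linked[v] = weight
        elif v == song:
            linked[u] = weight
    return sorted(linked, key=lambda s: (-linked[s], s))


graph = build_cooccurrence(playlists)
candidates = neighbours("song_a", graph)
```

Ranking neighbours by edge weight is only a naive baseline; graph embedding methods learn representations from this kind of graph rather than reading recommendations off it directly.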

Dynamic Stock Industry Classification

  • Group members: Sheng Yang

Abstract: Use Graph-based Analysis to Re-classify Stocks in China A-share and Improve Markowitz Portfolio Optimization

NLP, Misinformation

HDSI Faculty Exploration Tool

  • Group members: Martha Yanez, Sijie Liu, Siddhi Patel, Brian Qian

Abstract: The Halıcıoğlu Data Science Institute (HDSI) at University of California, San Diego is dedicated to the discovery of new methods and training of students and faculty to use data science to solve problems in the current world. The HDSI has several industry partners that are often searching for assistance with their daily activities and need experts in different domain areas. Currently, there are around 55 professors affiliated with HDSI. They all have diverse research interests and have written numerous papers in their own fields. Our goal was to create a tool that allows HDSI to select the best fit from their faculty, based on their published work, to aid their industry partners in their specific endeavors. We did this with Natural Language Processing (NLP) by collecting all the abstracts from the faculty’s published work and organizing them by topic. We then obtained the proportion of each faculty member's papers associated with each topic and drew a relationship between researchers and their most published topics. This will allow HDSI to personalize recommendations of faculty candidates to an industry partner’s particular job.

  • Group members: Du Xiang

AI in Healthcare, Deep Reinforcement Learning, Trustworthy Machine Learning

Improving Robustness in Deep Fusion Modeling Against Adversarial Attacks

  • Group members: Ayush More, Amy Nguyen

Abstract: Autonomous vehicles rely heavily on deep fusion modeling, which utilizes multiple inputs for its inferences and decision making. By using the data from these inputs, the deep fusion model benefits from shared information, which is primarily associated with robustness as these input sources can face different levels of corruption. Thus, it is highly important that the deep fusion models used in autonomous vehicles are robust to corruption, especially to input sources that are weighted more heavily in different conditions. We explore a different approach to training the robustness of a deep fusion model through adversarial training. We fine-tune the model on adversarial examples and evaluate its robustness against single source noise and other forms of corruption. Our experimental results show that adversarial training was effective in improving the robustness of a deep fusion model object detector against adversarial noise and Gaussian noise while maintaining performance on clean data. The results also highlighted the lack of robustness of models that are not trained to handle adversarial examples. We believe that this is relevant given the risks that autonomous vehicles pose to pedestrians: it is important that we ensure the inferences and decisions made by the model are robust against corruption, especially if it is intentional from outside threats.

Healthcare: Adversarial Defense In Medical Deep Learning Systems

  • Group members: Rakesh Senthilvelan, Madeline Tjoa

Abstract: In order to combat such adversarial instances, these models need robust training to best protect against the methods that these attacks use on deep learning systems. In the scope of this paper, we will be looking into the fast gradient sign method and projected gradient descent, two methods used in adversarial attacks to maximize loss functions and cause the affected system to make opposing predictions, in order to train our models against them and allow for stronger accuracy when faced with adversarial examples.
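To show what the fast gradient sign method does, the sketch below applies one FGSM step to a logistic regression model, where the input gradient of the cross-entropy loss has the closed form (p - y) * w. The weights, input, and epsilon are hypothetical; the project itself targets medical deep learning systems, whose models and data are not shown in the source.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(x, y, w, b, epsilon):
    """One FGSM step: x + epsilon * sign(d loss / d x).

    For binary cross-entropy on logistic regression the input gradient is
    (p - y) * w, so no autodiff library is needed here.
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * si for xi, si in zip(x, sign)]


# Hypothetical weights and input; in practice these come from a trained model.
w, b = [2.0, -1.0], 0.0
x, y = [1.0, 0.5], 1
x_adv = fgsm(x, y, w, b, epsilon=0.5)
```

The perturbed input lowers the model's confidence in the true class. Projected gradient descent repeats smaller steps of this kind and projects the result back into an epsilon-ball around the original input, and adversarial training adds such examples to the training set.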

Satellite image analysis

ML for Finance, ML for Healthcare, Fair ML, ML for Science, Actionable Recourse

  • Group members: Shweta Kumar, Trevor Tuttle, Takashi Yabuta, Mizuki Kadowaki, Jeffrey Feng

Abstract: In American society today, reliance on credit is constantly encouraged, despite credit not being available to everyone as a legal right. Currently, there are countless methods in practice for evaluating an individual's creditworthiness. In an effort to regulate the selection criteria of different financial institutions, the Equal Credit Opportunity Act (ECOA) requires that applicants denied a loan are entitled to an Adverse Action notice, a statement from the creditor explaining the reason for the denial. However, these adverse action notices are frequently unactionable and ineffective in providing feedback to give an individual recourse, which is the ability to act upon a reason for denial to raise one’s odds of getting accepted for a loan. In our project, we will be exploring whether it is possible to create an interactive interface to personalize adverse action notices in alignment with personal preferences for individuals to gain recourse.

Social media; online communities; text analysis; ethics

Finding Commonalities in Misinformative Articles Across Topics

  • Group members: Hwang Yu, Maximilian Halvax, Lucas Nguyen

Abstract: To combat the large-scale distribution of misinformation online, we wanted to develop a way to flag news articles that are misinformative and could potentially mislead the general public. In addition to flagging news articles, we also wanted to find commonalities between the misinformation that we found. Do some topics contain more misleading information than others? How much do these articles overlap when we break their content down with TF-IDF and see which words carry the most importance in various models for detecting misinformation? We trained our models on four topics: economics, politics, science, and general, a dataset encompassing the three previous topics. We found that general had the most overlap overall, while the individual topics, though mostly different from one another, had certain models that still put emphasis on similar words, indicating a possible pattern of misinformative language in these articles. We believe, from these results, that we can find a pattern that could direct further investigation into how misinformation is written and distributed online.
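As a minimal sketch of the TF-IDF weighting this abstract refers to, the code below uses the basic tf * log(N / df) form. The mini-corpus is invented, and the exact TF-IDF variant, tokenization, and models used in the project are not stated in the source.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; the sentences are invented for illustration.
docs = [
    "vaccine causes harm claims viral post",
    "economy grows as markets rally",
    "viral post claims markets will crash",
]


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
```

Terms that appear in fewer documents, such as "economy" here, receive higher weights than terms shared across documents, such as "markets", which is what makes the weights useful for comparing which words matter in each topic.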

The Effect of Twitter Cancel Culture on the Music Industry

  • Group members: Peter Wu, Nikitha Gopal, Abigail Velasquez

Abstract: Musicians often trend on social media for various reasons, but in recent years there has been a rise in musicians being “canceled” for offensive or socially unacceptable behavior. Due to the wide accessibility of social media, the masses are able to hold musicians accountable for their actions through “cancel culture”, a form of modern ostracism. Twitter has become a well-known platform for “cancel culture” as users can easily spread hashtags and see what’s trending, which also has the potential to facilitate the spread of toxicity. We analyze how public sentiment towards canceled musicians on Twitter changes with respect to the type of issue they were canceled for, their background, and the strength of their parasocial relationship with their fans. Through our research, we aim to determine whether “cancel culture” leads to an increase in toxicity and negative sentiment towards a canceled individual.

Analyzing single cell multimodality data via (coupled) autoencoder neural networks

Coupled Autoencoders for Single-Cell Data Analysis

  • Group members: Alex Nguyen, Brian Vi

Abstract: Historically, analysis of single-cell data has been difficult to perform, because data collection methods often destroy the cell in the process of collecting information. However, an ongoing endeavor of biological data science has recently been to analyze different modalities, or forms, of the genetic information within a cell. Doing so will allow modern medicine a greater understanding of cellular functions and how cells work in the context of illnesses. Information on the three modalities of DNA, RNA, and protein can be collected safely, and because they are known to be the same information in different forms, analysis done on them can be extrapolated to understand the cell as a whole. Previous research has been conducted by Gala, R., Budzillo, A., Baftizadeh, F. et al. to capture gene expression in neuron cells with a neural network called a coupled autoencoder. This autoencoder framework is able to reconstruct the inputs, allowing the prediction of one input from another, as well as align the multiple inputs in the same low dimensional representation. In our paper, we build upon this coupled autoencoder on a data set of cells taken from several sites of the human body, predicting from RNA information to protein. We find that the autoencoder is able to adequately cluster the cell types in its lower dimensional representation, as well as perform decently at the prediction task. We show that the autoencoder is a powerful tool that may prove to be a valuable asset in single-cell data analysis.

Machine Learning, Natural Language Processing

On evaluating the robustness of language models with tuning.

  • Group members: Lechuan Wang, Colin Wang, Yutong Luo

Abstract: Prompt tuning and prefix tuning are two effective mechanisms to leverage frozen language models to perform downstream tasks. Robustness reflects models’ resilience of output under a change or noise in the input. In this project, we analyze the robustness of natural language models using various tuning methods with respect to a domain shift (i.e. training on a domain but evaluating on out-of-domain data). We apply both prompt tuning and prefix tuning on T5 models for reading comprehension (i.e. question-answering) and GPT-2 models for table-to-text generation.

Activity Based Travel Models and Feature Selection

A tree-based model for activity based travel models and feature selection.

  • Group members: Lisa Kuwahara, Ruiqin Li, Sophia Lau

Abstract: In a previous study, Deloitte Consulting LLP developed a method of creating city simulations through cellular location and geospatial data. Using these simulations of human activity and traffic patterns, better decisions can be made regarding modes of transportation or road construction. However, the current commonly used method of estimating transportation mode choice is a utility model that involves many features and coefficients that may not necessarily be important but still make the model more complex. Instead, we used a tree-based approach - in particular, XGBoost - to identify just the features that are important for determining mode choice so that we can create a model that is simpler, robust, and easily deployable, in addition to performing better than the original utility model on both the full dataset and population subsets.

Explainable AI, Causal Inference

Explainable AI.

  • Group members: Jerry Chan, Apoorv Pochiraju, Zhendong Wang, Yujie Zhang

Abstract: Nowadays, algorithmic decision-making systems have become very common in people’s daily lives. Gradually, some algorithms have become too complex for humans to interpret, such as some black-box machine learning models and deep neural networks. In order to assess the fairness of the models and make them better tools for different parties, we need explainable AI (XAI) to uncover the reasoning behind the predictions made by those black-box models. In our project, we focus on using different techniques from causal inference and explainable AI to interpret various classification models across various domains. In particular, we are interested in three domains - healthcare, finance, and the housing market. Within each domain, we first train four binary classification models, and we have four general goals: 1) explaining black-box models both globally and locally with various XAI methods; 2) assessing the fairness of each learning algorithm with regard to different sensitive attributes; 3) generating recourse for individuals - a set of minimal actions to change the prediction of those black-box models; 4) evaluating the explanations from those XAI methods using domain knowledge.

AutoML Platforms

Deep learning transformer models for feature type inference.

  • Group members: Andrew Shen, Tanveer Mittal

Abstract: The first step AutoML software must take after loading the data is to identify the feature types of the individual columns in the input. This information allows the software to understand the data and preprocess it so that machine learning algorithms can run on it. Project SortingHat of the ADA lab at UCSD frames this task of feature type inference as a multiclass classification problem. Machine learning models defined in the original SortingHat feature type inference paper use 3 sets of features as input: (1) the name of the given column, (2) 5 non-null sample values, and (3) descriptive numeric features about the column. The textual features are easy to access; however, the descriptive statistics that previous models rely on require a full pass through the data, which makes preprocessing less scalable. Our goal is to produce models that rely less on these statistics by better leveraging the textual features. As an extension of Project SortingHat, we experimented with deep learning transformer models and with varying the sample sizes used by random forest models. We found that our transformer models achieved state-of-the-art results on this task, outperforming all existing tools and ML models that have been benchmarked against SortingHat's ML Data Prep Zoo. Our best model used a pretrained Bidirectional Encoder Representations from Transformers (BERT) language model to produce word embeddings, which are then processed by a Convolutional Neural Network (CNN) model. As a result of this project, we have published 2 BERT CNN models using the PyTorch Hub API. This allows software engineers to easily integrate our models or train similar ones for use in AutoML platforms or other automated data preparation applications. Our best model uses all the features defined above, while the other uses only column names and sample values, offering comparable performance and much better scalability for all input data.
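As a rough illustration of the three input groups named in this abstract (column name, a handful of non-null sample values, and descriptive statistics), the sketch below builds them for a single column with the Python standard library. The helper name `column_features`, the example column, and the particular statistics are assumptions made for illustration; they are not the SortingHat feature set.

```python
import random
import statistics

def column_features(name, values, n_samples=5, seed=0):
    """Build three illustrative input groups for one column (hypothetical helper)."""
    non_null = [v for v in values if v is not None and v != ""]
    rng = random.Random(seed)
    # Group 2: up to n_samples non-null sample values, kept as text
    samples = [str(v) for v in rng.sample(non_null, min(n_samples, len(non_null)))]
    # Group 3: descriptive statistics, which need a full pass over the column
    numeric = []
    for v in non_null:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            pass
    stats = {
        "n_values": len(values),
        "pct_null": 1 - len(non_null) / len(values) if values else 0.0,
        "n_unique": len(set(non_null)),
        "pct_numeric": len(numeric) / len(non_null) if non_null else 0.0,
        "mean": statistics.fmean(numeric) if numeric else None,
    }
    # Group 1 is simply the column name
    return {"name": name, "samples": samples, "stats": stats}

# Made-up column used only to exercise the helper
features = column_features(
    "zip_code", ["92093", "92122", None, "92037", "92093", "", "92101"]
)
```

The sampling step touches only a few values, while the statistics loop reads every value, which is the scalability trade-off the abstract describes.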

Exploring Noise in Data: Applications to ML Models

  • Group members: Cheolmin Hwang, Amelia Kawasaki, Robert Dunn

Abstract: In machine learning, models are commonly built in such a way as to avoid what is known as overfitting. As it is generally understood, overfitting occurs when a model is fit exactly to the training data, causing poor accuracy on unseen examples. Therefore, in order to generalize beyond the examples found in a given training set, models are built with certain techniques to avoid fitting the data exactly. However, overfitting does not always behave the way one might expect, as will be shown by fitting models to data with a given level of noise. Specifically, some models fit exactly to data with high levels of noise still produce results with high accuracy, whereas others are more prone to overfitting.

Group Testing for Optimizing COVID-19 Testing

COVID-19 group testing optimization strategies.

  • Group members: Mengfan Chen, Jeffrey Chu, Vincent Lee, Ethan Dinh-Luong

Abstract: The COVID-19 pandemic, which has persisted for more than two years, has been combated by efficient testing strategies that reliably identify positive individuals to slow the spread of the pandemic. As opposed to other pooling strategies within the domain, the methods described in this paper prioritize true negative samples over overall accuracy. In the Monte Carlo simulations, both nonadaptive and adaptive testing strategies with random pool sampling resulted in high accuracy approaching at least 95% with varying pooling sizes and population sizes to decrease the number of tests given. A split tensor rank 2 method attempts to identify all infected samples within 961 samples, converging the number of tests to 99 as the prevalence of infection converges to 1%.
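To give a feel for how a Monte Carlo simulation of pooled testing can count tests, here is a minimal sketch of classic two-stage (Dorfman) pooling with a perfect test. It is not the split tensor method from the abstract. The population of 961 and the 1% prevalence are taken from the abstract, while the pool size of 10, the number of trials, and the function names are assumptions.

```python
import random

def dorfman_tests(infected, pool_size):
    """Count tests used by two-stage pooling, assuming a perfect test."""
    tests = 0
    for start in range(0, len(infected), pool_size):
        pool = infected[start:start + pool_size]
        tests += 1                      # one test on the pooled sample
        if any(pool) and len(pool) > 1:
            tests += len(pool)          # retest each member of a positive pool
    return tests

def simulate(population=961, prevalence=0.01, pool_size=10, trials=200, seed=0):
    """Monte Carlo estimate of the mean number of tests per screening round."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        infected = [rng.random() < prevalence for _ in range(population)]
        total += dorfman_tests(infected, pool_size)
    return total / trials

mean_tests = simulate()
```

Comparing `mean_tests` with the population size shows how much pooling saves at low prevalence. Varying `pool_size` and `prevalence` is the kind of sweep such a simulation makes cheap.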

Causal Discovery

Patterns of fairness in machine learning.

  • Group members: Daniel Tong, Anne Xu, Praveen Nair

Abstract: Machine learning tools are increasingly used for decision-making in contexts that have crucial ramifications. However, a growing body of research has established that machine learning models are not immune to bias, especially on protected characteristics. This has led to efforts to create mathematical definitions of fairness that could be used to estimate whether, given a prediction task and a certain protected attribute, an algorithm is being fair to members of all classes. But just as philosophical definitions of fairness can vary widely, mathematical definitions of fairness vary as well, and fairness conditions can in fact be mutually exclusive. In addition, the choice of model to use to optimize fairness is also a difficult decision we have little intuition for. Consequently, our capstone project centers around an empirical analysis for studying the relationships between machine learning models, datasets, and various fairness metrics. We produce a 3-dimensional matrix of the performance of a certain machine learning model, for a certain definition of fairness, for a certain given dataset. Using this matrix on a sample of 8 datasets, 7 classification models, and 9 fairness metrics, we discover empirical relationships between model type and performance on specific metrics, in addition to correlations between metric values across different dataset-model pairs. We also offer a website and command-line interface for users to perform this experimentation on their own datasets.
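The point that fairness definitions can conflict is easy to see on a toy example. The sketch below computes two common metrics, demographic parity difference and equal opportunity difference, on made-up labels. The function names and data are assumptions, and the abstract does not say which 9 metrics the project used.

```python
def rate(values):
    """Fraction of 1s in a list of 0/1 values (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0

def demographic_parity_diff(y_pred, group):
    """Positive-prediction rate of group 1 minus that of group 0."""
    p1 = rate([p for p, g in zip(y_pred, group) if g == 1])
    p0 = rate([p for p, g in zip(y_pred, group) if g == 0])
    return p1 - p0

def equal_opportunity_diff(y_true, y_pred, group):
    """True positive rate of group 1 minus that of group 0."""
    t1 = rate([p for t, p, g in zip(y_true, y_pred, group) if g == 1 and t == 1])
    t0 = rate([p for t, p, g in zip(y_true, y_pred, group) if g == 0 and t == 1])
    return t1 - t0

# Made-up data: the two groups have different base rates of the positive label
group  = [1, 1, 1, 1, 0, 0, 0, 0]
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_pred = list(y_true)  # a perfectly accurate classifier

dp = demographic_parity_diff(y_pred, group)
eo = equal_opportunity_diff(y_true, y_pred, group)
```

Here the perfectly accurate classifier satisfies equal opportunity (`eo` is 0.0) but violates demographic parity (`dp` is 0.5), because the groups differ in how often the positive label occurs.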

Causal Effects of Socioeconomic and Political Factors on Life Expectancy in 166 Different Countries

  • Group members: Adam Kreitzman, Maxwell Levitt, Emily Ramond

Abstract: This project examines causal relationships between various socioeconomic variables and life expectancy outcomes in 166 different countries. It can account for new, unseen data and variables through an intuitive data pipeline with detailed instructions, and it uses the PC algorithm with updated code to account for missingness in data. With access to this model and pipeline, we hope that questions such as “do authoritarian countries have a direct relation to life expectancy?” or “how does women in government affect perceived notion of social support?” will now be able to be answered and understood. Through our own analysis, we were able to find intriguing results, such as that a higher Perception of Corruption is distinctly related to a lower Life Ladder score. We also found that higher quality-of-life perceptions are related to lower economic inequality. These results aim to educate not only the general public, but government officials as well.

Time series analysis in health

Time series analysis on the effect of light exposure on sleep quality.

  • Group members: Shubham Kaushal, Yuxiang Hu, Alex Liu

Abstract: The increase in artificial light exposure through the increased prevalence of technology has an effect on the sleep cycle and circadian rhythm of humans. The goal of this project is to determine how different colors and intensities of light exposure prior to sleep affect the quality of sleep through the classification of time series data.

Sleep Stage Classification for Patients With Sleep Apnea

  • Group members: Kevin Chin, Yilan Guo, Shaheen Daneshvar

Abstract: Sleeping is not uniform and consists of four stages: N1, N2, N3, and REM sleep. The analysis of sleep stages is essential for understanding and diagnosing sleep-related diseases, such as insomnia, narcolepsy, and sleep apnea; however, sleep stage classification often does not generalize to patients with sleep apnea. The goal of our project is to build a sleep stage classifier specifically for people with sleep apnea and understand how it differs from the normal sleep stage. We will then explore whether or not the inclusion and featurization of ECG data will improve the performance of our model.

Environmental health exposures & pollution modeling & land-use change dynamics

Supervised classification approach to wildfire mapping in northern california.

  • Group members: Alice Lu, Oscar Jimenez, Anthony Chi, Jaskaranpal Singh

Abstract: Burn severity maps are an important tool for understanding fire damage and managing forest recovery. We have identified several issues with current mapping methods used by federal agencies that affect the completeness, consistency, and efficiency of their burn severity maps. In order to address these issues, we demonstrate the use of machine learning as an alternative to traditional methods of producing severity maps, which rely on in-situ data and spectral indices derived from image algebra. We have trained several supervised classifiers on sample data collected from 17 wildfires across Northern California and evaluate their performance at mapping fire severity.

Network Performance Classification

Network signal anomaly detection.

  • Group members: Laura Diao, Benjamin Sam, Jenna Yang

Abstract: Network degradation occurs in many forms, and our project will focus on two common factors: packet loss and latency. Packet loss occurs when one or more data packets transmitted across a computer network fail to reach their destination. Latency can be defined as a measure of delay for data to transmit across a network. For internet users, high rates of packet loss and significant latency can manifest in jitter or lag, which are indicators of overall poor network performance as perceived by the end user. Thus, when issues arise in these two factors, it would be beneficial for internet service providers to know exactly when the user is experiencing problems in real time. In real world scenarios, situations or environments such as poor port quality, overloaded ports, network congestion and more can impact overall network performance. In order to detect some of these issues in network transmission data, we built an anomaly detection system that predicts the estimated packet loss and latency of a connection and detects whether there is a significant degradation of network quality for the duration of the connection.

Real Time Anomaly Detection in Networks

  • Group members: Justin Harsono, Charlie Tran, Tatum Maston

Abstract: Internet companies are expected to deliver the speed their customers have paid for. However, for various reasons such as congestion or connectivity issues, it is inevitable for customers to perceive degradations in network quality. To still ensure the customer is satisfied, monitoring systems must be built to inspect the quality of the connection. Our goal is to build a model that is able to detect, in real time, these regions of network degradation, so that an appropriate recovery can be enacted to offset them. Our solution is a combination of two anomaly detection methods that successfully detects shifts in the data, based on a rolling window of data it has seen.
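One simple way to detect shifts from a rolling window is a trailing z-score, sketched below. The abstract does not name its two methods, so this is a generic stand-in, not the authors' detector. The window length, threshold, function name, and latency values are assumptions.

```python
from collections import deque
import statistics

def rolling_anomalies(values, window=20, threshold=3.0):
    """Return indices of points far from the mean of the trailing window."""
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(history) == window:
            mean = statistics.fmean(history)
            std = statistics.pstdev(history)
            # Flag the point if it lies more than `threshold` deviations away
            if std > 0 and abs(x - mean) > threshold * std:
                flagged.append(i)
        history.append(x)  # the window only ever holds data already seen
    return flagged

# Made-up latency trace in milliseconds with one spike at index 40
latency_ms = (
    [20 + (i % 5) for i in range(40)] + [95] + [20 + (i % 5) for i in range(10)]
)
flagged = rolling_anomalies(latency_ms)
```

Because each point is compared only with earlier observations, the same loop can run on a live stream, one measurement at a time.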

System Usage Reporting

Intel telemetry: data collection & time-series prediction of app usage.

  • Group members: Srikar Prayaga, Andrew Chin, Arjun Sawhney

Abstract: Despite advancements in hardware technology, PC users continue to face frustrating app launch times, especially on lower-end Windows machines. The desktop experience differs vastly from the instantaneous app launches and optimized experience we have come to expect even from low-end smartphones. We propose a solution to preemptively run Windows apps in the background based on the app usage patterns of the user. Our solution has two steps. First, we built telemetry collector modules in C/C++ to collect real-world app usage data from two of our personal Windows 10 devices. Next, we developed neural network models in Python, trained on the collected data, to predict app usage times and corresponding launch sequences. We achieved impressive results on selected evaluation metrics across different user profiles.

Predicting Application Use to Reduce User Wait Time

  • Group members: Sasami Scott, Timothy Tran, Andy Do

Abstract: Our goal for this project was to lower the user wait time when loading programs by predicting the next used application. In order to obtain the needed data, we created data collection libraries. Using this data, we created a Hidden Markov Model (HMM) and a Long Short-Term Memory (LSTM) model, but the latter proved to be better. Using LSTM, we can predict the application use time and expand this concept to more applications. We created multiple LSTM models with varying results, but ultimately chose a model that we think had potential. We decided on using the model that reported a 90% accuracy.

INTELlinext: A Fully Integrated LSTM and HMM-Based Solution for Next-App Prediction With Intel SUR SDK Data Collection

  • Group members: Jared Thach, Hiroki Hoshida, Cyril Gorlla

Abstract: As the power of modern computing devices increases, so too do user expectations for them. Despite advancements in technology, computer users are often faced with the dreaded spinning icon waiting for an application to load. Building upon our previous work developing data collectors with the Intel System Usage Reporting (SUR) SDK, we introduce INTELlinext, a comprehensive solution for next-app prediction for application preload to improve perceived system fluidity. We develop a Hidden Markov Model (HMM) for prediction of the k most likely next apps, achieving an accuracy of 64% when k = 3. We then implement a long short-term memory (LSTM) model to predict the total duration that applications will be used. After hyperparameter optimization leading to an optimal lookback value of 5 previous applications, we are able to predict the usage time of a given application with a mean absolute error of ~45 seconds. Our work constitutes a promising comprehensive application preload solution with data collection based on the Intel SUR SDK and prediction with machine learning.
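The idea of predicting the k most likely next apps can be sketched with a plain first-order Markov chain over the observed apps. This is simpler than the hidden Markov model described in the abstract, since it has no hidden states. The app names, the usage log, and the function names below are made up for illustration.

```python
from collections import Counter, defaultdict

def fit_transitions(sequence):
    """Count how often each app is followed by each other app."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return counts

def top_k_next(counts, current, k=3):
    """Return up to k apps most often seen right after `current`."""
    return [app for app, _ in counts.get(current, Counter()).most_common(k)]

def top_k_accuracy(counts, sequence, k=3):
    """Share of transitions whose true next app is among the top-k guesses."""
    hits = total = 0
    for current, nxt in zip(sequence, sequence[1:]):
        total += 1
        hits += nxt in top_k_next(counts, current, k)
    return hits / total if total else 0.0

# Made-up foreground-app log
log = ["chrome", "code", "chrome", "slack", "chrome",
       "code", "terminal", "code", "chrome"]
counts = fit_transitions(log)
best_guess = top_k_next(counts, "chrome", k=1)
train_accuracy = top_k_accuracy(counts, log, k=3)
```

Scoring on the same log that was used for fitting, as done here, overstates accuracy. A real evaluation would hold out later sessions.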

EveThan/IBM-Applied-Data-Science-Capstone-Project

IBM Applied Data Science Capstone Project

The PowerPoint slides for this project can be found at Capstone_Presentation.pptx or Capstone_Presentation.pdf.

Executive summary

In this capstone project, we will predict if the SpaceX Falcon 9 first stage will land successfully using several machine learning classification algorithms. The main steps in this project include:

  • Data collection, wrangling, and formatting
  • Exploratory data analysis
  • Interactive data visualization
  • Machine learning prediction

Our graphs show that some features of the rocket launches have a correlation with the outcome of the launches, i.e., success or failure. It is also concluded that decision tree may be the best machine learning algorithm to predict if the Falcon 9 first stage will land successfully.

Introduction

In this capstone, we will predict if the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website with a cost of 62 million dollars; other providers cost upward of 165 million dollars each. Much of the savings is because SpaceX can reuse the first stage. Therefore, if we can determine if the first stage will land, we can determine the cost of a launch. This information can be used if an alternate company wants to bid against SpaceX for a rocket launch.

Most unsuccessful landings are planned. Sometimes, SpaceX will perform a controlled landing in the ocean. The main question that we are trying to answer is, for a given set of features about a Falcon 9 rocket launch which include its payload mass, orbit type, launch site, and so on, will the first stage of the rocket land successfully?

Methodology

The overall methodology includes:

  • Data collection, wrangling, and formatting, using:
      • Web scraping
  • Exploratory data analysis (EDA), using:
      • Pandas and NumPy
  • Data visualization, using:
      • Matplotlib and Seaborn
  • Machine learning prediction, using:
      • Logistic regression
      • Support vector machine (SVM)
      • Decision tree
      • K-nearest neighbors (KNN)

Data collection using SpaceX API

1_Data Collection API.ipynb

Libraries or modules used: requests, pandas, numpy, datetime

  • The API used is here.
  • The API provides data about many types of rocket launches done by SpaceX, so the data is filtered to include only Falcon 9 launches.
  • The API is accessed using requests.get().
  • The JSON result is converted to a dataframe using the json_normalize() function from pandas.
  • Every missing value in the data is replaced by the mean of the column that the missing value belongs to.
  • We end up with 90 rows or instances and 17 columns or features.
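The filtering and mean-imputation steps listed above can be sketched without any third-party library. The field names `rocket` and `payload_mass_kg`, the helper name, and the rows are hypothetical, and the notebook itself works on a pandas dataframe, not on a list of dictionaries.

```python
import statistics

def falcon9_with_mean_imputation(rows, column):
    """Keep Falcon 9 rows, then fill missing values in `column` with its mean."""
    falcon9 = [r for r in rows if r["rocket"] == "Falcon 9"]
    observed = [r[column] for r in falcon9 if r[column] is not None]
    mean = statistics.fmean(observed)
    # Build new dictionaries so the input rows are left untouched
    return [{**r, column: mean if r[column] is None else r[column]} for r in falcon9]

# Made-up records standing in for the normalized API response
launches = [
    {"rocket": "Falcon 1", "payload_mass_kg": 20.0},
    {"rocket": "Falcon 9", "payload_mass_kg": 500.0},
    {"rocket": "Falcon 9", "payload_mass_kg": None},
    {"rocket": "Falcon 9", "payload_mass_kg": 700.0},
]
cleaned = falcon9_with_mean_imputation(launches, "payload_mass_kg")
```

The mean is computed after filtering, so launches of other rockets do not influence the imputed value. With pandas, the same fill is usually written as `df[col].fillna(df[col].mean())`.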

Data Collection with Web Scraping

2_Data Collection with Web Scraping.ipynb

Libraries or modules used: sys, requests, BeautifulSoup from bs4, re, unicodedata, pandas

  • The data is scraped from List of Falcon 9 and Falcon Heavy launches.
  • The website contains only the data about Falcon 9 launches.
  • First, the Falcon 9 launch Wiki page is requested from the URL and a BeautifulSoup object is created from the response of requests.get().
  • Next, all column/variable names are extracted from the HTML table header by using the find_all() function from BeautifulSoup.
  • A dataframe is then created with the extracted column names and entries filled with launch records extracted from table rows.
  • We end up with 121 rows or instances and 11 columns or features.

EDA with Pandas and Numpy

3_EDA.ipynb

Libraries or modules used: pandas, numpy

Functions from the Pandas and NumPy libraries such as value_counts() are used to derive basic information about the data collected, which includes:

  • The number of launches at each launch site
  • The number of occurrences of each orbit
  • The number of occurrences of each mission outcome

EDA with SQL

4_EDA with SQL.ipynb

Framework used: IBM DB2

Libraries or modules used: ibm_db

The data is queried using SQL to answer several questions about the data such as:

  • The names of the unique launch sites in the space mission
  • The total payload mass carried by boosters launched by NASA (CRS)
  • The average payload mass carried by booster version F9 v1.1

The SQL statements or functions used include SELECT, DISTINCT, AS, FROM, WHERE, LIMIT, LIKE, SUM(), AVG(), MIN(), BETWEEN, COUNT(), and YEAR().
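The three example questions above map onto short queries. The sketch below uses Python's built-in sqlite3 module as a stand-in for IBM DB2, with a hypothetical `launches` table, hypothetical column names, and made-up rows. Only the customer name NASA (CRS) and the booster version F9 v1.1 come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE launches ("
    "launch_site TEXT, customer TEXT, booster_version TEXT, payload_mass_kg REAL)"
)
# Made-up rows for illustration only
conn.executemany(
    "INSERT INTO launches VALUES (?, ?, ?, ?)",
    [
        ("Site A", "NASA (CRS)", "F9 v1.0", 500.0),
        ("Site A", "NASA (CRS)", "F9 v1.1", 2000.0),
        ("Site B", "Customer X", "F9 v1.1", 500.0),
        ("Site C", "Customer Y", "F9 FT", 5000.0),
    ],
)

# Names of the unique launch sites
sites = [row[0] for row in conn.execute(
    "SELECT DISTINCT launch_site FROM launches ORDER BY launch_site")]

# Total payload mass carried for NASA (CRS)
nasa_total = conn.execute(
    "SELECT SUM(payload_mass_kg) FROM launches WHERE customer = 'NASA (CRS)'"
).fetchone()[0]

# Average payload mass carried by booster version F9 v1.1
v11_avg = conn.execute(
    "SELECT AVG(payload_mass_kg) FROM launches WHERE booster_version LIKE 'F9 v1.1%'"
).fetchone()[0]

conn.close()
```

Function support differs between engines. For example, SQLite has no YEAR() function, so date handling would need to be adapted when moving between it and DB2.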

Data Visualization using Matplotlib and Seaborn

5_EDA Visualization.ipynb

Libraries or modules used: pandas, numpy, matplotlib.pyplot, seaborn

Functions from the Matplotlib and Seaborn libraries are used to visualize the data through scatterplots, bar charts, and line charts. The plots and charts are used to understand more about the relationships between several features, such as:

  • The relationship between flight number and launch site
  • The relationship between payload mass and launch site
  • The relationship between success rate and orbit type

Examples of functions from seaborn that are used here are scatterplot(), barplot(), catplot(), and lineplot().

Data Visualization using Folium

6_Interactive Visual Analytics with Folium lab.ipynb

Libraries or modules used: folium, wget, pandas, math

Functions from the Folium libraries are used to visualize the data through interactive maps. The Folium library is used to:

  • Mark all launch sites on a map
  • Mark the succeeded launches and failed launches for each site on the map
  • Mark the distances between a launch site and nearby features such as the nearest city, railway, or highway

These are done using functions from folium such as add_child() and folium plugins which include MarkerCluster, MousePosition, and DivIcon.
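Distances between a launch site and a nearby feature have to be computed from latitude and longitude before they can be drawn on the map. A common choice is the haversine great-circle formula, sketched below with the math module. Whether the notebook uses exactly this formula is an assumption, as are the function name and the mean Earth radius of 6371 km.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# One degree of longitude along the equator, as a sanity check
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

The resulting number can then be shown as a text label on the map, for instance next to a line drawn between the two coordinates.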

Data Visualization using Dash

7_spacex_dash_app.py

Libraries or modules used: pandas, dash, dash_html_components, dash_core_components, Input and Output from dash.dependencies, plotly.express

Functions from Dash are used to generate an interactive site where we can toggle the input using a dropdown menu and a range slider. Using a pie chart and a scatterplot, the interactive site shows:

  • The total successful launches from each launch site
  • The correlation between payload mass and mission outcome (success or failure) for each launch site

The application is launched on a terminal on the IBM Skills Network website.

Machine Learning Prediction

8_Machine Learning Prediction.ipynb

Libraries or modules used: pandas, numpy, matplotlib.pyplot, seaborn, sklearn

Functions from the Scikit-learn library are used to create our machine learning models. The machine learning prediction phase includes the following steps:

  • Standardizing the data using the preprocessing.StandardScaler() function from sklearn
  • Splitting the data into training and test data using the train_test_split function from sklearn.model_selection
  • Creating machine learning models, which include:
      • Logistic regression using LogisticRegression from sklearn.linear_model
      • Support vector machine (SVM) using SVC from sklearn.svm
      • Decision tree using DecisionTreeClassifier from sklearn.tree
      • K nearest neighbors (KNN) using KNeighborsClassifier from sklearn.neighbors
  • Fitting the models on the training set
  • Finding the best combination of hyperparameters for each model using GridSearchCV from sklearn.model_selection
  • Evaluating the models based on their accuracy scores and confusion matrix using the score() function and confusion_matrix from sklearn.metrics
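To make the evaluation step concrete, the sketch below computes an accuracy score and a 2x2 confusion matrix by hand for made-up binary labels (1 for a successful landing, 0 otherwise). It mirrors the layout that scikit-learn's confusion_matrix uses, with true classes as rows and predicted classes as columns, but it is a plain-Python stand-in and does not reproduce the notebook's code.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up test labels and predictions
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 1, 0, 1]

cm = confusion_matrix_2x2(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

In this toy case the only error is a false positive, which shows up in the top-right cell of `cm`. Two models with identical matrices on a test set necessarily have identical accuracy.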

Putting the results of all 4 models side by side, we can see that they all share the same accuracy score and confusion matrix when tested on the test set. Therefore, their GridSearchCV best scores are used to rank them instead. Based on the GridSearchCV best scores, the models are ranked in the following order with the first being the best and the last one being the worst:

  • Decision tree (GridSearchCV best score: 0.8892857142857142)
  • K nearest neighbors, KNN (GridSearchCV best score: 0.8482142857142858)
  • Support vector machine, SVM (GridSearchCV best score: 0.8482142857142856)
  • Logistic regression (GridSearchCV best score: 0.8464285714285713)

From the data visualization section, we can see that some features may be correlated with the mission outcome in several ways. For example, with heavy payloads the successful (positive) landing rate is higher for orbit types Polar, LEO and ISS. However, for GTO, we cannot distinguish this well, as both positive landings and negative landings (unsuccessful missions) are present.

Therefore, each feature may have a certain impact on the final mission outcome. The exact ways of how each of these features impact the mission outcome are difficult to decipher. However, we can use some machine learning algorithms to learn the pattern of the past data and predict whether a mission will be successful or not based on the given features.

In this project, we try to predict if the first stage of a given Falcon 9 launch will land in order to determine the cost of a launch. Each feature of a Falcon 9 launch, such as its payload mass or orbit type, may affect the mission outcome in a certain way.

Several machine learning algorithms are employed to learn the patterns of past Falcon 9 launch data and produce models that can predict the outcome of a Falcon 9 launch. The predictive model produced by the decision tree algorithm performed the best among the 4 machine learning algorithms employed.

~ Project created in January 2022 ~

Capstone Projects

Education is one of the pillars of the Data Science Institute.

Through educational activities, we strive to create a community in Data Science at Columbia. The capstone project is one of the most lauded elements of our MS in Data Science program. As a final step during their study at Columbia, our MS students work on a project sponsored by a DSI industry affiliate or a faculty member over the course of a semester.

Faculty-Sponsored Capstone Projects

A DSI faculty member proposes a research project and advises a team of students working on this project. This is a great way to run a research project with enthusiastic students, eager to try out their newly acquired data science skills in a research setting. This is especially a good opportunity for developing and accelerating interdisciplinary collaboration.

2023-2024 Academic Year: July 15, 2023 via this form

Project Archive

  • Spring 2022
  • Spring 2020
  • Spring 2019
  • Spring 2018
  • Spring 2016

Data Science Capstone Projects #18

by Ekaterina Butyugina

Cortexia: Sustainable Clean City - Darkzones Analytics

Students: Dominik Bacher, Valeriia Rutskaia

Results after the predictions

Talmis: Macroeconomic forecasting using machine learning methods

Students: Hussam Al-Homsi, Patrizia Will

  • First, they applied time-series clustering to group the 196 countries into clusters with similar historical trends/shapes of the respective MEV. 
  • Then, they performed statistical filtering using the Granger Causality Test to select countries with higher predictive power towards their targeted country per respective MEV (using p < 0.05). 
  • Finally, by applying a combination of Facebook’s additive model “Prophet” and the multivariate vector autoregressive model (VAR) they were able to stepwise predict the MEVs year by year.

Target Country GDP vs User Imputed GDP UK

  • Additional algorithms should be tested to expand and deepen the understanding of the resilience of the banks.
  • The global MEV data set should be enhanced and include quarterly data to allow for higher precision forecasting.
  • The approach does not include weighing in the trading relations between the countries. For instance, countries with stronger ties in global trade should receive more weight by the model than countries with lower mutual trade volume. This factor should be included as the next step in future models.

CancerDataNet: Time predictions for follow-up treatment in cancer patients

Students: Muchun Zhong, Jacques Stimolo, Ernest Mihelj

  • In the first part, they conducted research within the medical study documentation and the data to gain a better understanding of the data and hence to find anomalies in it. 
  • The second step was the cleaning of the data, where they removed the anomalous data and cleaned the data based on the missing rate. 
  • The final step was to take the final version of the dataset and create a synthetic replacement for the missing values (imputation). Muchun, Jacques and Ernest implemented different strategies to impute the data and compared the performance/accuracy of the prognostic models.

The Prognostic Models

360° Stock Prediction: Predicting the highest return stocks globally via robust KPIs and perceived company confidence

Students: Karim Khalil, Fernando Beato, Lukas Doboczky, Rafael Zack

Applied Data Science (MS) Student Capstone Projects

Case Analysis Capstone (ADS670) aims to develop both technical and soft skills that are not directly taught in the traditional courses in the program, but are relevant and critical in order to develop, innovate and communicate in modern data science. This is a project-oriented capstone that will harness the skills gained throughout the program.

Below are some examples of original research studies done by students in our master's in Applied Data Science program for their completed capstone projects.

Capstone Projects

The culminating experience in the Master’s in Applied Data Science program is a Capstone Project where you’ll put your knowledge and skills into practice. You will immerse yourself in a real business problem and will gain valuable, data-driven insights using authentic data. Together with project sponsors, you will develop a data science solution to address organization problems, enhance analytics capabilities, and expand talent pools and employment opportunities. Leveraging the university’s rich research portfolio, you also have the option to join a research-focused team.

Selected Capstone Projects

  • COPD Readmission and Cost Reduction Assessment
  • An NFL Ticket Pricing Study: Optimizing Revenue Using Variable and Dynamic Pricing Methods
  • Using Image Recognition to Identify Yoga Poses
  • Using Image Recognition to Measure the Speed of a Pitch
  • Real-Time Credit Card Fraud Detection

Interested in Becoming a Capstone Sponsor?

The Master’s in Applied Data Science program accepts projects year-round for placement at the beginning of every quarter, with the Spring quarter being the largest cohort. All projects must be submitted no later than one month prior to the beginning of the preferred starting quarter based on the UChicago academic calendar.

Capstone Sponsor Incentives

Sponsors derive measurable benefits from this unique opportunity to support higher education. Partner organizations propose real-world problems, untested ideas or research queries. Students review them from the perspective of data scientists trained to generate actionable insights that provide long-term value. Through the project, Capstone partners gain access to a symbiotic pool of world-class students, highly accomplished instructors, and cited researchers, resulting in optimized utilization of modern data science-based methods, using your data. Further, for many sponsors, the project becomes a meaningful source of recruitment through the excellent pool of students who work on your project.

Capstone Sponsor Obligations

While there is no monetary cost or contract necessary to sponsor a project, we do consider this a partnership. Teams of four students, guided by an instructor and a subject matter expert, receive expectations from the capstone sponsor as well as learning objectives, assignments, and evaluation requirements from instructors. In turn, Capstone partners should be prepared to provide the following:

  • A detailed problem statement with a description of the data and expected results
  • Two or more points of contact
  • Access to data relevant to the project by the first week of the applicable quarter
  • Engagement through regular meetings (typically bi-weekly) while classes are in session
  • If requested, a non-disclosure agreement that may be completed by the student team

Interested in Becoming a Capstone or Industry Research Partner?

Get in touch with us to submit your idea for a collaboration or ask us questions about how the partnership process works.

This course is about projects with real world data for students in data science. Prerequisite: (statistical) machine learning.

Instructors:

Time and place:

TuTh 3:00-4:20pm, Rm 5510, Lift 25-26, Zoom online, HKUST
Tutorial session: Tu 6:00-6:50pm, Rm 5510, Lift 25-26, Zoom online, HKUST

This term we will be using Piazza for class discussion. The system is highly catered to getting you help fast and efficiently from classmates and myself. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza. If you have any problems or feedback for the developers, email [email protected]. Find our class page at: https://piazza.com/ust.hk/spring2020/mafs6010u/home

Reference Textbooks

An Introduction to Statistical Learning, with applications in R (ISLR). By James, Witten, Hastie, and Tibshirani

ISLR-python, by Jordi Warmenhoven.

ISLR-Python: Labs and Applied, by Matt Caudill.

Manning: Deep Learning with Python, by Francois Chollet [GitHub source in Python 3.6 and Keras 2.0.8]

MIT: Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Kaggle Contest: Predict Survival on the Titanic.

Kaggle Contest: Home Credit Default Risk Prediction.

Kaggle Contest: Nexperia Image Classification (Second Stage, on-going).

Kaggle Contest: Nexperia Image Classification (First Stage, finished).

Tutorials: preparation for beginners

Python-Numpy Tutorials by Justin Johnson

scikit-learn Tutorials: An Introduction of Machine Learning in Python

Jupyter Notebook Tutorials

PyTorch Tutorials

Deep Learning: Do-it-yourself with PyTorch, a course at ENS

Tensorflow Tutorials

MXNet Tutorials

Theano Tutorials

statlearning-notebooks, by Sujit Pal: Python implementations of the R labs for the StatLearning: Statistical Learning online course from Stanford taught by Profs Trevor Hastie and Rob Tibshirani.

Homework and Projects:

TBA (To Be Announced)

Teaching Assistant:

Email: Mr. LIANG, Zhicong zliangak (add "AT connect DOT ust DOT hk" afterwards)

02/09/2021, Tue Lecture 01: History and Overview of Artificial Intelligence (Y.Y.)
07/09/2021, Tue Lecture 02: Supervised Learning: Linear Regression with Python (Y.Y.)
09/09/2021, Thu Lecture 03: Linear Classification with Python (Y.Y.)
14/09/2021, Tue Lecture 04: Project 1, Model Assessment and Selection I: Subset, Forward, and Backward Selection (Y.Y.)
  • Machine Learning for Survival Prediction of Passengers on the Titanic
  • Predict Survival on the Titanic
  • Machine Learning Basics Kaggle Contest: Home Credit Default Risk
  • Machine Learning Basics Kaggle Contest: Home Credit Default Risk
  • Model Selection and Regularization on Prediction of Survival on the Titanic
  • Supervised Classification with Full Spectrum or Premium Subset?
16/09/2021, Thu Lecture 05: Model Assessment and Selection II: Ridge, Lasso, and Principal Component Regression (Y.Y.)
21/09/2021, Tue Lecture 06: Decision Trees (Y.Y.)
23/09/2021, Thu Lecture 07: Bagging, Random Forests and Boosting (Y.Y.)
28/09/2021, Tue Lecture 08: Support Vector Machines I (Y.Y.)
30/09/2021, Thu Lecture 09: Support Vector Machines II (Y.Y.)
05/10/2021, Tue Lecture 10: An Introduction to Convolutional Neural Networks (Y.Y.)
07/10/2021, Thu Lecture 11: Examples of Convolutional Neural Networks (Y.Y.)
12/10/2021, Tue Lecture 12: Seminar (Y.Y.)
  • Machine Learning for Survival Prediction of Passengers on the Titanic
  • Predict Survival on the Titanic
19/10/2021, Tue Lecture 13: Seminar (Y.Y.)
  • Model Selection and Regularization on Prediction of Survival on the Titanic
  • Machine Learning Basics Kaggle Contest: Home Credit Default Risk
  • Supervised Classification with Full Spectrum or Premium Subset?
21/10/2021, Thu Lecture 14: Seminar and Project 2 (Y.Y.)
  • Machine Learning Basics Kaggle Contest: Home Credit Default Risk
  • Workers Supervision for Construction Safety
26/10/2021, Tue Lecture 15: An Introduction to Recurrent Neural Networks (RNN) (Y.Y.)
28/10/2021, Thu Lecture 16: Long-Short-Term-Memory (LSTM) (Y.Y.)
02/11/2021, Tue Lecture 17: Attention and Transformer (Y.Y.)
04/11/2021, Thu Lecture 18: BERT (Bidirectional Encoder Representations from Transformers) (Y.Y.)
[...] of the group reports that you reviewed, and please send your changes of ratings if any.
09/11/2021, Tue Lecture 19: An Introduction to Reinforcement Learning and Deep Q-Learning (Y.Y.)
11/11/2021, Thu Lecture 20: An Introduction to Reinforcement Learning: Policy Gradient and Actor-Critic Methods (Y.Y.)
16/11/2021, Tue Lecture 21: An Introduction to Unsupervised Learning: PCA, AutoEncoder, VAE, and GANs (Y.Y.)
  • The Disaster Tweets - Text Classification
  • Pawpularity Prediction
  • Limitations of Translation: How much translation affect the analysis of Chinese text in different models?
  • G-Research Crypto Forecasting (Kaggle)
  • Workers Supervision for Construction Safety
  • Comparison of Tree-based model and linear model on the Titanic dataset
  • Pawpularity Prediction Using Machine Learning with Tabular Metadata and Images
18/11/2021, Thu Lecture 22: Seminar (Y.Y.)
  • Limitations of Translation: How much translation affect the analysis of Chinese text in different models?
23/11/2021, Tue Lecture 23: Seminar (Y.Y.)
  • Pawpularity Prediction
  • Pawpularity Prediction Using Machine Learning with Tabular Metadata and Images
25/11/2021, Thu Lecture 24: Seminar (Y.Y.)
  • G-Research Crypto Forecasting (Kaggle)
  • Comparison of Tree-based model and linear model on the Titanic dataset
30/11/2021, Tue Lecture 25: Seminar (Y.Y.)
  • Workers Supervision for Construction Safety
  • The Disaster Tweets - Text Classification
Lecture 08: Bagging, Random Forests and Boosting (Y.Y.)
  • Predict survival on the Titanic
  • Titanic: Machine Learning from Disaster
  • Home Credit Default Risk
  • Logistic Regression Models on Titanic Survival Prediction
  • Predict Survival on the Titanic
  • What sorts of people were more likely to survive?
  • Machine Learning on predicting survival of Titanics passengers by using SVM model
10/11/2020, Tue Lecture 17: Topics in CNN: Visualization, Transfer Learning (Y.Y.)
12/11/2020, Thu Lecture 18: Topics in CNN: Visualization, Transfer Learning, Neural Style, and Adversarial Examples (Y.Y.)
05/19/2020, Tue Lecture 13: Tutorial on Reinforcement Learning in Quantitative Trading (Weizhi ZHU, A.W., Y.Y.)
[...] and Anthony Woo
  • NLP Chatbot for CogX Website
  • Financial Chatbot Based on LSTM
  • Chatbot Based on Prepared Answer Set
  • NLP Chatbot
  • M5 Forecasting with LightGBM
  • Kaggle - M5 Forecasting - Accuracy
  • M5 Forecasting Competition Report
  • Wal-Mart Sales Prediction using XGBoost Algorithm
  • Home Credit Default Risk
  • Estimate the Unit Sales of Walmart Retail Goods in the USA
  • TIME SERIES PREDICTION ON SALES OF WALMART
  • Estimate the unit sales of Walmart retail goods
  • Project of Artificial Intelligence in Finance
  • Smart Beta Trading Strategy Based on Fundamental Factors
03/8/2019, Fri Lecture 05: Tutorials (Yifei Huang; Katrina Fong; Anthony Woo)
03/15/2019, Fri Lecture 06: Topics in Blockchains (A.W., Y.Y.)
Speakers: Dr. Alex Yang, CEO, VEE Technology LLC, and Dr. Chen NING.
Abstract: This is a brief introduction to blockchain consensus and its current applications in finance, with a vision and outlook for blockchain in FinTech.
Bio: Dr. Alex Yang is a FinTech entrepreneur and investor with over 14 years of experience in banking and finance. VEE Technology is led by Sunny King, a legendary blockchain developer and the creator of Proof-of-Stake consensus. As CEO of VEE Tech, Alex is driving the project to solve the core scalability and stability problems in the development of the blockchain industry. His deep experience of the industry has been gained through his investing activity, where he has sponsored many world-leading blockchain foundations.
Prior to his role at VEE, Alex was the founder and CEO of Fund V, one of the first token funds to focus on blockchain companies and related investment opportunities. He was also the founding partner of Beam VC and CyberCarrier Capital, which together have successfully invested in over 30 startups in the TMT sector. Alex is a founding partner of Protoss Global Opportunity Fund, a fixed income hedge fund based in Hong Kong.
Prior to moving into venture capital investing, Alex was based in Hong Kong as head of APAC structured rates trading at Nomura International, and VP of exotic derivatives trading at UBS. He started his career as a quantitative developer at Jump Trading in Chicago.
Alex has a PhD from Northwestern University and a BA in Mathematics from Peking University.
05/03/2019, Fri Seminar: Investment Trends and FinTech Outlook (Chris Lee, A.W.)
Title: Sales and Trading Business in Global Investment Banks Ripe for Disruption by AI?
Speaker: Mr. Christopher Lee
Bio: Mr. Chris Lee is a partner at FAA Investments and a board director with expertise in financial markets, risk management, governance and leadership development. Currently, he serves as an Independent Board Member with Matthews Asia Funds (AUM: US$30.2 billion), the largest US investment company with a focus on Asia Pacific markets, and Asian Masters Fund, an investment company listed in Australia. Previously, Chris was an investment banker for 18 years, acting as Managing Director and divisional and regional head at Deutsche Bank AG, UBS Investment Bank and Bank of America Merrill Lynch. He worked in global capital markets, managed derivative products, and provided equity sales and trading functions to institutional investors. Academically, Chris is an associate professor of science practice at HKUST and teaches financial mathematics and risk management courses. He completed the AMP at Harvard University and holds a BS in Mechanical Engineering and an MBA from U.C. Berkeley.
05/10/2019, Fri Lecture 12: Tutorial on deep learning in Python (Yifei Huang)

Guide to the ALM Capstone Project

Data Science Capstone

This capstone course is the culmination of the Master of Liberal Arts, data science, where students execute their research proposal from CSCI S-597. It gives students the opportunity to collaborate on a complex research topic using their data science skills. At the completion of the capstone, students are able to demonstrate their ability to think critically about data, communicate with diverse audiences, and advance innovation in ways that benefit society.

Capstone Proposal Tutorial and Capstone Sequencing

The semester prior to capstone enrollment (no earlier), you register for the on-campus precapstone: CSCI E-597 Data Science Precapstone. Ordinarily the on-campus precapstone tutorial is offered during the three-week January session and one three-week summer session.

The Precapstone prepares students to explore interdisciplinary research topics from a variety of industries and areas. Through workshops and collaborating with experts from different disciplines, students identify research topics, apply the appropriate data science methods, and use data to advance innovative solutions. Students receive guidance and advising to work effectively in teams, refine project proposals, and build the domain knowledge necessary in their selected area. By the end of the course, each team submits a detailed research proposal, including project rationale, methods, and expected outcomes, which they intend to execute during CSCI E-599a.

The semester right after the precapstone, you enroll in the online capstone, CSCI E-599a Data Science Capstone, as your final one-and-only course, either in the fall or the spring. Due to the heavy demands of the capstone, it is considered a full-time course. All other degree requirements must be fulfilled so you can draw upon your entire ALM training to produce a final project worthy of a Harvard degree.

Sample Pathway

You need to complete 12 courses (48 credits) to earn the degree. 

  • You'll register for the precapstone in the summer as your 11th course. Then in the fall, you'll register for the capstone as your 12th and final course.
  • You'll register for the precapstone in the January term as your 11th course. Then in the spring, you'll register for the capstone as your 12th and final course.

Bruce Huang, EdD, PhD, Director of Master's Degree Program in Information Technology, Harvard Extension School

NYU Center for Data Science

Harnessing Data’s Potential for the World

Master’s in Data Science

CDS master’s students have a unique opportunity to solve real-world problems through the capstone course in the final year of their program. The capstone course is designed to put knowledge into practice and to develop critical skills such as problem-solving and collaboration.

Students are matched with research labs within the NYU community and with industry partners to investigate pressing issues, applying data science to the following areas:

  • Probability and statistical analyses
  • Natural language processing
  • Big Data analysis and modeling
  • Machine learning and computational statistics
  • Coding and software engineering
  • Visualization modeling
  • Neural networks
  • Signal processing
  • High dimensional statistics

Capstone projects present students with the opportunity to work in their field of interest and gain exposure to applicable solutions. Project sponsors, both NYU labs and external partners, in turn benefit from having a new perspective applied to their projects.

“Capstone is a unique opportunity for students to solve real world problems through projects carried out in collaboration with industry partners or research labs within the NYU community,” says capstone advisor and CDS Research Fellow Anastasios Noulas. “It is a vital experience for students ahead of their graduation and prior to entering the market, as it helps them improve their skills, especially in problem solving contexts that are atypical compared to standard courses offered in the curriculum. Cooperation within teams is another crucial skill built through the Capstone experience as projects are typically run across groups of 2 to 4 people.”

The Capstone Project offers the opportunity for organizations to propose a project that our graduate students will work on as part of their curriculum for one semester. Information on the course, along with a questionnaire to propose a project, can be found on the Capstone Fall 2024 Project Submission Form. If you have any questions, please reach out to [email protected].

Best Fall 2023 Capstone Posters

  • Multimodal NLP for M&A Agreements

Student Authors: Harsh Asrani, Chaitali Joshi, Tayyibah Khanam, Ansh Riyal | Project Mentors: Vlad Kobzar, Kyunghyun Cho

  • Partisan Bias and the US Federal Court System

Student Authors: Annabelle Huether, Mary Nwangwu, Allison Redfern | Project Mentors: Aaron Kaufman, Jon Rogowski

Best Fall 2023 Student Voted Posters

  • User-Centric AI Models for Assisting the Blind

Student Authors: Gail Batutis, Aradhita Bhandari, Aryan Jain, Mallory Sico | Project Mentors: Giles Hamilton-Fletcher, Chen Feng, Kevin C. Chan

  • Multi-Modal Foundation Models for Medicine

Student Authors: Yunming Chen, Harry Huang, Jordan Tian, Ning Yang | Project Mentors: Narges Razavian

Best Fall 2023 Student Voted Runner-Up Posters

  • Representational geometry of learning rules in neural networks

Student Authors: Ghana Bandi, Shiyu Ling, Shreemayi Sonti, Zoe Xiao | Project Mentors: SueYeon Chung, Chi-Ning Chou

  • Medical Data Leakage with Multi-site Collaborative Training

Student Authors: Christine Gao, Ciel Wang, Yuqi Zhang | Project Mentors: Qi Lei

Fall 2023 Capstone Project List

  • Segmentation of Metastatic Brain Tumors Using Deep Learning
  • Discovering misinformation narratives from suspended tweets using embedding-based clustering algorithms
  • Network Intrusion Detection Systems using Machine Learning
  • Knowledge Extraction from Pathology Reports Using LLMs
  • Building an Interactive Browser for Epigenomic & Functional Maps from the Viewpoint of Disease Association
  • Prediction of Acute Pancreatitis Severity Using CT Imaging and Deep Learning
  • User-centric AI models for assisting the blind
  • A machine learning model to predict future kidney function in patients undergoing treatment for kidney masses
  • Fine-Tuning of MedSAM for the Automated Segmentation of Musculoskeletal MRI for Bone Topology Evaluation and Radiomic Analysis
  • Online News Content Neural Network Recommendation Engine
  • Explanatory Modeling for Website Traffic Movements
  • Egocentric video zero-shot object detection
  • Leverage OncoKB’s Curated Literature Database to Build an NLP Biomarker Identifier
  • Improving Out-of-Distribution Generalization in Neural Models for Astrophysics and Cosmology
  • Preparing a Flood Risk Index for the State of Assam, India
  • Causal GANs
  • Bringing Structure to Emergent Taxonomies from Open-Ended CMS Tags
  • Social Network Analysis of Hospital Communication Networks
  • Multimodal Question Answering
  • Does resolution matter for transfer learning with satellite imagery?
  • Measuring Optimizer-Agnostic Hyperparameter Tuning Difficulty
  • Extracting causal political narratives from text
  • Designing Principled Training Methods for Deep Neural Networks
  • Multimodal NLP for M&A Agreements
  • Using Deep Learning to Solve Forward-Backward Stochastic Differential Equations
  • OptiComm: Maximizing Medical Communication Success with Advanced Analytics
  • Automated assessment of epilepsy subtypes using patient-generated language data
  • Predicting cancer drug response of patients from their alteration and clinical data
  • Identify & Summarize top key events for a given company from News Data using ML and NLP Models
  • Developing predictive shooting accuracy metric(s) for First-Person-Shooter esports
  • Supporting Student Success through Pipeline Curricular Analysis
  • Transformers for Electronic Health Records
  • Build Models for Multilingual Medical Coding
  • Metadata Extraction from Spoken Interactions Between Mothers and Young Children
  • Uncertainty Radius Selection in Distributionally Robust Portfolio Optimization
  • Unveiling Insights into Employee Benefit Plans and Insurance Dynamics
  • Advanced Name Screening and Entity Linking Using large language models
  • What Keeps the Public Safe While Avoiding Excessive Use of Incarceration? Supporting Data-Centered Decisionmaking in a DA’s Office
  • Foundation Models for Brain Imaging
  • Housing Price Forecasting – Alternative Approaches
  • Evaluating the Capability of Large Language Models to Measure Psychiatric Functioning
  • Predicting year-end success using deep neural network (DNN) architecture

Best Fall 2022 Capstone Posters

  • Leveraging Computer Vision to Map Cell Tower Locations to Enhance School Connectivity

Student Authors: Lorena Piedras, Priya Dhond, and Alejandro Sáez | Mentors: Iyke Derek Maduako (UNICEF)

  • Neural Re-Ranking for Personalized Home Search

Student Authors: Giacomo Bugli, Luigi Noto, Guilherme Albertini | Mentors: Shourabh Rawat, Niranjan Krishna, and Andreas Rubin-Schwarz

  • Sequence Modeling for Query Understanding & Conversational Search

Student Authors: Lucas Tao, Evelyn Wang, Jun Wang, Cecilia Wu | Mentors: Amir Rahmani, Arun Balagopalan, Shourabh Rawat, and Najoung Kim

  • Solving challenging video games in human-like ways

Student Authors: Brian Pennisi, Jiawen Wu, Adeet Patel, and Sarvesh Patki | Mentors: Todd Gureckis (NYU)

Best Fall 2022 Student Voted Posters

  • Deep Learning Framework for Segmentation of Medical Images

Student Authors: Luoyao Chen, Mei Chen, Jinqian Pan | Mentors: Jacopo Cirrone (NYU)

  • Galaxy Dataset Distillation

Student Authors: Xu Han, Jason Wang, Chloe Zheng | Mentors: Julia Kempe (NYU)

Best Fall 2022 Runner-Up Posters

  • Dementia Detection from FLAIR MRI via Deep Learning

Student Authors: Jiawen Fan, Aiqing Li | Mentors: Narges Razavian (NYU Langone)

  • Ego4d NLQ: Egocentric Visual Learning of Representations and Episodic Memory

Student Authors: Dongdong Sun; Rui Chen; Ying Wang | Mentors: Mengye Ren (NYU)

  • Learning User Representations from Zillow Search Sessions using Transformer Architectures

Student Authors: Xu Han, Jason Wang, Chloe Zheng | Mentors: Shourabh Rawat (Zillow Group)

  • Methane Emission Quantification through Satellite Images

Student Authors: Alex Herron, Dhruv Saxena, Xiangyue Wang | Mentors: Robert Huppertz (orbio.earth)

Fall 2022 Capstone Project List

  • Data Science for Clinical Decision-making Support in Radiation Therapy
  • Using Voter File Data to Study Electoral Reform
  • Creating an Epigenomic Map of the Heart
  • Career Recommendation
  • Calibrating for Class Weights
  • Assigning Locations to Detected Stops using LSTM
  • Impact of YMCA Facilities on the Local Neighborhoods of Bronx
  • Powering SMS Product Recommendations with Deep Learning
  • Evaluation and Performance Comparison of Two Models in Classifying Cosmological Simulation Parameters
  • Crypto Anomaly Detection
  • Sequence Modeling for Query Understanding & Conversational Search
  • Multi-Modal Graph Inductive Learning with CLIP Embeddings
  • Multimodal Contract Segmentation
  • Extraction of Causal Narratives from News Articles
  • Detecting Erroneous Geospatial Data
  • Improving Speech Recognition Performance using Synthetic Data
  • Multi-document Summarization for News Events
  • Multi-task learning in orthogonal low dimensional parameter manifolds
  • Let’s Go Shopping: An Investigation Into a New Bimodal E-Commerce Dataset
  • Training AI to recognize objects of interest to the blind community
  • Classify Classroom Activities using Ambient Sound
  • Database and Dashboard for RII
  • Bitcoin Price Prediction Using Machine Learning Models
  • Context Driven Approach to Detecting Cross-Platform Coordinated Influence Campaigns
  • Invalid Traffic Detection Model Deployment
  • Recalled Experiences of Death: Using Transformers to Understand Experiences and Themes
  • Context-Based Content Extraction & Summarization from News Articles
  • Neural Learning to Rank for Personalized Home Search
  • Improve Speech Recognition Performance Using Unpaired Audio and Text
  • Data Normalization & Generalization to Population Metrics
  • Automated Judicial Case Briefing
  • Cyber Threat Detection for News Articles
  • MLS Fan Segmentation
  • Near Real-Time Estimation of Beef and Dairy Feedlot Greenhouse Gas Emissions
  • Do Better Batters Face Higher or Lower Quality Pitches?

Previous Capstone Projects

Best Fall 2021 Capstone Posters

  • Question Answering on Long Context

Student Authors: Xinli Gu, Di He, Congyun Jin | Project Mentor: Jocelyn Beauchesne (Hyperscience)

  • Multimodal Self-Supervised Deep Learning with Chest X-Rays and EHR Data

Student Authors: Adhham Zaatri, Emily Mui, Yechan Lew | Project Mentor: Sumit Chopra (NYU Langone)

  • Head and Neck CT Segmentation Using Deep Learning

Student Authors: Pengyun Ding, Tianyu Zhang | Project Mentor: Ye Yuan (NYU Langone)

  • 3D Astrophysical Simulation with Transformer

Student Authors: Elliot Dang, Tong Li, Zheyuan Hu | Project Mentor: Shirley Ho (Flatiron Institute)

  • Multimodal Representations for Document Understanding (Best Student Voted Poster)

Student Authors: Pavel Gladkevich, David Trakhtenberg, Ted Xie, Duey Xu | Project Mentor: Shourabh Rawat (Zillow Group)

2021 Capstone Project List

  • Accelerated Learning in the Context of Language Acquisition
  • Analysis of Cardiac Signals on Patients with Atrial Fibrillation
  • Applications of Neural Radiance Fields in Astronomy
  • Automatic Detection of Alzheimer’s Disease with Multi-Modal Fusion of Clinical MRI Scans
  • Automatic Transcription of Speech on SAYCam
  • Automatic Volumetric Segmentation of Brain Tumor Using Deep Learning for Radiation Oncology
  • Automatically Identify Applicants Who Require Physician’s Reports
  • Building a Question-Answer Generation Pipeline for The New York Times
  • Coupled Energy-Based Models and Normalizing Flows for Unsupervised Learning
  • Data Classification Processing for Clinical Decision-making Support in Radiation Therapy
  • Deep Active Learning for Protest Detection
  • Estimating Intracranial Pressure Using OCT Scans of the Eyeball
  • Graph Neural Networks for Electronic Health Record (EHR) Data
  • Head and Neck CT Image Segmentation
  • Head Movement Measurement During Structural MRI
  • Image Segmentation for Vestibular Schwannoma
  • Investigation into the Functionality of Key, Query, Value Sub-modules of a Transformer
  • Know Your Worth: An Analysis of Job Salaries
  • Machine learning-based computational phenotyping of electronic health records
  • Modeling the Speed Accuracy Tradeoff in Decision-Making
  • Multi-modal Breast Cancer Detection
  • Multi-Modal Deep Learning with Medical Images and EHR Data
  • Multimodal Representations for Document Understanding
  • Nematode Counting
  • News Clustering and Summarization
  • Post-surgical resection mapping in epilepsy using CNNs
  • Predicting Grandstanding in the Supreme Court through Speech
  • Predicting Probability of Post-Colectomy Hospital Readmission
  • Prediction of Total Knee Replacement Using Radiographs and Clinical Risk Factors
  • Reinforcement Learning for Option Hedging
  • Representation Learning Regarding RNA-RBP Binding
  • Self-Supervised Learning of Medical Image Representations Using Radiology Reports
  • The Study of American Public Policy with NLP
  • Topical Aggregation and Timeline Extraction on the NYT Corpus
  • Unsupervised Deep Denoiser for Electron-Microscope Data
  • Using Deep Learning and FBSDEs to Solve Option Pricing and Trading Problems
  • Vision Language Models for Real Estate Images and Descriptions

Featured 2020 Capstone Projects

Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs

By Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Kuo, Samuel Thomas, Edmilson MoraisJain

Accented Speech Recognition Inspired by Human Perception

By Xiangyun Chu, Elizabeth Combs, Amber Wang, Michael Picheny

Diarization of Legal Proceedings. Identifying and Transcribing Judicial Speech from Recorded Court Audio

By Jeffrey Tumminia, Amanda Kuznecov, Sophia Tsilerides, Ilana Weinstein, Brian McFee, Michael Picheny, Aaron R. Kaufman

2020 Capstone Project List

  • 2D to 3D Video Generation for Surgery (Best Capstone Poster)
  • Action Primitive Recognition with Sequence to Sequence Models towards Stroke Rehabilitation
  • Applying Self-learning Methods on Histopathology Whole Slide Images
  • Applying Transformers Models to Scanned Documents: An Application in Industry
  • Beyond Bert-based Financial Sentimental Classification: Label Noise and Company Information
  • Bias and Stability in Hiring Algorithms (Best Capstone Poster)
  • Breast Cancer Detection using Self-supervised Learning Method
  • Catastrophic Forgetting: An Extension of Current Approaches (Best Capstone Poster)
  • ClinicalLongformer: Public Available Transformers Language Models for Long Clinical Sequences
  • Complication Prediction of Bariatric Surgery
  • Constraining Search Space for Hardware Configurations
  • D4J: Data for Justice to Advance Transparency and Fairness
  • Data-driven Diesel Insights
  • Deep Learning to Study Pathophysiology in Dermatomyositis
  • Detection Of Drug-Target Interactions Using BioNLP
  • Determining RNA Alternative Splicing Patterns
  • Developing a Data Ecosystem for Refugee Integration Insights
  • Diarizing Legal Proceedings
  • Estimating the Impact of the Home Health Value-Based Purchasing Model
  • Extracting economic sentiment from mainstream media articles
  • Food Trend Detection in Chinese Financial Market
  • Forecasting Biodiesel Auction Prices
  • Generative Adversarial Networks for Electron Microscope Image Denoising
  • Graph Embedding for Question Answering over Knowledge Graphs
  • Impact of NYU Wasserman Resources on Students’ Career Outcomes
  • Improving Accented Speech Recognition Through Multi-Accent Pre-Exposure
  • Improving Synthetic Image Generation for Better Object Detection
  • Learning-based Model for Super-resolution in Microscopy Imaging
  • Modeling Human Reading by a Grapheme-to-Phoneme Neural Network
  • Movement Classification of Macaque Neural Activity
  • New OXXO Store in Brazil and Revenue Prediction
  • Numerical Relativity Interpolations using Deep Learning
  • One Medical Passport: Predictive Obstructive Sleep Apnea Analysis
  • Online Student Pathways at New York University
  • Predicting YouTube Trending Video Project
  • Promotional Forecasting Model for Profit Optimization
  • Question Answering on Tabular Data with NLP
  • Raizen Fuel Demand Forecasting
  • Reach for the stars: detecting astronomical transients
  • Reverse Engineering the MOS 6502 Microprocessor
  • Selecting Optimal Training Sets
  • Synthesizing baseball data with event prediction pretraining
  • Train ETA Estimation for Rumo S.A.
  • Training a Generalizable End-to-End Speech-to-Intent Model
  • Utilizing Machine Learning for Career Advancement and Professional Growth

Best Fall 2019 Capstone Projects

Inferring the Topic(s) of Wikipedia Articles

By Marina Zavalina, Sarthak Agarwal, Chinmay Singhal, Peeyush Jain

Option Portfolio Replication and Hedging in Deep Reinforcement Learning

By Bofei Zhang, Jiayi Du, Yixuan Wang, Muyang Jin

Adversarial Attacks Against Linear and Deep-Learning Regressions in Astronomy

By Teresa Huang, Zacharie Martin, Greg Scanlon, Eva Wang | Mentors: Soledad Villar, David W. Hogg

2019 Capstone Project List

  • Adversarial Attacks Against Linear and Deep-learning Regressions in Astronomy
  • Automated Breast Cancer Screening
  • Automatic Legal Case Summaries
  • Cross-task Transfer Between Language Understanding Tasks in NLP
  • Dark Matter and Stellar Stream Detection using Deep Learned Clustering
  • Exploiting Google Street View to Generate Global-scale Data Sets for Training Next Generation Cyber-Physical Systems
  • Federated Incremental Learning
  • Fraud Detection in Monetary Transactions Between Bank Accounts
  • Guided Image Upsampling
  • Improving State of the Art Cross-Lingual Word-Embeddings
  • Latent Semantic Topics Distribution Over Web Content Corpus
  • Lease Renewal Probability Prediction
  • Machine Learning for Adaptive Fuzzy String Matching
  • Market Segmentation from Retailer Behavior
  • Modeling the Experienced Dental Curriculum from Student Data
  • Modelling NBA Games
  • Movie Preference Prediction
  • MRI Image Reconstruction
  • NLP Metalearning
  • Predict next sales office location
  • Predicting Stock Market Movements using Public Sentiment Data & Sequential Deep Learning Models
  • Predictive Maintenance Techniques
  • Reinforcement Learning for Replication and Hedging of Option
  • Self-supervised Machine Listening
  • Sentence Classification of TripAdvisor ‘Points-of-Interest’ Reviews
  • Simulating the Dark Matter Distribution of the Universe with Deep Learning
  • SMaPP2: Joint Embedding of User-content and Network Structure to Enable a Common coordinate that captures ideology, geography and user topic spectrum
  • Sparse Deconvolution Methods for Microscopy Imaging Data Analysis
  • Stereotype and Unconscious Bias in Large Datasets
  • Structuring Exploring and Exploiting NIH’s Clinical Trials Database
  • The Analysis, Visualization, and Understanding of Big Urban Noise Data
  • Unsupervised and Self-supervised Learning for Medical Notes
  • Unsupervised Generative Video Dubbing
  • Using Deep Generative Models to de-noise Noisy Astronomical Data

Featured Academic Capstone Projects

Deep Learning for Breast Cancer Detection

By Jason Phang, Jungkyu (JP) Park, Thibault Fevry, Zhe Huang, The B-Team

Brain Segmentation Using Deep Learning

By Team 22/7 | Chaitra V. Hegde | Advisor: Narges Razavian

Predict Total Knee Replacement Using MRI With Supervised and Semi-Supervised Networks

By Team Glosy: Hong Gao, Mingsi Long, Yulin Shen, and Jie Yang

Featured Industry Capstone Projects

accern logo

Determining where New York Life Insurance should open its next sales office

BK Nets logo

NBA Shot Prediction with Spatio-Temporal Analysis

Other past capstone projects.

  • Active Physical Inference via Reinforcement Learning
  • Deep Multi-Modal Content-User Embeddings for Music Recommendation
  • Fluorescent Microscopy Image Restoration
  • Learning Visual Embeddings for Reinforcement Learning
  • Offensive Speech Detection on Twitter
  • Predicting Movement Primitives in Stroke Patients using IMU Sensors
  • Recurrent Policy Gradients For Smooth Continuous Control
  • The Quality-Quantity Tradeoff in Deep Learning
  • Trend Modeling in Childhood Obesity Prediction
  • Twitter Food/Activity Monitor

Thesis/Capstone for Master's in Data Science | Northwestern SPS - Northwestern School of Professional Studies

Data Science

Capstone and thesis overview.

Capstone and thesis are similar in that they both represent a culminating, scholarly effort of high quality. Both should clearly state a problem or issue to be addressed. Both will allow students to complete a larger project and produce a product or publication that can be highlighted on their resumes. Students should consider the factors below when deciding whether a capstone or thesis may be more appropriate to pursue.

A capstone is a practical or real-world project that can emphasize preparation for professional practice. A capstone is more appropriate if:

  • you don't necessarily need or want the experience of the research process or writing a big publication
  • you want more input on your project, from fellow students and instructors
  • you want more structure to your project, including assignment deadlines and due dates
  • you want to complete the project or graduate in a timely manner

A student can enroll in MSDS 498 Capstone in any term. However, capstone specialization courses can provide a unique student experience and may be offered only twice a year. 

A thesis is an academic-focused research project with broader applicability. A thesis is more appropriate if:

  • you want to get a PhD or other advanced degree and want the experience of the research process and writing for publication
  • you want to work individually with a specific faculty member who serves as your thesis adviser
  • you are more self-directed, are good at managing your own projects with very little supervision, and have a clear direction for your work
  • you have a project that requires more time to pursue

Students can enroll in MSDS 590 Thesis as long as there is an approved thesis project proposal, identified thesis adviser, and all other required documentation at least two weeks before the start of any term.

From Faculty Director, Thomas W. Miller, PhD

“Capstone projects and thesis research give students a chance to study topics of special interest to them. Students can highlight analytical skills developed in the program. Work on capstone and thesis research projects often leads to publications that students can highlight on their resumes.”

A thesis is an individual research project that usually takes two to four terms to complete. Capstone course sections, on the other hand, represent a one-term commitment.

Students need to evaluate their options prior to choosing a capstone course section because capstones vary widely from one instructor to the next. There are both general and specialization-focused capstone sections. Some capstone sections offer individual research projects, others offer team research projects, and a few give students a choice of individual or team projects.

Students should refer to the SPS Graduate Student Handbook for more information regarding registration for either MSDS 590 Thesis or MSDS 498 Capstone.

Capstone Experience

If students wish to engage with an outside organization to work on a project for capstone, they can refer to this checklist and lessons learned for some helpful tips.

Capstone Checklist

  • Start early — set aside a minimum of one to two months prior to the capstone quarter to determine the industry and modeling interests.
  • Networking — pitch your idea to potential organizations for projects and focus on the business benefits you can provide.
  • Permission request — make sure your final project can be shared with others in the course and the information can be made public.
  • Engagement — engage with the capstone professor prior to and immediately after getting the dataset to ensure appropriate scope for the 10 weeks.
  • Teambuilding — recruit team members who have similar interests for the type of project during the first week of the course.

Capstone Lessons Learned

  • Access to company data can take longer than expected; not having this access before or at the start of the term can severely delay progress
  • The project timeline should align with the coursework timeline as closely as possible
  • Designate one point of contact (POC) to face the business, which keeps messaging streamlined and makes time with the organization more effective
  • Manage expectations on both sides: for the business, the work is pro bono; for students, it does not guarantee internship or job opportunities
  • Data security or masking that is not completed in time can jeopardize the opportunity entirely
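
As a rough illustration of the masking step mentioned in the last point, the sketch below replaces sensitive fields with stable pseudonyms before a dataset is shared with a capstone team. It is only one possible approach, not a procedure prescribed by the program: the function names `mask_value` and `mask_records`, the field names, and the secret key are all hypothetical, and a sponsor's own security policy would take precedence.

```python
import hashlib
import hmac


def mask_value(value, secret_key):
    """Return a stable 16-character pseudonym for a sensitive value (keyed HMAC-SHA256)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_records(records, sensitive_fields, secret_key):
    """Return copies of the records with the sensitive fields pseudonymized."""
    masked = []
    for row in records:
        masked.append({
            field: mask_value(str(value), secret_key) if field in sensitive_fields else value
            for field, value in row.items()
        })
    return masked


# Hypothetical example data; the key would be kept by the sponsor, not the students.
rows = [
    {"customer_id": "C-1001", "balance": 2500},
    {"customer_id": "C-1001", "balance": 1800},
]
shared = mask_records(rows, {"customer_id"}, secret_key=b"sponsor-held-secret")
print(shared)
```

Because the same input always maps to the same pseudonym, records for one customer can still be joined after masking, while the original identifier never leaves the organization.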

Publication of Work

Northwestern University Libraries offers an option for students to publish their master’s thesis or capstone in Arch, Northwestern’s open access research and data repository.

Benefits of publishing your thesis:

  • Your work will be indexed by search engines and discoverable by researchers around the world, extending your work’s impact beyond Northwestern
  • Your work will be assigned a Digital Object Identifier (DOI) to ensure perpetual online access and to facilitate scholarly citation
  • Your work will help accelerate discovery and increase knowledge in your subject domain by adding to the global corpus of public scholarly information

Get started:

  • Visit Arch online
  • Log in with your NetID
  • Describe your thesis: title, author, date, keywords, rights, license, subject, etc.
  • Upload your thesis or capstone PDF and any related supplemental files (data, code, images, presentations, documentation, etc.)
  • Select a visibility: Public, Northwestern-only, Embargo (i.e. delayed release)
  • Save your work to the repository

Your thesis manuscript or capstone report will then be published on the MSDS page. You can view other published work here.

For questions or support in publishing your thesis or capstone, please contact [email protected].

Vertical Institute

Featured Student Projects

Bank Loan Payment Analysis

Data Analytics Capstone Project by Ng Shao Zhi

Bank Marketing Campaign

Data Analytics Capstone Project by Nur Filzah Bte Jusmani

Bank Customer Identifying Analysis

Data Analytics Capstone Project by Lim Shue Ling

Credit Default Risk Analysis

Data Analytics Capstone Project by Jermaine Lee

Analyzing Customizing Solutions

Data Analytics Capstone Project by Jasmine Teo

Insurance Analysis

Data Analytics Capstone Project by Claudia Lim

Insurance Fraud Analysis

Data Analytics Capstone Project by Michelle Eng

Credit Card Attrition

Data Analytics Capstone Project by Lee Ying Chia

Reducing Fraudulent Claims Analysis

Data Analytics Capstone Project by Lim Si Xian

Entry Points Analysis

Data Analytics Capstone Project by Zeph Han

Credit Card Customers

Data Analytics Capstone Project by Marissa Goh

Customer Service Analysis

Retaining Customer Analysis

Data Analytics Capstone Project by Joel Lim

Analysis on Wealth Management

Data Analytics Capstone Project by Tam Jie Qi

Customer Retention Analysis

Data Analytics Capstone Project by Su Wei Ng

Credit Card Department

Data Analytics Capstone Project by Tan Jin Hui

Bank Service Analysis

Cryptocurrency Strategy

Credit Card Fraud Prediction

Data Analytics Capstone Project by Charmaine Neo

Ireland Loan Default Analysis

Data Analytics Capstone Project by Sophia Lim

US Credit Card Fraud Report

Data Analytics Capstone Project by Lili Loi

Online Banking Scams

Data Analytics Capstone Project by Felicia Chua

Identifying Customer Segments

Data Analytics Capstone Project by Joey Tan

Credit Customer Attrition

Health Insurance Analysis

Data Analytics Capstone Project by Victoria Leong

European Bank Customer Retention

Data Analytics Capstone Project by Michelle Leong

S&P 500 Exchange-Traded Fund (ETF)

Data Analytics Capstone Project by Daniel C Lim

How to Increase Credit Card Retention Rate

Data Analytics Capstone Project by Eugina Pek

Products streamline for our customers

Data Analytics Capstone Project by Tracy Bay

Predicting Annual Premiums for Customers

Data Analytics Capstone Project by Suhashini

Home Loan Eligibility Analysis

Data Analytics Capstone Project by Jasmine Yeo

Cedars-Sinai

Capstone Project

To complete the MSHS program, students spend 12 months applying classroom theory to the subject of their choosing and produce a presentation about the experience.

Educational Objectives

  • Demonstrate proficiency in applying health systems science (HSS) academic theory to pragmatic, applied problem-solving
  • Appreciate how HSS requires team science, shared decision-making among diverse stakeholders and strong interpersonal communication skills
  • Utilize the scientific method to solve HSS problems, including hypothesis generation, literature search, approaches to quantification, and presentation of results to leadership
  • Become proficient in oral and written communication of HSS analyses and results, learning how to tell a story with data in a way that engages stakeholders and ultimately leads to improved healthcare

Additional field-based credits from the capstone program accrue over the final 12 months of the 20-month MSHS program. The capstone project may be conducted on campus at Cedars-Sinai or in other approved healthcare organizations.

Students work on their capstone project on a schedule agreed upon with their primary mentor. Journal clubs, mentorship meetings and other program events may occur during typical work hours and are available via web conferencing for off-site students.

The MSHS faculty believes it is vital to expose students to a wide range of learning experiences; success in health systems science requires not only a strong theoretical basis but also pragmatic experiential learning to solidify classroom theory.

Capstone Project: Series Overview

The HSS 204 series includes four lockstep courses that build upon one another and culminate in completion of the capstone, delivery of a final written report and oral presentation of the report to Cedars-Sinai leaders. The sequence is as follows:

Students attend a biweekly seminar in healthcare leadership, where they develop a framework to assume a leadership role in the capstone project as a model for leadership in future organizations. The seminar series consists of highly engaging, interactive didactic sessions that promote discussion and learner engagement. During each session, leaders from diverse areas of the organization share their experience and expertise. Additionally, students prepare presentations based on assigned reading materials, which are followed by interactive discussions about leadership and personal development.

Project identification and literature review. Students will work with their mentor and an assigned peer-partner to identify an area of opportunity within a healthcare organization that they wish to analyze for their capstone project. They will also perform a literature search to familiarize themselves with the subject. The course culminates in a formal work-in-progress presentation to the course directors and other students.

Stakeholder analysis and development of quantitative analysis plan. Students will identify relevant stakeholders for their project and will perform stakeholder interviews. Students will also develop a plan for the quantitative analysis they will perform during the final step of their capstone project. The course culminates in a formal work-in-progress presentation to the course directors and other students.

Quantitative analysis and final report. Students will perform a quantitative analysis of their choosing, such as a cost-effectiveness analysis, an analysis of existing data, or a meta-analysis. The course culminates in a formal written and oral presentation of the proposal to a committee composed of the program leaders and selected health system leaders. The presentation is open to the entire health system in a public-facing, large-scale forum.
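
To make the cost-effectiveness option above concrete, here is a minimal sketch of the standard incremental cost-effectiveness ratio (ICER): the extra cost of a new intervention divided by its extra effect relative to the current one. The function name `icer` and all the numbers are hypothetical and are not drawn from any Cedars-Sinai project.

```python
def icer(cost_new, effect_new, cost_old, effect_old):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect."""
    delta_cost = cost_new - cost_old
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        # With no difference in effect the ratio is undefined.
        raise ValueError("Effects are equal; the ratio is undefined.")
    return delta_cost / delta_effect


# Hypothetical inputs: cost per patient in dollars, effect in
# quality-adjusted life years (QALYs).
ratio = icer(cost_new=12000, effect_new=6.5, cost_old=9000, effect_old=6.0)
print(f"ICER: ${ratio:,.0f} per additional QALY")
```

In practice the resulting ratio is compared against a willingness-to-pay threshold chosen by the decision-maker, which is outside the scope of this sketch.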

Follow the link to see examples of student projects.

Capstone Project Examples

Students at Cedars-Sinai have access to hundreds of potential capstone projects throughout the health system. Examples of recent student projects include:

  • Program Evaluation of Clinical Decision Support to Prevent Inappropriate Resuscitations
  • Cedars-Sinai Patient Experience and Health Equity Analysis using HCAHPS Survey Scores
  • Reducing Intraoperative Costs Through Supply Standardization and Tray Consolidation
  • Pain Management with Early Regional Anesthesia in Geriatric Hip Fracture Patients
  • Pharmacy Transitions of Care: A Discharge Reconciliation Pilot
  • Machine Learning in Labor and Delivery

Have Questions or Need Help?

If you have questions or wish to learn more about the MSHS program, please contact:

Graduate School of Biomedical Sciences, 8687 Melrose Ave., Suite G-532, West Hollywood, CA 90069

Capstone Projects

Online M.S. in Data Science students are required to complete a capstone project. Capstone projects challenge students to acquire and analyze data to solve real-world problems. Project teams consist of two to four students and a faculty advisor. Teams select their capstone project in term 4 and work on the project in term 5, which is their final term.

Most projects are sponsored by an organization—academic, commercial, non-profit, and government—seeking valuable recommendations to address strategic and operational issues. Depending on the needs of the sponsor, teams may develop web-based applications that can support ongoing decision-making. The capstone project concludes with a paper and presentation.

Key takeaways:

  • Synthesizing the concepts you have learned throughout the program in various courses (this requires that the question posed by the project be complex enough to require the application of appropriate analytical approaches learned in the program and that the available data be of sufficient size to qualify as ‘big’)
  • Experience working with ‘raw’ data exposing you to the data pipeline process you are likely to encounter in the ‘real world’  
  • Demonstrating oral and written communication skills through a formal paper and presentation of project outcomes  
  • Acquisition of team building skills on a long-term, complex, data science project 
  • Addressing an actual client's need by building a data product that can be shared with the client

Capstone projects have been sponsored by a variety of organizations and industries, including: Capital One, City of Charlottesville, Deloitte Consulting LLP, Metropolitan Museum of Art, MITRE Corporation, a multinational banking firm, The Public Library of Science, S&P Global Market Intelligence, UVA Brain Institute, UVA Center for Diabetes Technology, UVA Health System, U.S. Army Research Laboratory, Virginia Department of Health, Virginia Department of Motor Vehicles, Virginia Office of the Governor, Wikipedia, and more.

Sponsor a Capstone Project  

View previous examples of capstone projects  and check out answers to frequently asked questions. 

What does the process look like?

  • The School of Data Science periodically puts out a Call for Proposals. Prospective project sponsors submit official proposals, vetted by the Associate Director for Research Development, Capstone Director, and faculty.
  • Sponsors present their projects to students at “Pitch Day” during Semester 4, where students have the opportunity to ask questions.
  • Students individually rank their top project choices. An algorithm sorts students into capstone groups of approximately 3 to 4 students per group.
  • Adjustments are made by hand as necessary to finalize groups.
  • Each group is assigned a faculty mentor, who will meet groups each week in a seminar-style format.  
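
The source does not say which algorithm is used to sort students into groups, so the following is only a hypothetical sketch of how ranked choices could be turned into capped groups: a simple greedy pass that gives each student their highest-ranked project that still has room. The function name `assign_groups`, the capacity of four, and the sample names are assumptions for illustration.

```python
from collections import defaultdict


def assign_groups(rankings, capacity=4):
    """Greedy sketch: place each student in their best-ranked project with space left.

    rankings: dict mapping student -> list of project names, best first.
    capacity: maximum number of students per project (assumed value).
    Returns (groups, unassigned): a dict of project -> students, and a list of
    students whose ranked projects were all full.
    """
    groups = defaultdict(list)
    unassigned = []
    for student, preferences in rankings.items():
        for project in preferences:
            if len(groups[project]) < capacity:
                groups[project].append(student)
                break
        else:
            # Every ranked project was full; leave this student for manual adjustment.
            unassigned.append(student)
    return dict(groups), unassigned


# Hypothetical rankings for four students and two projects.
rankings = {
    "Ana": ["Air Quality", "Museum Data"],
    "Ben": ["Air Quality", "Museum Data"],
    "Cy": ["Air Quality", "Museum Data"],
    "Dee": ["Museum Data", "Air Quality"],
}
groups, unassigned = assign_groups(rankings, capacity=2)
print(groups)
print(unassigned)
```

A greedy pass favors whoever is processed first, which is one reason a step for adjusting groups by hand, as described above, is useful; an optimization-based matching would be a fairer but more involved alternative.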

What is the seminar approach to mentoring capstones?

We utilize a seminar approach to managing capstones to provide faculty mentorship and streamlined logistics. This approach involves one mentor supervising three to four loosely related projects and meeting with these groups on a regular basis. Project teams often encounter similar roadblocks and issues so meeting together to share information and report on progress toward key milestones is highly beneficial.

Do all capstone projects have corporate sponsors?

Not necessarily. Generally, each group works with a sponsor from outside the School of Data Science. Some sponsors are corporations, some are from nonprofit and governmental organizations, and some are from other departments at UVA.

One of the challenges we continue to encounter when curating capstone projects with external sponsors is appropriately scoping and defining a question that is of sufficient depth for our students, obtaining data of sufficient size, obtaining access to the data in sufficient time for adequate analysis to be performed and navigating a myriad of legal issues (including conflicts of interest). While we continue to strive to use sponsored projects and work to solve these issues, we also look for ways to leverage openly available data to solve interesting societal problems which allow students to apply the skills learned throughout the program. While not all capstones have sponsors, all capstones have clients. That is, the work is being done for someone who cares and has investment in the outcome. 

Why do we have to work in groups?

Because data science is a team sport!

All capstone projects are completed as group work. While this requires additional coordination, this collaborative component of the program reflects the way companies expect their employees to work. Building this skill is one of our core learning objectives for the program.

I didn’t get my first choice of capstone project from the algorithm matching. What can I do?

Remember that the point of the capstone projects isn’t the subject matter; it’s the data science. Professional data scientists may find themselves in positions in which they work on topics assigned to them, but they use methods they enjoy and still learn much through the process. That said, there are many ways to tackle a subject, and we are more than happy to work with you to find an approach to the work that most aligns with your interests.

Why don’t we have a say in the capstone topics?

Your ability to influence which project you work on is in the ranking process after “pitch day” and in encouraging your company or department to submit a proposal during the Call for Proposal process. At a minimum it takes several months to work with a sponsor to adequately scope a project, confirm access to the data and put the appropriate legal agreements into place. Before you ever see a project presented on pitch day, a lot of work has taken place to get it to that point!

Can I work on a project for my current employer?

Each spring, we put forward a public call for capstone projects. You are encouraged to share this call widely with your community, including your employer, non-profit organizations, or any entity that might have a big data problem that we can help solve. As a reminder, capstone projects are group projects so the project would require sufficient student interest after ‘pitch day’. In addition, you (the student) cannot serve as the project sponsor (someone else within your employer organization must serve in that capacity).

If my project doesn’t have a corporate sponsor, am I losing out on a career opportunity?

The capstone project will provide you with the opportunity to do relevant, high-quality work which can be included on a resume and discussed during job interviews. The project paper and your code on GitHub will provide more career opportunities than the sponsor of the project. Although it does happen from time to time, it is rare that capstones lead to a direct job offer with the capstone sponsor's company. Capstone projects are just one networking opportunity available to you in the program.

Capstone Project Reflections From Alumni

“For my Capstone project, I used Python to train machine learning models for visual analysis – also known as computer vision. Computer vision helped my Capstone team analyze the ergonomic posture of workers at risk of developing musculoskeletal injuries. We automated the process, and hope our work further protects the health and safety of American workers.” — Theophilus Braimoh, MSDS Online Program 2023, Admissions Student Ambassador

“My Capstone experience with the ALMA Observatory and NRAO was a pivotal chapter in my UVA Master’s in Data Science journey. It fostered profound growth in my data science expertise and instilled a confidence that I'm ready to make meaningful contributions in the professional realm.” — Haley Egan, MSDS Online Program 2023, Admissions Student Ambassador

“Our Capstone projects gave us the opportunity to gain new domain knowledge and answer big data questions beyond the classroom setting.”  — Mina Kim, MSDS Residential Program 2023, Ph.D. in Psychology Candidate

Capstone Project Reflections From Sponsors

“For us, the level of expertise, and special expertise, of the capstone students gives us ‘extra legs’ and an extra push to move a project forward. The team was asked to provide a replicable prototype air quality sensor that connected to the Cville Things Network, a free and community supported IoT network in Charlottesville. Their final product was a fantastic example that included clear circuit diagrams for replication by citizen scientists.” — Lucas Ames, Founder, Smart Cville
“Working with students on an exploratory project allowed us to focus on the data part of the problem rather than the business part, while testing with little risk. If our hypothesis falls flat, we gain valuable information; if it is validated or exceeded, we gain valuable information and are a few steps closer to a new product offering than when we started.” — Ellen Loeshelle, Senior Director of Product Management, Clarabridge

Advanced Certificate Program in Data Science and AI

Future of Data Science and AI

Market Size $521.3B

According to Yahoo Finance, the global artificial intelligence (AI) market is expected to reach $521.3 billion by 2028

11.5M New Jobs

According to the U.S. Bureau of Labor Statistics, there will be around 11.5 million new jobs for Data Science professionals by 2026

$119,563/year

The national average salary of a data scientist is $119,563 per year, according to the U.S. Bureau of Labor Statistics.

About the Data Science Program by IIT Guwahati

E&ICT Academy, IIT Guwahati, and Edureka have collaborated to create this Advanced Certificate Program in Data Science and AI. The Indian Institute of Technology, Guwahati, is one of the top 10 engineering institutions as per the latest NIRF rankings and among the top 150 universities in Asia.

This Data Science Program by IIT Guwahati aims to turn you into a highly skilled professional and help you land a high-paying job in Data Science and AI. The program offers hands-on exposure to industry-level assignments and projects.

Get E&ICT Academy, IIT Guwahati Accredited Certification

  • Instructor-Led Live Online Classes
  • 450+ Hours of Intensive Learning
  • 9+ Projects, 20+ Case Studies, 100+ Demos
  • Placement Assistance
  • High Quality Lab Environment

Certificate

After completing the Advanced Certificate Program in Data Science and AI, you will receive an industry-recognized Certificate from the E&ICT Academy, Indian Institute of Technology Guwahati (IITG), and Edureka.

Who Can Apply for the AI and Data Science Program by IIT Guwahati?

  • Should have studied PCM in 10+2
  • Any undergraduate degree holder like BCA, B.Tech, B.E, B.Sc etc.
  • Any Diploma holder with basic programming knowledge can also apply

Data Science and AI Career Opportunities

Advantages of Edureka's AI and Data Science Program by IIT Guwahati

WHAT I NEED | EDUREKA ADVANTAGE
… and exposure relevant to the current industry/job needs | Live instructor-led online classes by industry experts
… and help to make progress in my learning | Dedicated technical and non-technical teams to resolve all your doubts
… in India, relevant to the current industry/job needs | Curriculum designed by Edureka and approved by E&ICT Academy, IIT Guwahati
…, both regular and one-on-one | 24x7 subject matter expert support for your technical and non-technical doubts
… even after the course completion | Lifetime access to our Learning Management System and Edureka Support
… which are highly recognized in today's job market | Globally recognized certificate from E&ICT Academy, IIT Guwahati, and Edureka
… to get a relevant job in a top company | Complete access to our Career Assistance Services
…, unlike any other online platform | A unique learning ecosystem to give you an offline-like immersive experience

Data Science and AI Career Assistance Services

300+ CAS Sessions

Edureka has conducted 300+ Career Mentoring & CAS Webinars for active program batches

180+ Sessions

Conducted 180+ learner-focused training sessions, including mock interview sessions and career monitoring sessions

Get access to partner-powered job search portal

Our Edureka AI and Data Science Program Alumni Work At

Edureka has placed thousands of students in various top tech companies. Witnessing the progress of our alumni gives us immense gratification. Their successful journey into the professional world is an inspiration for all of us, and we take pride in it. Edureka alumni are working with

...

AI and Data Science Program Review

Shanthababu Pandian

Sr. Data & Analytics Technical Delivery Manager

Actually, since 2016, I have been upskilling myself with Edureka on cutting-edge technologies. Each course has really helped my career path. Certainly, I should mention that the Data Science and AI program's affiliation with IIT Guwahati is a great milestone. It was an amazing experience with respect to the course modules, the flow, and the way it was designed, especially for working professionals like me. Edureka conducted each module with the right expectations, with precious content, and with inputs from industry experts. On top of that, the evaluation process for each module was much appreciated, since it extracts the actual essence of an individual's understanding of the subject matter.

Rajeshwar Reddy

This is one of the best-designed courses, very informative and well paced. I was very pleased with the course. Another favorite part of the course is the online labs. This course is suitable for everyone; it's a good mix of theory and practical exercises, and the sequence of starting straight away with Python, ending with the Industry Project, and going deeper along the way was a very good way of teaching. I would recommend this course to everyone.

Asmita Dakore

Junior Data scientist

The LMS was very structured, handy, and easy to access, and the course material is good. It was a good experience, and Edureka gave me the entire classroom experience, which is difficult to achieve in online training.

Instructors and Mentors

Sr Principal Data Scientist

Ankit is a Data Science professional with 12+ years of experience. An alumnus of IIT Bombay, he has mentored various startups and taught ML concepts to more than 250 students. He has published articles in top-notch peer-reviewed journals and magazines, and his paper published in Wiley Publications was recognized as the most downloaded paper of 2018.

Sanjay Hingorani

Former Vice-President

Sanjay, former VP, Decision Science at BNY Mellon Technologies, is a Certified Blockchain Solution Architect with 22+ years of experience in the information technology and financial services industry, with an emphasis on aligning digital IT strategy and business strategy. An alumnus of Shivaji University, he is also a respected global senior leader with a proven record of driving technology innovation for business growth.

Awanish Golwara

Lead Platform Engineer

Awanish has 18+ years of experience in the IT industry. An alumnus of IIT Kanpur and the Georgia Institute of Technology, he has held several administrative posts at IIT Kanpur. He is currently the lead platform engineer at FogHorn Inc, a company that offers fog computing, analytics, machine learning, monitoring, and other related solutions.

Shirish Singh

Decision Science and Advanced Analytics

Shirish is an analytics and consulting professional who has excelled in the technology, media, telecom, and CPG industries. He brings 18+ years of data science experience in large-scale implementations in areas such as Natural Language Processing (NLP), image classification, and device error prediction.

Sandeep Sharma

Passionate about customer analytics and solutions for banking, retail, and healthcare, Sandeep is a former Vice-President, Products at Eye Care Leaders. An alumnus of IIT Bombay with 20+ years of global experience in consulting, products, and development, he has established revenue-generating products worth hundreds of millions of dollars.

Girijesh Prasad

Senior Manager Data ASG

A Data Scientist with 8+ years of experience, Girijesh has built his career at reputed IT and software companies such as NetApp and Oracle. His technical skills as a DevOps Python engineer span Python and bash, OOP in Python, Linux, Unix, VMware ESX, and automation on Data ONTAP.

Manager - Analytics and Machine learning

Ark has 7+ years of industry experience in Data Science, Deep Learning, Artificial Intelligence, Machine Learning, Data Analysis, Risk Analytics, Business Consulting, Statistical Modeling, and Credit Risk Modeling. His expertise lies in GPU acceleration of deep learning models using NVIDIA GPUs and TensorRT.

AI and Data Science Program’s Curriculum

Our AI and Data Science Program syllabus is balanced to meet the needs of both beginners and Data Science and AI experts who want to upskill for current industry and market needs and land a high-paying job.

Python for Data Science and AI

  • Introduction to Data Science
  • Data Collection and Cleaning
  • Python Fundamentals
  • Control Flow and Functions
  • Array Computations using NumPy
  • Data Manipulation using Pandas
  • Visualizing Data using Matplotlib and Seaborn
  • Web Scraping (Self-paced)
  • End-Course Assessment

Predictive Analytics

  • Introduction to Statistical Analysis
  • Exploratory Data Analysis
  • Introduction to Probability
  • Probability Distribution Functions
  • Inferential Statistics - I
  • Inferential Statistics - II
  • Regression (Self-paced)

Machine Learning

  • Introduction to Machine Learning
  • Supervised Learning - Regression
  • Evaluating Regression Models
  • Supervised Learning - Classification
  • Decision Tree and Random Forest Models
  • Mathematical and Bayesian Models
  • Dimensionality Reduction
  • Unsupervised Learning using Clustering
  • Model Evaluation and Hyperparameter Tuning
  • Model Boosting and Optimization
  • Association Rule Mining and Recommendation Engines (Self-paced)
  • Time Series Analysis (Self-paced)
  • End Course Assessment
  • Mid Program Project

Machine Learning on Cloud

  • Cloud Computing and AWS Foundations
  • Conversational AI Development with SageMaker, Lex and Polly
  • Image and Data Analysis with AWS Rekognition, Textract, and QuickSight
  • Machine Learning Model Deployment in AWS

Natural Language Processing

  • Introduction to NLP
  • Text Pre-processing
  • Analyzing Sentence Structure
  • Text Classification
  • Building a Resume Classifier (Self-paced)
  • Building an Intent-Based Rasa Chatbot (Self-paced)
  • NLP in Production (Self-paced)

Deep Learning

  • Introduction to Deep Learning
  • Getting Started with TensorFlow 2.0 and TensorBoard
  • Neural Networks with TensorFlow 2.x
  • Deep Learning for Images using CNN
  • TensorFlow Hub for Object Detection using Faster RCNN
  • Object Detection Using OpenCV - Part 1
  • Object Detection Using OpenCV - Part 2
  • Deep Learning for Sequences using RNN (Self-paced)

Data Warehousing and Big Data Storage

  • Data Warehousing
  • Data Integration and ETL
  • Getting Started with Big Data
  • Storing Big Data in a Distributed Cluster
  • Data Mining (Self-paced)
  • Frequent Pattern Mining (Self-paced)
  • Data Ingestion in Hadoop using Sqoop and Flume (Self-paced)
  • Spark’s Big Data Engine and RDD Concepts (Self-paced)
  • Relational Data Processing with Spark - Spark SQL (Self-paced)
  • Machine Learning with Spark - Spark ML (Self-paced)

Data Visualization using Tableau

  • Data Connection and Visualization in Tableau
  • Calculations in Tableau
  • Advanced Visualizations
  • Sharing Your Insights Through Dashboards
  • Capstone Project

SQL Essentials for Data Science and AI (Self-paced)

  • Introduction to RDBMS and MySQL
  • Database Modeling
  • Creating Databases and Tables
  • Querying and Modifying Tables
  • Joins and Functions in MySQL
  • Database Integration with Python

Mastering ChatGPT (Self-paced)

  • Introduction to OpenAI and ChatGPT
  • Business Use Cases of ChatGPT
  • Deploying and Integrating ChatGPT in Business Applications
  • GPT Models, Pre-processing and Fine-tuning ChatGPT
  • Working with GPT-3 and OpenAI API

Sequence Learning (Self-paced)

  • Module 1: Introduction to Sequence Learning
  • Module 2: RNN vs LSTM with Google Stock Price
  • Module 3: Sentiment Analysis on Zomato Reviews using LSTM
  • Module 4: Introduction to the Transformer Model
  • Module 5: BERT and GPT2 using Transformer
  • Module 6: Machine Translation with MT5
  • Module 7: Building a Question-Answer Prediction Model using BERT
  • Module 8: Assessment

Power BI for Data Visualization (Self-paced)

  • Power BI Desktop and Data Transformation
  • Data Analysis Expressions (DAX)
  • Data Visualization and Power BI Service

9+ Data Science and AI Industry Projects

Industry projects will be a part of your Advanced Certificate Program in Data Science and AI to consolidate your learning. Industry projects will ensure you have the real-world experience to start your career in Data Science and AI.

Stock Market Analysis

Domain: Finance and Investment Scenario: Data Science extracts meaningful insight from chunks of raw data, which is useful to different business segments for plannin...

911 Call Data Analysis

Domain: Public safety and Emergency Scenario: This data is from Montgomery County, Pennsylvania, USA. 911 is the most important social security feature of ...
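
As a rough illustration of what a first pass at a call-data analysis like this might look like, the sketch below counts calls per category and per hour of day using only the Python standard library. The sample rows are made up, and the column names `title` and `timeStamp` are assumptions for illustration; the actual project dataset may be structured differently.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Made-up sample rows for illustration only. The column names
# "title" and "timeStamp" are assumptions, not the project's schema.
SAMPLE_CSV = """title,timeStamp
EMS: BACK PAINS/INJURY,2015-12-10 17:40:00
Fire: GAS-ODOR/LEAK,2015-12-10 17:40:00
EMS: CARDIAC EMERGENCY,2015-12-10 18:05:00
Traffic: VEHICLE ACCIDENT,2015-12-11 08:15:00
Traffic: DISABLED VEHICLE,2015-12-11 08:47:00
EMS: FALL VICTIM,2015-12-11 09:02:00
"""


def summarize_calls(csv_text):
    """Count calls per category (text before the colon) and per hour of day."""
    by_category = Counter()
    by_hour = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "EMS: BACK PAINS/INJURY" -> "EMS"
        category = row["title"].split(":", 1)[0].strip()
        by_category[category] += 1
        hour = datetime.strptime(row["timeStamp"], "%Y-%m-%d %H:%M:%S").hour
        by_hour[hour] += 1
    return by_category, by_hour


by_category, by_hour = summarize_calls(SAMPLE_CSV)
print(by_category.most_common())   # categories, busiest first
print(sorted(by_hour.items()))     # (hour, call count) pairs
```

On a real dataset, the same grouping would more likely be done with Pandas, which the curriculum above covers; the standard-library version is used here only to keep the sketch self-contained.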

ML Model Building for MyCars Vehicle Store

Domain: Retail Scenario: "MyCars" is a new age startup laying foundations for setting up a car resell domain and they are setting up a team of ML experts to make predic...

Diagnosis of Heart Disease using Machine Learning

Domain: Healthcare Scenario: “AIHealth” is a new age startup laying foundations in the healthcare domain by solving some of the most prominent problems by using ...

MyHostels Data Insights

Domain: Hospitality Scenario: MyHostels in Europe is a very popular hostel chain across various countries, and they want to focus on hostel travelers in the next financ...

Travel Aggregator Analysis

Domain: Travel and Tourism Scenario: A new Indian start-up, "MyNextBooking” is an aggregator (comparisons for getting the best price on online travel booking) on top...

Netflix-like Recommender System

Domain: OTT Scenario: “MyNextMovie” is a budding startup in the space of recommendations on top of various OTT platforms providing suggestions to its customer base ...
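
One common starting point for a recommender of this kind is user-based collaborative filtering: score the movies a user has not seen by the ratings of similar users. The sketch below is a minimal version under stated assumptions. The ratings, user names, and the `recommend` helper are hypothetical, cosine similarity is one of several possible similarity measures, and nothing here is claimed to be the approach used in the course project.

```python
from math import sqrt

# Hypothetical user -> {movie: rating} data, made up for illustration.
RATINGS = {
    "ana":   {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "ben":   {"Movie A": 4, "Movie B": 5},
    "chloe": {"Movie B": 2, "Movie C": 5, "Movie D": 4},
    "dev":   {"Movie A": 5, "Movie D": 2},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (missing = 0)."""
    shared = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in shared)
    norm = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return dot / norm if norm else 0.0


def recommend(user, ratings, top_n=2):
    """Rank unseen movies by the similarity-weighted average rating of other users."""
    seen = ratings[user]
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in rated movies
        for movie, rating in other_ratings.items():
            if movie not in seen:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    ranked = sorted(((scores[m] / weights[m], m) for m in scores), reverse=True)
    return [movie for _, movie in ranked[:top_n]]


print(recommend("ben", RATINGS))
```

A production system would also need to handle new users with no ratings, rating scale differences between users, and much larger data, which this sketch deliberately ignores.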

Criminal Identification and Detection App

Domain: Security and Defense Scenario: The INTERPOL Face Recognition System (IFRS) contains facial images received from more than 179 countries which makes it a uniq...

Building Interactive Dashboards for a Retail Store

Domain: Retail Scenario: You have been recruited as a freelancer for a Retail store that supplies Furniture, Office Supplies and Technology products to customers acr...

AI and Data Science Program by IIT Guwahati Admission Process

Enrolment Form

You will be required to provide personal, educational, and professional details. Once we receive your details, our Admissions Counselor will reach out to take your candidature further.

Interview and Offer Letter

A one-on-one chat with our SME to understand your basic knowledge, prior work experience, and your expectations from the course. After your interview assessment, you will receive an offer letter from us.

Payment & Batch Allotment

Based on your application form and interview performance, your final fee will be determined. After you make an upfront payment to confirm your batch, you will be given course credentials and your learning journey will begin!

  • Is this AI and Data Science program online or offline?
  • What is data science?
  • What is artificial intelligence?
  • How long is this Data Science Program by IIT Guwahati?
  • What will be the mode of instruction for this Data Science Program by IIT Guwahati?
  • What is this AI and Data Science Program by IIT Guwahati?
  • What if I miss a live class?
  • Who will be the instructor for this Data Science Program by IIT Guwahati?
  • Who is a data scientist?
  • Who is an AI engineer?
  • When will the live classes be conducted?
  • How much time will I have to spend to complete this Data Science Program by IIT Guwahati?
  • Can this program help me learn data science and AI from scratch?
  • What can I expect from the advanced cloud lab?
  • Can I pursue this Data Science Program by IIT Guwahati after my graduation?
  • Is this a degree program?
  • What are the other Artificial Intelligence and Data Science programs offered by Edureka apart from this course? (Artificial Intelligence Course, Machine Learning Course Masters Program, Data Science Course)

  • Who is eligible to apply for this Data Science Program by IIT Guwahati?
  • Can I enroll in this online program if I don't have any prior knowledge of data science or AI?
  • What is the admissions process for this Data Science Program by IIT Guwahati?
  • How can I reach out to the admissions office?
  • Who will conduct the admissions interview?
  • What is the fee for this Advanced Certificate Program in Data Science and AI?
  • Is there an EMI plan?
  • Can I get a loan for this Data Science Program by IIT Guwahati?
  • If I opt out of this online Data Science Program by IIT Guwahati, will I get a refund?
  • Can I change the batch after enrolling in the Data Science Program by IIT Guwahati?
  • What do I need to attend the program (internet, system specifications, etc.)?
  • Will I be provided assistance if I find it difficult to understand the concepts?

IMAGES

  1. Request a Powerful Data Science Capstone from Us & Shine

  2. Capstone Project Final Report Sample by CapstoneProject

  3. Capstone Project Ideas For Data Analytics

  4. Capstone Project Ideas for Data Science

  5. An Exemplary Data Science Capstone, Annotated

  6. Capstone Project Ideas For Data Analytics

VIDEO

  1. Rental Car Program

  2. Week 12

  3. Data Science Capstone Project Spotlight: Language Detection App

  4. Project Capstone Basis Data Kelompok 9

  5. Grade 7 Science Investigatory Project Oral Defense

  6. Demonstration Video Capstone Project Demeter

COMMENTS

  1. 21 Interesting Data Science Capstone Project Ideas [2024]

    Best Data Science Capstone Project Ideas - According to Skill Level. Data science capstone projects are a great way to showcase your skills and apply what you've learned in a real-world context. Here are some project ideas categorized by skill level: Beginner-Level Data Science Capstone Project Ideas. 1. Exploratory Data Analysis (EDA) on a ...

  2. 10 Unique Data Science Capstone Project Ideas

    Project Idea #10: Building a Chatbot. A chatbot is a computer program that uses artificial intelligence to simulate human conversation. It can interact with users in a natural language through text or voice. Building a chatbot can be an exciting and challenging data science capstone project.

  3. An Exemplary Data Science Capstone, Annotated

    Since there was a lot of content, I'll conclude with my top three tips for doing a great data science capstone project: Choose a good data set: a small, uninteresting, or otherwise hard-to-analyze data set will make it substantially harder to make a great project. Include all of the following: Data cleaning.

  4. Final Capstone Project for IBM Data Science Professional ...

    Final Capstone Project for IBM Data Science Professional Certification - GitHub - vikthak/IBM-AppliedDataScience-Capstone-FINAL: Final Capstone Project for IBM Data Science Professional Certification

  5. A friendly walk-through of a Data Science Capstone Project

    Many websites and online courses focus on what beginners need to learn in order to become data scientists or on the importance of doing capstone projects to showcase one's skills.

  6. UCSD Data Science Capstone Projects: 2021-2022

    This page contains the project materials for UCSD's Data Science Capstone sequence. Projects are grouped into subject-matter areas called domains of inquiry, led by the domain mentors listed below. Each project listing contains: The title and abstract, A link to the project's website. A link to the project's code repository.

  7. GitHub

    Executive summary. In this capstone project, we will predict if the SpaceX Falcon 9 first stage will land successfully using several machine learning classification algorithms. The main steps in this project include: Data collection, wrangling, and formatting. Exploratory data analysis. Interactive data visualization. Machine learning prediction.

  8. Capstone Projects

    Faculty-Sponsored Capstone Projects. A DSI faculty member proposes a research project and advises a team of students working on this project. This is a great way to run a research project with enthusiastic students, eager to try out their newly acquired data science skills in a research setting.

  9. Data Science Capstone Projects #18

    Data Science Capstone Projects #18. by Ekaterina Butyugina. 2 August 2022. In this blog post, we highlight the projects from both the part-time and full-time Data Science students that were completed at the end of the program. Take a look at the results they've achieved in such a short period of time.

  10. Data Science Capstone Course by Johns Hopkins University

    Introduction to Task 1: Getting and Cleaning the Data • 1 minute. Regular Expressions: Part 1 (Optional) • 5 minutes. Regular Expressions: Part 2 (Optional) • 8 minutes. 6 readings • Total 52 minutes. A Note of Explanation • 2 minutes. Project Overview • 10 minutes. Syllabus • 10 minutes.

  11. Data Science Project Ideas To Try

    A data science project is a practical application of your skills. A typical data science project allows you to use skills in data collection, cleaning, exploratory data analysis, visualization, programming, machine learning, and so on. It helps you take your skills to solve real-world problems.

  12. Data Science Capstone Projects From Praxis Business School

    Program Details and Capstone Projects. For people who are not aware - Praxis Business School offers a year-long program - PGP in Data Science with ML & AI at both its campuses - Kolkata and Bengaluru. The program is structured in a manner where the first 9 months are spent in the classroom with in-house and industry faculty and the last 3 ...

  13. Applied Data Science (MS) Student Capstone Projects

    Case Analysis Capstone (ADS670) aims to develop both technical and soft skills that are not directly taught in the traditional courses in the program, but are relevant and critical in order to develop, innovate and communicate in modern data science. This is a project-oriented capstone that will harness the skills gained throughout the program.

  14. Capstone Projects

    The culminating experience in the Master's in Applied Data Science program is a Capstone Project where you'll put your knowledge and skills into practice. You will immerse yourself in a real business problem and will gain valuable, data driven insights using authentic data. Together with project sponsors, you will develop a data science ...

  15. MATH 4995: Capstone Project for Data Science

    Project description: [ project2.pdf ] Kaggle Contest: Pawpularity Contest: Predict the popularity of shelter pet photos. Kaggle Contest: Natural Language Processing with Disaster Tweets: Predict which Tweets are about real disasters and which ones are not. Kaggle Contest: Predict Survival on the Titanic.

  16. Capstone Projects

    Capstone Projects. M.S. in Data Science students are required to complete a capstone project. Capstone projects challenge students to acquire and analyze data to solve real-world problems. Project teams consist of two to four students and a faculty advisor. Teams select their capstone project at the beginning of the year and work on the project ...

  17. Data Science Capstone

    This capstone course is the culmination of the Master of Liberal Arts, data science, where students execute their research proposal from CSCI S-597. It gives students the opportunity to collaborate on a complex research topic using their data science skills. At the completion of the capstone, students are able to demonstrate their ability to ...

  18. Master's in Data Science

    CDS master's students have a unique opportunity to solve real-world problems through the capstone course in the final year of their program. The capstone course is designed to apply knowledge into practice and to develop and improve critical skills such as problem-solving …

  19. How I created my first Data Analytics Capstone Project

    I have chosen the cycle data location about Divi Bikes as a Capstone project here; the data is provided by the course and is completely open source, under a free license to use for analysis purposes.

  20. Thesis/Capstone for Master's in Data Science

    Capstone Checklist. Start early: set aside a minimum of one to two months prior to the capstone quarter to determine the industry and modeling interests. Networking: pitch your idea to potential organizations for projects and focus on the business benefits you can provide. Permission request: make sure your final project can be shared with others in the course and the information ...

  21. Python Capstone Project

    Explore and run machine learning code with Kaggle Notebooks | Using data from [Private Datasource]

  22. Capstone Project Ideas For Data Analytics

    Vertical Institute is now showcasing capstone projects from our alumni! See samples here! ... Data Science; Bank Loan Payment Analysis. Data Analytics Capstone Project by Ng Shao Zhi. Bank Marketing Campaign.

  23. Master of Applied Data Science

    Grow your data science career with a Data Science Master's degree from the University of Michigan, one of the best-ranked universities in the world. ... with three capstone courses that award 2-3 credits and may last longer than one month. Starting with the Fall 2020 cohort, students will be able to take a full time load, which is equivalent ...

  24. Capstone Project

    Additional field-based credits from the capstone program accrue over the final 12 months of the 20-month MSHS program. The capstone project may be conducted on campus at Cedars-Sinai or in other approved healthcare organizations. Students work on their capstone project on a schedule agreed upon with their primary mentor.

  25. Capstone Projects

    Capstone Projects. Online M.S. in Data Science students are required to complete a capstone project. Capstone projects challenge students to acquire and analyze data to solve real-world problems. Project teams consist of two to four students and a faculty advisor. Teams select their capstone project in term 4 and work on the project in term 5 ...

  26. Advanced Certification in Data Science and AI by IIT Guwahati

    About the Data Science Program by IIT Guwahati. E&ICT Academy, IIT Guwahati, and Edureka have collaborated to create this Advanced Certificate Program in Data Science and AI. The Indian Institute of Technology, Guwahati, is one of the top 10 engineering institutions as per the latest NIRF rankings and among the top 150 universities in Asia.
