Master's theses in Computer Vision

If you want to do your master's thesis project within the field of Computer Vision, there are several options:

  • Internal Master's thesis at the Computer Vision Lab (CVL). Internal master's theses are normally connected to a research project and explore a specific research idea. Some project suggestions are listed here: CVL Master's thesis proposal repository. If you already have an idea for a project, you may also contact one of the CVL examiners directly; see the list of examiners below.
  • External Master's thesis at a company. We maintain a list of research projects defined by external partners. These project proposals are found here: External Master's thesis project proposals. Future external projects may** also be posted here: LiU exjobbsportal. If you do not find an interesting project on the pages above, you may also contact companies/organisations directly. Often they have plans for projects, or are able to create a new one for you. A list of suitable companies can be found here: Computer Vision oriented companies/organisations.

If you have tried the possibilities above and still have not found an interesting project, you can also contact one of the examiners at CVL directly; see the list of examiners below.

Assignment of examiner and internal supervisor

Examiners for a Master's thesis in computer vision:

  • Per-Erik Forssén (CVL Master's thesis coordinator)
  • Michael Felsberg
  • Maria Magnusson
  • Mårten Wadenbäck
  • Bastian Wandt
  • Jörgen Ahlberg
  • Amanda Berg
  • Leif Haglund
  • Lasse Alfredsson

Assignment of examiner is made after you contact the coordinator or an examiner (you will not necessarily get the one you contact). When contacting an examiner, you should provide the following information:

  • Your name and personal number (we need to check your qualifications in Ladok)
  • Name of the company and email to a contact person (for external Master's projects)
  • Whether it is a master's thesis or bachelor's thesis
  • When you want to start
  • A project description (e.g. the ad from the company).
  • Suggested course code for the project, corresponding to your main field of study (Sv: huvudområde), e.g. TQET33, TQDT33, TQME33, TQMD33, TQTM33.

Thesis presentation in Swedish

  • English to Swedish translations for Computer Vision (in Swedish) .
  • Swedish Optical Terminology (in Swedish).
  • Statistical terms in Swedish (in Swedish).

Scientific publication of a master's thesis work

It is not uncommon for master's theses in computer vision to be of such quality that they can be turned into scientific publications. This usually requires a substantial amount of extra work, but can be a good achievement to put on your CV. If you are interested in submitting your work for peer review at a conference or in a journal, ask your examiner or university supervisor for advice on how to frame the work and where to submit it. If you feel that your university supervisor has helped you substantially, also consider inviting them as a co-author.

Other information sources

  • University regulations regarding Master's thesis projects are defined in Studieinfo.
  • There are also department-specific rules and practical information.
  • Information about master's theses from LiTH. (will soon be moved)
  • An attendance form for master's thesis presentations (Framläggningsblankett).
  • Publishing your student thesis page at LiU Electronic Press. We recommend that the defence is announced one week in advance on the vision-seminars mailing list; list subscribers can do this by sending an email to: vision-seminars.isy AT lists.liu.se .
  • Session 1 (Klas Nordberg)
  • Session 2 (Marcus Wallenberg)
  • Automatic grammar checking tools for the English language are highly recommended. One such tool is Grammarly.
  • Help with writing in English is also available from Academic English Support at IKK.
  • ** : This is contingent on this site being fixed to (i) allow a proposal to be categorized under more than one "Main field of study", as all Computer Vision projects fall under 2-4 "Main fields of study", and (ii) allow easier inclusion of PDF attachments. Right now a proposal has to first be added, then removed, then found, then the attachment added, then the proposal added again.

A list of completed theses and new thesis topics from the Computer Vision Group.

Are you about to start a BSc or MSc thesis? Please read our instructions for preparing and delivering your work.

Below we list possible thesis topics for Bachelor and Master students in the areas of Computer Vision, Machine Learning, Deep Learning and Pattern Recognition. The project descriptions leave plenty of room for your own ideas. If you would like to discuss a topic in detail, please contact the supervisor listed below and Prof. Paolo Favaro to schedule a meeting. Note that for MSc students in Computer Science it is required that the official advisor is a professor in CS.

AI deconvolution of light microscopy images

Level: master.

Background Light microscopy has become an indispensable tool in life sciences research. Deconvolution is an important image processing step for improving the quality of microscopy images: it removes out-of-focus light and yields higher resolution and a better signal-to-noise ratio. Classical deconvolution methods, such as regularisation or blind deconvolution, are currently implemented in numerous commercial software packages and widely used in research. Recently, AI-based deconvolution algorithms have been introduced and are being actively developed, as they have shown high application potential.
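For concreteness, here is a toy version of the kind of classical baseline the AI methods would be validated against: Richardson-Lucy deconvolution via scikit-image. The synthetic image and the Gaussian PSF are our own assumptions for illustration, not part of the project definition.

```python
import numpy as np
from scipy.signal import convolve2d
from skimage.restoration import richardson_lucy

rng = np.random.default_rng(0)

# Toy "microscopy" image: bright point sources on a dark background.
image = np.zeros((128, 128))
image[rng.integers(0, 128, 20), rng.integers(0, 128, 20)] = 1.0

# Simple Gaussian PSF standing in for the microscope's blur kernel.
x = np.arange(-7, 8)
g = np.exp(-x**2 / (2 * 2.0**2))
psf = np.outer(g, g)
psf /= psf.sum()

# Simulate acquisition: blur plus a little sensor noise.
blurred = convolve2d(image, psf, mode="same")
noisy = np.clip(blurred + 0.01 * rng.standard_normal(blurred.shape), 0, 1)

# Classical iterative deconvolution; an AI method would replace this step.
restored = richardson_lucy(noisy, psf, num_iter=30)
print(restored.shape)  # (128, 128)
```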

Aim Adaptation of available AI algorithms for the deconvolution of microscopy images, and validation of these methods against state-of-the-art commercially available deconvolution software.

Material and Methods The student will implement and further develop available AI deconvolution methods and acquire test microscopy images of different modalities. The performance of the developed AI algorithms will be validated against available commercial deconvolution software.

Nature of the Thesis:

  • AI algorithm development and implementation: 50%.
  • Data acquisition: 10%.
  • Comparison of performance: 40%.

Requirements

  • Interest in imaging.
  • Solid knowledge of AI.
  • Good programming skills.

Supervisors Paolo Favaro, Guillaume Witz, Yury Belyaev.

Institutes Computer Vision Group, Digital Science Lab, Microscopy Imaging Center.

Contact Yury Belyaev, Microscopy Imaging Center, [email protected], +41 78 899 0110.

Instance segmentation of cryo-ET images

Level: bachelor/master.

In the 1600s, a pioneering Dutch scientist named Antonie van Leeuwenhoek embarked on a remarkable journey that would forever transform our understanding of the natural world. Armed with a simple yet ingenious invention, the light microscope, he delved into uncharted territory, peering through its lens to reveal the hidden wonders of microscopic structures. Fast forward to today, where cryo-electron tomography (cryo-ET) has emerged as a groundbreaking technique, allowing researchers to study proteins within their natural cellular environments. Proteins, functioning as vital nano-machines, play crucial roles in life, and understanding their localization and interactions is key to both basic research and disease comprehension. However, cryo-ET images pose challenges due to inherent noise and a scarcity of annotated data for training deep learning models.


Credit: S. Albert et al./PNAS (CC BY 4.0)

To address these challenges, this project aims to develop a self-supervised pipeline utilizing diffusion models for instance segmentation in cryo-ET images. By leveraging the power of diffusion models, which iteratively diffuse information to capture underlying patterns, the pipeline aims to refine and accurately segment cryo-ET images. Self-supervised learning, which relies on unlabeled data, reduces the dependence on extensive manual annotations. Successful implementation of this pipeline could revolutionize the field of structural biology, facilitating the analysis of protein distribution and organization within cellular contexts. Moreover, it has the potential to alleviate the limitations posed by limited annotated data, enabling more efficient extraction of valuable information from cryo-ET images and advancing biomedical applications by enhancing our understanding of protein behavior.

Methods The segmentation pipeline for cryo-electron tomography (cryo-ET) images consists of two stages: training a diffusion model for image generation and training an instance segmentation U-Net using synthetic and real segmentation masks.

    1. Diffusion Model Training:
        a. Data Collection: Collect and curate cryo-ET image datasets from the EMPIAR database (https://www.ebi.ac.uk/empiar/).
        b. Architecture Design: Select an appropriate architecture for the diffusion model.
        c. Model Evaluation: Cryo-ET experts will help assess image quality and fidelity through visual inspection and quantitative measures.
    2. Building the Segmentation Dataset:
        a. Synthetic and real mask generation: Use the trained diffusion model to generate synthetic cryo-ET images. The diffusion process will be seeded from either a real or a synthetic segmentation mask. This will yield pairs of cryo-ET images and segmentation masks.
    3. Instance Segmentation U-Net Training:
        a. Architecture Design: Choose an appropriate instance segmentation U-Net architecture.
        b. Model Evaluation: Evaluate the trained U-Net using precision, recall, and F1 score metrics.

By combining the diffusion model for cryo-ET image generation and the instance segmentation U-Net, this pipeline provides an efficient and accurate approach to segment structures in cryo-ET images, facilitating further analysis and interpretation.
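To make stage 1 concrete, below is a minimal sketch of one training step of a DDPM-style diffusion model (the noise-prediction objective). The tiny convolutional `denoiser` is a placeholder for whatever architecture step 1b selects, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # cumulative signal fraction

denoiser = nn.Conv2d(1, 1, 3, padding=1)           # stand-in for a real U-Net

def ddpm_loss(x0):
    """One noise-prediction step on a batch of cryo-ET slices x0 in [-1, 1]."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                  # random diffusion timestep
    a = alpha_bar[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps     # forward (noising) process
    eps_hat = denoiser(x_t)                        # a real model also takes t
    return F.mse_loss(eps_hat, eps)

loss = ddpm_loss(torch.randn(4, 1, 64, 64))        # fake batch of slices
loss.backward()
```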

References
    1. Kwon, Diana. "The secret lives of cells - as never seen before." Nature 598.7882 (2021): 558-560.
    2. Moebel, Emmanuel, et al. "Deep learning improves macromolecule identification in 3D cellular cryo-electron tomograms." Nature Methods 18.11 (2021): 1386-1394.
    3. Rice, Gavin, et al. "TomoTwin: generalized 3D localization of macromolecules in cryo-electron tomograms with structural data mining." Nature Methods (2023): 1-10.

Contacts Prof. Thomas Lemmin, Institute of Biochemistry and Molecular Medicine, Bühlstrasse 28, 3012 Bern ( [email protected] )

Prof. Paolo Favaro, Institute of Computer Science, Neubrückstrasse 10, 3012 Bern ( [email protected] )

Adding and removing multiple sclerosis lesions in MR imaging with diffusion networks

Background Multiple sclerosis lesions are the result of demyelination: they appear as dark spots on T1-weighted MRI and as bright spots on FLAIR MRI. Image analysis for MS patients requires both the accurate detection of new and enhancing lesions, and the assessment of atrophy via local thickness and/or volume changes in the cortex. Detection of new and growing lesions is possible using deep learning, but is made difficult by the relative lack of training data; meanwhile, cortical morphometry can be affected by the presence of lesions, meaning that removing lesions prior to morphometry may be more robust. Existing 'lesion filling' methods are rather crude, yielding unrealistic-appearing brains where the borders of the removed lesions are clearly visible.

Aim: Denoising diffusion networks are the current gold standard in MRI image generation [1]: we aim to leverage this technology to remove and add lesions to existing MRI images. This will allow us to create realistic synthetic MRI images for training and validating MS lesion segmentation algorithms, and for investigating the sensitivity of morphometry software to the presence of MS lesions at a variety of lesion load levels.

Materials and Methods: A large, annotated, heterogeneous dataset of MRI data from MS patients, as well as images of healthy controls without white matter lesions, will be available for developing the method. The student will work in a research group with a long track record in applying deep learning methods to neuroimaging data, as well as experience training denoising diffusion networks.

Nature of the Thesis:

Literature review: 10%

Replication of Blob Loss paper: 10%

Implementation of the sliding window metrics: 10%

Training on MS lesion segmentation task: 30%

Extension to other datasets: 20%

Results analysis: 20%

Fig. Results of an existing lesion filling algorithm, showing inadequate performance

Requirements:

Interest/Experience with image processing

Python programming knowledge (PyTorch a bonus)

Interest in neuroimaging

Supervisor(s):

PD Dr. Richard McKinley

Institutes: Diagnostic and Interventional Neuroradiology

Center for Artificial Intelligence in Medicine (CAIM), University of Bern

References: [1] Brain Imaging Generation with Latent Diffusion Models, Pinaya et al., accepted at the Deep Generative Models workshop @ MICCAI 2022, https://arxiv.org/abs/2209.07162

Contact : PD Dr Richard McKinley, Support Centre for Advanced Neuroimaging ( [email protected] )

Improving metrics and loss functions for targets with imbalanced size: sliding window Dice coefficient and loss.

Background The Dice coefficient is the most commonly used metric for segmentation quality in medical imaging, and a differentiable version of the coefficient is often used as a loss function, in particular for small target classes such as multiple sclerosis lesions. The Dice coefficient has the benefit that it is applicable where the target class is in the minority (for example, when segmenting small lesions). However, if lesion sizes are mixed, the loss and metric are biased towards performance on large lesions, leading smaller lesions to be missed and harming overall lesion detection. A recently proposed loss function (blob loss [1]) aims to combat this by treating each connected component of a lesion mask separately, and claims improvements over Dice loss on lesion detection scores in a variety of tasks.

Aim: The aim of this thesis is twofold. First, to benchmark blob loss against a simple, potentially superior loss for instance detection: sliding-window Dice loss, in which the Dice loss is calculated over a sliding window across the area/volume of the medical image. Second, we will investigate whether a sliding-window Dice coefficient is better correlated with lesion-wise detection metrics than the Dice coefficient, and may serve as an alternative metric capturing both global and instance-wise detection.
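To make the proposal concrete, here is a minimal sketch of a sliding-window Dice metric on binary masks. The window size, stride, and the decision to skip empty windows are our illustrative assumptions, and exactly the kind of choices the thesis would study.

```python
import torch

def sliding_window_dice(pred, target, window=32, stride=32, eps=1e-6):
    """Mean Dice over local windows of binary (H, W) masks, so that small
    lesions count as much as large ones within their own windows."""
    scores = []
    for i in range(0, pred.shape[0] - window + 1, stride):
        for j in range(0, pred.shape[1] - window + 1, stride):
            p = pred[i:i + window, j:j + window].float()
            t = target[i:i + window, j:j + window].float()
            if p.sum() == 0 and t.sum() == 0:
                continue  # skip empty windows so they do not inflate the score
            scores.append((2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps))
    return torch.stack(scores).mean() if scores else torch.tensor(1.0)

pred = torch.zeros(128, 128); pred[10:14, 10:14] = 1    # small predicted lesion
target = torch.zeros(128, 128); target[10:15, 10:15] = 1
print(sliding_window_dice(pred, target))
```

Run on soft (unthresholded) predictions, the same computation is differentiable and could serve directly as the sliding-window Dice loss.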

Materials and Methods: A large, annotated, heterogeneous dataset of MRI data from MS patients will be available for benchmarking the method, as well as our existing codebases for MS lesion segmentation.  Extension of the method to other diseases and datasets (such as covered in the blob loss paper) will make the method more plausible for publication.  The student will work alongside clinicians and engineers carrying out research in multiple sclerosis lesion segmentation, in particular in the context of our running project supported by the CAIM grant.


Fig. An annotated MS lesion case, showing the variety of lesion sizes

References: [1] blob loss: instance imbalance aware loss functions for semantic segmentation, Kofler et al, https://arxiv.org/abs/2205.08209

Idempotent and partial skull-stripping in multispectral MRI imaging

Background Skull stripping (or brain extraction) refers to the masking of non-brain tissue from structural MRI imaging.  Since 3D MRI sequences allow reconstruction of facial features, many data providers supply data only after skull-stripping, making this a vital tool in data sharing.  Furthermore, skull-stripping is an important pre-processing step in many neuroimaging pipelines, even in the deep-learning era: while many methods could now operate on data with skull present, they have been trained only on skull-stripped data and therefore produce spurious results on data with the skull present.

High-quality skull-stripping algorithms based on deep learning are now widely available: the most prominent example is HD-BET [1]. A major downside of HD-BET is its behaviour on datasets to which skull-stripping has already been applied: in this case the algorithm falsely identifies brain tissue as skull and masks it. A skull-stripping algorithm F not exhibiting this behaviour would be idempotent: F(F(x)) = F(x) for any image x. Furthermore, legacy datasets from before the availability of high-quality skull-stripping algorithms may still contain images which have been inadequately skull-stripped: currently the only way to improve the skull-stripping of this data is to go back to the original data source or to correct the skull-stripping manually, which is time-consuming and prone to error.
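As an illustration of how idempotency might be encouraged during training, one option is an explicit consistency penalty between F(x) and F(F(x)). The sketch below uses a toy 3D mask network and toy volume shapes; it is one possible formulation under our own assumptions, not the project's prescribed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a brain-mask network: predicts a soft mask in [0, 1].
net = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv3d(8, 1, 3, padding=1), nn.Sigmoid())

def strip(x):
    """Skull-stripping as masking: apply the predicted mask to the volume."""
    return net(x) * x

x = torch.randn(1, 1, 32, 32, 32)        # toy MRI volume
once = strip(x)                          # F(x)
twice = strip(once)                      # F(F(x))

# Consistency term pushing F(F(x)) towards F(x); in training this would be
# added to the usual supervised segmentation loss.
idem_loss = F.mse_loss(twice, once.detach())
idem_loss.backward()
```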

Aim: In this project, the student will develop an idempotent skull-stripping network which can also handle partially skull-stripped inputs. In the best case, the network will operate well on a large subset of the data we work with (e.g. structural MRI, diffusion-weighted MRI, perfusion-weighted MRI, susceptibility-weighted MRI, at a variety of field strengths) to maximize the future applicability of the network across the teams in our group.

Materials and Methods: Multiple datasets, both publicly available and internal (encompassing thousands of 3D volumes) will be available. Silver standard reference data for standard sequences at 1.5T and 3T can be generated using existing tools such as HD-BET: for other sequences and field strengths semi-supervised learning or methods improving robustness to domain shift may be employed.  Robustness to partial skull-stripping may be induced by a combination of learning theory and model-based approaches.

Nature of the Thesis:

Dataset curation: 10%

Idempotent skull-stripping model building: 30%

Modelling of partial skull-stripping: 10%

Extension of model to handle partial skull: 30%

Results analysis: 10%

Fig. An example of failed skull-stripping requiring manual correction

References: [1] Isensee, F., Schell, M., Pflueger, I., et al. Automated brain extraction of multisequence MRI using artificial neural networks. Hum Brain Mapp. 2019; 40: 4952-4964. https://doi.org/10.1002/hbm.24750

Automated leaf detection and leaf area estimation (for Arabidopsis thaliana)

Correlating plant phenotypes such as leaf area or number of leaves to the genotype (i.e. changes in DNA) is a common goal for plant breeders and molecular biologists. Such data not only helps to understand fundamental processes in nature, but can also help to improve ecotypes, e.g. to perform better under climate change or to reduce fertiliser input. However, collecting data for many plants is very time-consuming, and automated data acquisition is necessary.

The project aims at building a machine learning model to automatically detect plants in top-view images (see examples below), segment their leaves (see Fig. C) and estimate the leaf area. This information will then be used to determine the leaf area of different Arabidopsis ecotypes. The project will be carried out in collaboration with researchers of the Institute of Plant Sciences at the University of Bern. It will also involve the design and creation of a dataset of plant top-views with the corresponding annotation (provided by experts at the Institute of Plant Sciences).

Fig. Example top-view images of Arabidopsis plants and their leaf segmentation (C).
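One plausible starting point for the detection and segmentation component (an assumption on our part, not a prescribed method) is fine-tuning a pretrained Mask R-CNN with a single "leaf" class and summing predicted mask areas per plant:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + leaf
model = maskrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the box and mask heads for the single leaf class (to be fine-tuned).
in_feat = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)
in_feat_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feat_mask, 256, num_classes)

model.eval()
with torch.no_grad():
    out = model([torch.rand(3, 512, 512)])[0]   # one top-view image

# Leaf area in pixels: sum of thresholded predicted masks (heads are still
# untrained here, so this only illustrates the shapes of the outputs).
leaf_area = (out["masks"] > 0.5).sum().item()
print(out["boxes"].shape, leaf_area)
```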

Contact: Prof. Dr. Paolo Favaro ( [email protected] )

Master Projects at the ARTORG Center

The Gerontechnology and Rehabilitation group at the ARTORG Center for Biomedical Engineering is offering multiple MSc thesis projects to students who are interested in working with real patient data, artificial intelligence and machine learning algorithms. The goal of these projects is to transfer the findings to the clinic in order to solve today's healthcare problems and thus to improve the quality of life of patients.

  • Assessment of Digital Biomarkers at Home by Radar. [PDF]
  • Comparison of Radar, Seismograph and Ballistocardiography to Monitor Sleep at Home. [PDF]
  • Sentiment Analysis in Speech. [PDF]

Contact: Dr. Stephan Gerber ( [email protected] )

Internship in Computational Imaging at Prophesee

A 6-month internship at Prophesee, Grenoble, is offered to a talented Master student.

The topic of the internship is burst imaging, following the work of Sam Hasinoff, and exploring ways to improve it using event-based vision.

Compensation to cover the expenses of living in Grenoble is offered. Only students who have the legal right to work in France can apply.

Anyone interested can send an email with their CV to Daniele Perrone ( [email protected] ).

Using machine learning applied to wearables to predict mental health

This Master's project lies at the intersection of psychiatry and computer science and aims to use machine learning techniques to improve health. Using sensors to detect sleep and waking behavior has as-yet unexplored potential to reveal insights into health. In this study, we make use of a watch-like device, called an actigraph, which tracks motion to quantify sleep behavior and waking activity. Participants in the study consist of healthy and depressed adolescents who wear actigraphs for a year, during which time we query their mental health status monthly using online questionnaires. For this Master's thesis we aim to use machine learning methods to predict mental health based on the data from the actigraph. The ability to predict mental health crises based on sleep and wake behavior would provide an opportunity for intervention, significantly impacting the lives of patients and their families. This Master's thesis is a collaboration between Professor Paolo Favaro at the Institute of Computer Science ( [email protected] ) and Dr. Leila Tarokh at the Universitäre Psychiatrische Dienste (UPD) ( [email protected] ). We are looking for a highly motivated individual interested in bridging disciplines.

Bachelor or Master Projects at the ARTORG Center

The Gerontechnology and Rehabilitation group at the ARTORG Center for Biomedical Engineering is offering multiple BSc and MSc thesis projects to students who are interested in working with real patient data, artificial intelligence and machine learning algorithms. The goal of these projects is to transfer the findings to the clinic in order to solve today's healthcare problems and thus to improve the quality of life of patients.

  • Machine Learning Based Gait-Parameter Extraction by Using Simple Rangefinder Technology. [PDF]
  • Detection of Motion in Video Recordings. [PDF]
  • Home-Monitoring of Elderly by Radar. [PDF]
  • Gait feature detection in Parkinson's Disease. [PDF]
  • Development of an arthroscopic training device using virtual reality. [PDF]

Contact: Dr. Stephan Gerber ( [email protected] ), Michael Single ( [email protected] )

Dynamic Transformer

Level: bachelor.

Visual Transformers have obtained state-of-the-art classification accuracies [ViT, DeiT, T2T, BoTNet]. Mixtures of experts can be used to increase the capacity of a neural network by learning instance-dependent execution pathways in a network [MoE]. In this research project we aim to push transformers to their limit and combine their dynamic attention with MoEs: compared to the Switch Transformer [Switch], we will use a much more efficient formulation of mixing [CondConv, DynamicConv], and we will apply this idea in the attention part of the transformer, not the fully connected layer.

  • Input dependent attention kernel generation for better transformer layers.
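A rough sketch of the core mechanism: per-instance soft routing over K expert weight matrices for the QKV projection, i.e. CondConv/DynamicConv-style mixing applied to attention rather than Switch-style hard routing. All module and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DynamicQKV(nn.Module):
    """QKV projection whose weights are an input-dependent mix of K experts."""
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(num_experts, dim, 3 * dim) * 0.02)
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):                                    # x: (B, N, dim)
        route = self.router(x.mean(dim=1)).softmax(dim=-1)   # (B, K) per image
        # Soft-mix the K expert matrices into one projection per instance.
        w = torch.einsum("bk,kde->bde", route, self.experts) # (B, dim, 3*dim)
        return torch.einsum("bnd,bde->bne", x, w)            # (B, N, 3*dim)

qkv = DynamicQKV(dim=192)
out = qkv(torch.randn(8, 197, 192))   # e.g. ViT-Tiny token sequences
print(out.shape)                      # torch.Size([8, 197, 576])
```

Because the routing produces mixing coefficients rather than a hard expert choice, the projection cost stays that of a single expert while capacity grows with K.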

Publication Opportunity: Dynamic Neural Networks Meets Computer Vision (a CVPR 2021 Workshop)

Extensions:

  • The same idea could be extended to other ViT/Transformer based models [DETR, SETR, LSTR, TrackFormer, BERT]

Related Papers:

  • Visual Transformers: Token-based Image Representation and Processing for Computer Vision [ViT]
  • DeiT: Data-efficient Image Transformers [DeiT]
  • Bottleneck Transformers for Visual Recognition [BoTNet]
  • Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [T2TViT]
  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [MoE]
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Switch]
  • CondConv: Conditionally Parameterized Convolutions for Efficient Inference [CondConv]
  • Dynamic Convolution: Attention over Convolution Kernels [DynamicConv]
  • End-to-End Object Detection with Transformers [DETR]
  • Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [SETR]
  • End-to-end Lane Shape Prediction with Transformers [LSTR]
  • TrackFormer: Multi-Object Tracking with Transformers [TrackFormer]
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [BERT]

Contact: Sepehr Sameni

Video Transformers

Visual Transformers have obtained state-of-the-art classification accuracies for 2D images [ViT, DeiT, T2T, BoTNet]. In this project, we aim to extend the same ideas to 3D data (videos), which requires a more efficient attention mechanism [Performer, Axial, Linformer]. In order to accelerate the training process, we could use the [Multigrid] technique.

  • Better video understanding by attention blocks.

Publication Opportunity: LOVEU (a CVPR workshop) , Holistic Video Understanding (a CVPR workshop) , ActivityNet (a CVPR workshop)

Related Papers:

  • Rethinking Attention with Performers [Performer]
  • Axial Attention in Multidimensional Transformers [Axial]
  • Linformer: Self-Attention with Linear Complexity [Linformer]
  • A Multigrid Method for Efficiently Training Video Models [Multigrid]

Inverting GIRAFFE

GIRAFFE is a newly introduced GAN that can generate scenes via composition with minimal supervision [GIRAFFE]. Generative methods can implicitly learn interpretable representations, as can be seen in GAN image interpretations [GANSpace, GanLatentDiscovery]. Decoding GIRAFFE could give us per-object interpretable representations that could be used for scene manipulation, data augmentation, scene understanding, semantic segmentation, pose estimation [iNeRF], and more.

In order to invert a GIRAFFE model, we will first train the generative model on the Clevr and CompCars datasets, then add a decoder to the pipeline and train the resulting autoencoder. We can make the task easier by knowing the number of objects in the scene and/or their positions.

Goals:  

Scene Manipulation and Decomposition by Inverting the GIRAFFE 

Publication Opportunity:  DynaVis 2021 (a CVPR workshop on Dynamic Scene Reconstruction)  

Related Papers: 

  • GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields [GIRAFFE] 
  • Neural Scene Graphs for Dynamic Scenes 
  • pixelNeRF: Neural Radiance Fields from One or Few Images [pixelNeRF] 
  • NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis [NeRF] 
  • Neural Volume Rendering: NeRF And Beyond 
  • GANSpace: Discovering Interpretable GAN Controls [GANSpace] 
  • Unsupervised Discovery of Interpretable Directions in the GAN Latent Space [GanLatentDiscovery] 
  • Inverting Neural Radiance Fields for Pose Estimation [iNeRF] 

Quantized ViT

Visual Transformers have obtained state-of-the-art classification accuracies [ViT, CLIP, DeiT], but the best ViT models are extremely compute-heavy, and running them even only for inference (without backpropagation) is expensive. Running transformers cheaply via quantization is not a new problem; it has been tackled before for BERT [BERT] in NLP [Q-BERT, Q8BERT, TernaryBERT, BinaryBERT]. In this project we will try to quantize pretrained ViT models.
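As a baseline, post-training dynamic quantization of a ViT's linear layers (where most of the weights live) takes only a few lines in PyTorch; obtaining the pretrained model via timm is our assumption about tooling, not part of the project definition.

```python
import torch
import torch.nn as nn
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

# Quantize all nn.Linear weights to int8; activations are quantized on the
# fly at inference time (CPU execution).
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    print(qmodel(x).shape)  # logits, e.g. torch.Size([1, 1000])
```

Measuring how far such simple recipes fall short of quantization-aware approaches like the BERT-specific methods below would be a natural first experiment.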

Quantizing ViT models for faster inference and smaller models without losing accuracy 

Publication Opportunity:  Binary Networks for Computer Vision 2021 (a CVPR workshop)  

Extensions:  

  • Having a fast pipeline for image inference with ViT will allow us to dig deep into the attention of ViT and analyze it; we might be able to prune some attention heads or replace them with static patterns (like local convolutions or dilated patterns), and we might even be able to replace the transformer with a Performer and increase the throughput even more [Performer].
  • The same idea could be extended to other ViT based models [DETR, SETR, LSTR, TrackFormer, CPTR, BoTNet, T2TViT] 
Related Papers:

  • Learning Transferable Visual Models From Natural Language Supervision [CLIP]
  • Visual Transformers: Token-based Image Representation and Processing for Computer Vision [ViT] 
  • DeiT: Data-efficient Image Transformers [DeiT] 
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [BERT] 
  • Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT [Q-BERT] 
  • Q8BERT: Quantized 8Bit BERT [Q8BERT] 
  • TernaryBERT: Distillation-aware Ultra-low Bit BERT [TernaryBERT] 
  • BinaryBERT: Pushing the Limit of BERT Quantization [BinaryBERT] 
  • Rethinking Attention with Performers [Performer] 
  • End-to-End Object Detection with Transformers [DETR] 
  • Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [SETR] 
  • End-to-end Lane Shape Prediction with Transformers [LSTR] 
  • TrackFormer: Multi-Object Tracking with Transformers [TrackFormer] 
  • CPTR: Full Transformer Network for Image Captioning [CPTR] 
  • Bottleneck Transformers for Visual Recognition [BoTNet] 
  • Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [T2TViT] 

Multimodal Contrastive Learning

Recently, contrastive learning has gained a lot of attention for self-supervised image representation learning [SimCLR, MoCo]. Contrastive learning can be extended to multimodal data, like videos (images and audio) [CMC, CoCLR]. Most contrastive methods require large batch sizes (or large memory pools), which makes them expensive to train. In this project we are going to use batch-size-independent contrastive methods [SwAV, BYOL, SimSiam] to train multimodal representation extractors.

Our main goal is to compare the proposed method with the CMC baseline, so we will be working with STL10, ImageNet, UCF101, HMDB51, and NYU Depth-V2 datasets. 

Inspired by recent works on smaller datasets [ConVIRT, CPD], to accelerate training we could start from two pretrained single-modal models and finetune them with the proposed method.
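The sketch below shows the kind of batch-size-independent objective we have in mind: a SimSiam-style symmetrized negative-cosine loss between audio and video embeddings, with a stop-gradient on the target branch. The linear encoders and the predictor are placeholders for real backbones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

video_enc = nn.Linear(512, 128)   # stand-ins for real video/audio backbones
audio_enc = nn.Linear(128, 128)
predictor = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 128))

def neg_cos(p, z):
    """Negative cosine similarity with stop-gradient on the target branch."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

v = video_enc(torch.randn(16, 512))   # video features of a batch of clips
a = audio_enc(torch.randn(16, 128))   # audio features of the same clips

# Symmetrized loss: each modality predicts the other; no negative pairs,
# hence no dependence on large batches or memory pools.
loss = 0.5 * neg_cos(predictor(v), a) + 0.5 * neg_cos(predictor(a), v)
loss.backward()
```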

  • Extending SwAV to multimodal datasets 
  • Grasping a better understanding of the BYOL 

Publication Opportunity:  MULA 2021 (a CVPR workshop on Multimodal Learning and Applications)  

Extensions:

  • Most knowledge distillation methods for contrastive learners also use large batch sizes (or memory pools) [CRD, SEED]; the proposed method could be extended to knowledge distillation.
  • One could easily extend this idea to multiview learning: for example, one could have two different networks working on the same input and train them with contrastive learning, which may lead to better models [DeiT] through cross-model inductive bias communication.
Related Papers:

  • Self-supervised Co-training for Video Representation Learning [CoCLR]
  • Learning Spatiotemporal Features via Video and Text Pair Discrimination [CPD] 
  • Audio-Visual Instance Discrimination with Cross-Modal Agreement [AVID-CMA] 
  • Self-Supervised Learning by Cross-Modal Audio-Video Clustering [XDC] 
  • Contrastive Multiview Coding [CMC]
  • Contrastive Learning of Medical Visual Representations from Paired Images and Text [ConVIRT] 
  • A Simple Framework for Contrastive Learning of Visual Representations [SimCLR] 
  • Momentum Contrast for Unsupervised Visual Representation Learning [MoCo] 
  • Bootstrap your own latent: A new approach to self-supervised Learning [BYOL] 
  • Exploring Simple Siamese Representation Learning [SimSiam] 
  • Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [SwAV] 
  • Contrastive Representation Distillation [CRD] 
  • SEED: Self-supervised Distillation For Visual Representation [SEED] 

Robustness of Neural Networks

Neural Networks have been found to achieve surprising performance in several tasks such as classification, detection and segmentation. However, they are also very sensitive to small (controlled) changes to the input. It has been shown that some changes to an image that are not visible to the naked eye may lead the network to output an incorrect label. This thesis will focus on studying recent progress in this area and aim to build a procedure for a trained network to self-assess its reliability in classification or one of the popular computer vision tasks.
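As a concrete illustration of this sensitivity, the classic fast gradient sign method (FGSM) builds a perturbation from the sign of the loss gradient; with a tiny step size the change is near-invisible yet often flips the predicted label. A minimal sketch (the random input stands in for a real preprocessed image):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights="DEFAULT").eval()
x = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image
y = model(x).argmax(dim=1)                          # current predicted label

# Gradient of the loss with respect to the input pixels.
loss = F.cross_entropy(model(x), y)
loss.backward()

eps = 2.0 / 255.0                                   # near-invisible step size
x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
print(model(x_adv).argmax(dim=1).item(), y.item())  # labels often differ
```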

Contact: Paolo Favaro

Masters projects at sitem center

The Personalised Medicine Research Group at the sitem Center for Translational Medicine and Biomedical Entrepreneurship is offering multiple MSc thesis projects to biomedical engineering MSc students that may also be of interest to computer science students.

  • Automated quantification of cartilage quality for hip treatment decision support. PDF
  • Automated quantification of massive rotator cuff tears from MRI. PDF
  • Deep learning-based segmentation and fat fraction analysis of the shoulder muscles using quantitative MRI. PDF
  • Unsupervised Domain Adaption for Cross-Modality Hip Joint Segmentation. PDF

Contact: Dr. Kate Gerber

Internships/Master thesis @ Chronocam

3-6 month internships on event-based computer vision. Chronocam is a rapidly growing startup developing event-based technology, with more than 15 PhDs working on problems like tracking, detection, classification, SLAM, etc. Event-based computer vision has the potential to solve many long-standing problems in traditional computer vision, and this is a super exciting time as this potential is becoming more and more tangible in many real-world applications. For next year we are looking for motivated Master and PhD students with good software engineering skills (C++ and/or Python), and preferably a good computer vision and deep learning background. PhD internships will be more research-focused and could lead to a publication. For each intern we offer compensation to cover the expenses of living in Paris. A list of some of the topics we want to explore:

  • Photo-realistic image synthesis and super-resolution from event-based data (PhD)
  • Self-supervised representation learning (PhD)
  • End-to-end Feature Learning for Event-based Data
  • Bio-inspired Filtering using Spiking Networks
  • On-the-fly Compression of Event-based Streams for Low-Power IoT Cameras
  • Tracking of Multiple Objects with a Dual-Frequency Tracker
  • Event-based Autofocus
  • Stabilizing an Event-based Stream using an IMU
  • Crowd Monitoring for Low-power IoT Cameras
  • Road Extraction from an Event-based Camera Mounted in a Car for Autonomous Driving
  • Sign detection from an Event-based Camera Mounted in a Car for Autonomous Driving
  • High-frequency Eye Tracking

Email with attached CV to Daniele Perrone at  [email protected] .

Contact: Daniele Perrone

Object Detection in 3D Point Clouds

Today we have many 3D scanning techniques that allow us to capture the shape and appearance of objects. It is easier than ever to scan real 3D objects and transform them into a digital model for further processing, such as modeling, rendering or animation. However, the output of a 3D scanner is often a raw point cloud with little to no annotations. The unstructured nature of the point cloud representation makes it difficult for processing, e.g. surface reconstruction. One application is the detection and segmentation of an object of interest.  In this project, the student is challenged to design a system that takes a point cloud (a 3D scan) as input and outputs the names of objects contained in the scan. This output can then be used to eliminate outliers or points that belong to the background. The approach involves collecting a large dataset of 3D scans and training a neural network on it.

Contact: Adrian Wälchli

Shape Reconstruction from a Single RGB Image or Depth Map

A photograph accurately captures the world in a moment of time and from a specific perspective. Since it is a projection of the 3D space to a 2D image plane, the depth information is lost. Is it possible to restore it, given only a single photograph? In general, the answer is no. This problem is ill-posed, meaning that many different plausible depth maps exist, and there is no way of telling which one is the correct one.  However, if we cover one of our eyes, we are still able to recognize objects and estimate how far away they are. This motivates the exploration of an approach where prior knowledge can be leveraged to reduce the ill-posedness of the problem. Such a prior could be learned by a deep neural network, trained with many images and depth maps.

CNN Based Deblurring on Mobile

Deblurring finds many applications in our everyday life. It is particularly useful when taking pictures on handheld devices (e.g. smartphones) where camera shake can degrade important details. Therefore, it is desired to have a good deblurring algorithm implemented directly in the device.  In this project, the student will implement and optimize a state-of-the-art deblurring method based on a deep neural network for deployment on mobile phones (Android).  The goal is to reduce the number of network weights in order to reduce the memory footprint while preserving the quality of the deblurred images. The result will be a camera app that automatically deblurs the pictures, giving the user a choice of keeping the original or the deblurred image.

Depth from Blur

If an object in front of the camera or the camera itself moves while the aperture is open, the region of motion becomes blurred because the incoming light is accumulated in different positions across the sensor. If there is camera motion, there is also parallax. Thus, a motion-blurred image contains depth information.  In this project, the student will tackle the problem of recovering a depth map from a motion-blurred image. This includes the collection of a large dataset of blurred and sharp images or videos using a pair or triplet of GoPro action cameras. Two cameras will be used in stereo to estimate the depth map, and the third captures the blurred frames. This data is then used to train a convolutional neural network that will predict the depth map from the blurry image.

Unsupervised Clustering Based on Pretext Tasks

The idea of this project is that we have two types of neural networks that work together: There is one network A that assigns images to k clusters and k (simple) networks of type B perform a self-supervised task on those clusters. The goal of all the networks is to make the k networks of type B perform well on the task. The assumption is that clustering in semantically similar groups will help the networks of type B to perform well. This could be done on the MNIST dataset with B being linear classifiers and the task being rotation prediction.
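A toy sketch of this two-network setup on MNIST-sized inputs, with linear heads and rotation prediction as the pretext task. Design choices such as the Gumbel-softmax routing are illustrative assumptions, not part of the project definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k = 4                                       # number of clusters
A = nn.Sequential(nn.Flatten(), nn.Linear(784, k))   # network A: cluster assigner
B = nn.ModuleList(nn.Sequential(nn.Flatten(), nn.Linear(784, 4))
                  for _ in range(k))        # networks B: one rotation head per cluster

x = torch.randn(32, 1, 28, 28)              # stand-in for an MNIST batch
rot = torch.randint(0, 4, (32,))            # rotation class: 0/90/180/270 degrees
x_rot = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                     for img, r in zip(x, rot)])

assign = F.gumbel_softmax(A(x), hard=True)                 # (32, k) cluster choice
logits = torch.stack([head(x_rot) for head in B], dim=1)   # (32, k, 4)
logits = (assign.unsqueeze(-1) * logits).sum(dim=1)        # route to chosen head
loss = F.cross_entropy(logits, rot)   # A and B trained jointly on the pretext task
loss.backward()
```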

Adversarial Data-Augmentation

The student designs a data augmentation network that transforms training images in such a way that image realism is preserved (e.g. with a constrained spatial transformer network) and the transformed images are more difficult to classify (trained via adversarial loss against an image classifier). The model will be evaluated for different data settings (especially in the low data regime), for example on the MNIST and CIFAR datasets.

Unsupervised Learning of Lip-reading from Videos

People with sensory impairment (hearing, speech, vision) depend heavily on assistive technologies to communicate and navigate in everyday life. The mass production of media content today makes it impossible to manually translate everything into a common language for assistive technologies, e.g. captions or sign language.  In this project, the student employs a neural network to learn a representation for lip-movement in videos in an unsupervised fashion, possibly with an encoder-decoder structure where the decoder reconstructs the audio signal. This requires collecting a large dataset of videos (e.g. from YouTube) of speakers or conversations where lip movement is visible. The outcome will be a neural network that learns an audio-visual representation of lip movement in videos, which can then be leveraged to generate captions for hearing impaired persons.

Learning to Generate Topographic Maps from Satellite Images

Satellite images have many applications, e.g. in meteorology, geography, education, cartography and warfare. They are an accurate and detailed depiction of the surface of the earth from above. Although it is relatively simple to collect many satellite images in an automated way, challenges arise when processing them for use in navigation and cartography. The idea of this project is to automatically convert an arbitrary satellite image, of e.g. a city, to a map of simple 2D shapes (streets, houses, forests) and label them with colors (semantic segmentation). The student will collect a dataset of satellite image and topological maps and train a deep neural network that learns to map from one domain to the other. The data could be obtained from a Google Maps database or similar.

New Variables of Brain Morphometry: the Potential and Limitations of CNN Regression

Timo Blattner · Sept. 2022.

The calculation of variables of brain morphology is computationally very expensive and time-consuming. A previous work showed the feasibility of extracting the variables directly from T1-weighted brain MRI images using a convolutional neural network. We used significantly more data and extended their model to a new set of neuromorphological variables, which could become interesting biomarkers in the future for the diagnosis of brain diseases. The model shows for nearly all subjects a less than 5% mean relative absolute error. This high relative accuracy can be attributed to the low morphological variance between subjects and the ability of the model to predict the cortical atrophy age trend. The model however fails to capture all the variance in the data and shows large regional differences. We attribute these limitations in part to the moderate to poor reliability of the ground truth generated by FreeSurfer. We further investigated the effects of training data size and model complexity on this regression task and found that the size of the dataset had a significant impact on performance, while deeper models did not perform better. Lack of interpretability and dependence on a silver ground truth are the main drawbacks of this direct regression approach.

Home Monitoring by Radar

Lars Ziegler · Sept. 2022.

Detection and tracking of humans via UWB radars is a promising and continuously evolving field with great potential for medical technology. This contactless method of acquiring data on a patient's movement patterns is ideal for in-home application. As irregularities in a patient's movement patterns are an indicator of various health problems, including neurodegenerative diseases, the insight this data could provide may enable earlier detection of such problems. In this thesis a signal processing pipeline is presented with which a person's movement is modeled. During an experiment, 142 measurements were recorded by two separate radar systems and one lidar system, each of which consisted of multiple sensors. The models that were calculated on these measurements by the signal processing pipeline were used to predict the times when a person stood up or sat down. The predictions showed an accuracy of 72.2%.

Revisiting non-learning based 3D reconstruction from multiple images

Aaron Sägesser · Oct. 2021.

Arthroscopy consists of challenging tasks and requires skills that even today, young surgeons still train directly during surgery. Existing simulators are expensive and rarely available. Given the growing potential of virtual reality (VR) (head-mounted) devices for simulation and their applicability in the medical context, these devices have become a promising alternative that would be orders of magnitude cheaper and could be made widely available. To build a VR-based training device for arthroscopy is the overall aim of our project, as this would be of great benefit and might even be applicable in other minimally invasive surgery (MIS). This thesis marks a first step of the project, with its focus on exploring and comparing well-known algorithms in multi-view stereo (MVS) based 3D reconstruction with respect to imagery acquired by an arthroscopic camera. Simultaneously with this reconstruction, we aim to gain essential measures to compare the VR environment to the real world, as validation of the realism of future VR tasks. We evaluate 3 different feature extraction algorithms with 3 different matching techniques and 2 different algorithms for the estimation of the fundamental (F) matrix. The evaluation of these 18 different setups is made with a reconstruction pipeline embedded in a Jupyter notebook implemented in Python based on common computer vision libraries, and compared with imagery generated with a mobile phone as well as with the reconstruction results of state-of-the-art (SOTA) structure-from-motion (SfM) software COLMAP and Multi-View Environment (MVE). Our comparative analysis manifests the challenges of heavy distortion, the fish-eye shape and weak image quality of arthroscopic imagery, as all results are substantially worse using this data. However, there are huge differences regarding the different setups. Scale Invariant Feature Transform (SIFT) and Oriented FAST and Rotated BRIEF (ORB) in combination with k-Nearest Neighbour (kNN) matching and Least Median of Squares (LMedS) present the most promising results. Overall, the 3D reconstruction pipeline is a useful tool to foster the process of gaining measurements from the arthroscopic exploration device and to complement the comparative research in this context.

Examination of Unsupervised Representation Learning by Predicting Image Rotations

Eric Lagger · Sept. 2020.

In recent years deep convolutional neural networks have achieved a lot of progress. To train such a network a lot of data is required, and in supervised learning algorithms it is necessary that the data is labeled. Labeling data requires a lot of human work, which takes a lot of time and money. To avoid these inconveniences we would like to find systems that don't need labeled data and are therefore unsupervised learning algorithms. This is the importance of unsupervised algorithms, even though their outcome is not yet on the same qualitative level as that of supervised algorithms. In this thesis we discuss such an approach and compare the results to other papers. A deep convolutional neural network is trained to learn the rotations that have been applied to a picture. So we take a large number of images, apply some simple rotations, and the task of the network is to discover in which direction each image has been rotated. The data doesn't need to be labeled with any category. As long as all the pictures share a consistent upright orientation, we hope the network will find high-dimensional patterns to learn from.

StitchNet: Image Stitching using Autoencoders and Deep Convolutional Neural Networks

Maurice Rupp · Sept. 2019.

This thesis explores the prospect of artificial neural networks for image processing tasks. More specifically, it aims to achieve the goal of stitching multiple overlapping images to form a bigger, panoramic picture. Until now, this task has been approached solely with "classical", hardcoded algorithms, while deep learning is at most used for specific subtasks. This thesis introduces a novel end-to-end neural network approach to image stitching called StitchNet, which uses a pre-trained autoencoder and deep convolutional networks. In addition to presenting several new datasets for the task of supervised image stitching, each with 120'000 training and 5'000 validation samples, this thesis also conducts various experiments with different kinds of existing networks designed for image super-resolution and image segmentation, adapted to the task of image stitching. StitchNet outperforms most of the adapted networks in both quantitative as well as qualitative results.

Facial Expression Recognition in the Wild

Luca Rolshoven · Sept. 2019.

The idea of inferring the emotional state of a subject by looking at their face is nothing new. Neither is the idea of automating this process using computers. Researchers used to computationally extract handcrafted features from face images that had proven themselves to be effective, and then used machine learning techniques to classify the facial expressions using these features. Recently, there has been a trend towards using deep learning and especially Convolutional Neural Networks (CNNs) for the classification of these facial expressions. Researchers were able to achieve good results on images that were taken in laboratories under the same or at least similar conditions. However, these models do not perform very well on more arbitrary face images with different head poses and illumination. This thesis aims to show the challenges of Facial Expression Recognition (FER) in this wild setting. It presents the currently used datasets and the present state-of-the-art results on one of the biggest facial expression datasets currently available. The contributions of this thesis are twofold. Firstly, I analyze three famous neural network architectures and their effectiveness on the classification of facial expressions. Secondly, I present two modifications of one of these networks that lead to the proposed STN-COV model. While this model does not outperform all of the current state-of-the-art models, it does beat several of them.

A Study of 3D Reconstruction of Varying Objects with Deformable Parts Models

Raoul Grossenbacher · July 2019.

This work covers a new approach to 3D reconstruction. In traditional 3D reconstruction one uses multiple images of the same object to calculate a 3D model by taking information gained from the differences between the images, like camera position, illumination of the images, rotation of the object and so on, to compute a point cloud representing the object. The characteristic trait shared by all these approaches is that one can almost change everything about the image, but it is not possible to change the object itself, because one needs to find correspondences between the images. To be able to use different instances of the same object, we used a 3D DPM model that can find different parts of an object in an image, thereby detecting the correspondences between the different pictures, which we can then use to calculate the 3D model. To take this theory to practice, we gave a 3D DPM model, which was trained to detect cars, pictures of different car brands, where no pair of images showed the same vehicle, and used the detected correspondences and the Factorization Method to compute the 3D point cloud. This technique leads to a completely new approach to 3D reconstruction, because changing the object itself had never been done before.

Motion Deblurring in the Wild: Replication and Improvements

Alvaro Juan Lahiguera · Jan. 2019.

Coma Outcome Prediction with Convolutional Neural Networks

Stefan Jonas · Oct. 2018.

Automatic Correction of Self-Introduced Errors in Source Code

Sven Kellenberger · Aug. 2018.

Neural Face Transfer: Training a Deep Neural Network to Face-Swap

Till Nikolaus Schnabel · July 2018.

This thesis explores the field of artificial neural networks with realistic looking visual outputs. It aims at morphing face pictures of a specific identity to look like another individual by only modifying key features, such as eye color, while leaving identity-independent features unchanged. Prior works have covered the topic of symmetric translation between two specific domains but failed to optimize it on faces where only parts of the image may be changed. This work applies a face masking operation to the output at training time, which forces the image generator to preserve colors while altering the face, fitting it naturally inside the unmorphed surroundings. Various experiments are conducted including an ablation study on the final setting, decreasing the baseline identity switching performance from 81.7% to 75.8 % whilst improving the average χ2 color distance from 0.551 to 0.434. The provided code-based software gives users easy access to apply this neural face swap to images and videos of arbitrary crop and brings Computer Vision one step closer to replacing Computer Graphics in this specific area.

A Study of the Importance of Parts in the Deformable Parts Model

Sammer Puran · June 2017.

Self-Similarity as a Meta Feature

Lucas Husi · April 2017.

A Study of 3D Deformable Parts Models for Detection and Pose-Estimation

Simon Jenni · March 2015.

Accelerated Federated Learning on Client Silos with Label Noise: RHO Selection in Classification and Segmentation

Irakli Kelbakiani · May 2024.

Federated Learning has recently gained more research interest. This increased attention is caused by factors including the growth of decentralized data, privacy concerns, and new privacy regulations. In Federated Learning, remote servers keep training a model on local datasets independently, and subsequently the local models are aggregated into a global model, which achieves better overall performance. Sending local model weights instead of the entire dataset is a significant advantage of Federated Learning over centralized classical machine learning algorithms. However, Federated Learning involves uploading and downloading model parameters multiple times, so there are multiple communication rounds between the global server and remote client servers, which imposes challenges. The high number of necessary communication rounds not only increases communication overheads but is also a critical limitation for servers with low network bandwidth, leading to latency and a higher probability of training failures caused by communication breakdowns. To mitigate these challenges, we aim to provide a fast-convergent Federated Learning training methodology that decreases the number of necessary communication rounds. We build on the Reducible Holdout Loss Selection (RHO-Loss) batch selection methodology, which "selects low-noise, task-relevant, non-redundant points for training" [1]; we hypothesize that if client silos employ the RHO-Loss methodology and successfully avoid training their local models on noisy and non-relevant samples, clients may offer stable and consistent updates to the global server, which could lead to faster convergence of the global model. Our contribution focuses on investigating the RHO-Loss method in a simulated federated setting on the Clothing1M dataset. We also examine its applicability to medical datasets and check its effectiveness in a simulated federated environment. Our experimental results show a promising outcome, specifically a reduction in communication rounds for the Clothing1M dataset. However, as the success of the RHO-Loss selection method depends on the availability of sufficient training data for the target RHO model and for the irreducible RHO model, we emphasize that our contribution applies to those Federated Learning scenarios where client silos hold enough training data to successfully train and benefit from their RHO model on their local dataset.

Amodal Leaf Segmentation

Nicolas Maier · Nov. 2023.

Plant phenotyping is the process of measuring and analyzing various traits of plants. It provides essential information on how genetic and environmental factors affect plant growth and development. Manual phenotyping is highly time-consuming; therefore, many computer vision and machine learning based methods have been proposed in the past years to perform this task automatically based on images of the plants. However, the publicly available datasets (in particular, of Arabidopsis thaliana) are limited in size and diversity, making them unsuitable to generalize to new unseen environments. In this work, we propose a complete pipeline able to automatically extract traits of interest from an image of Arabidopsis thaliana. Our method uses a minimal amount of existing annotated data from a source domain to generate a large synthetic dataset adapted to a different target domain (e.g., different backgrounds, lighting conditions, and plant layouts). In addition, unlike the source dataset, the synthetic one provides ground-truth annotations for the occluded parts of the leaves, which are relevant when measuring some characteristics of the plant, e.g., its total area. This synthetic dataset is then used to train a model to perform amodal instance segmentation of the leaves to obtain the total area, leaf count, and color of each plant. To validate our approach, we create a small dataset composed of manually annotated real images of Arabidopsis thaliana, which is used to assess the performance of the models.

Assessment of movement and pose in a hospital bed by ambient and wearable sensor technology in healthy subjects

Tony Licata · Sept. 2022.

The use of automated systems for describing human motion has become possible in various domains. Most of the proposed systems are designed to work with people moving around in a standing position. Because such systems could be valuable in a medical environment, we propose in this work a pipeline that can effectively predict human motion for people lying in hospital beds. The proposed pipeline is tested on a dataset composed of 41 participants executing 7 predefined tasks in a bed. The motion of the participants is measured with video cameras, accelerometers, and a pressure mat. Various experiments are carried out with the information retrieved from the dataset, and two approaches for combining the data from the different measurement technologies are explored. We measure the performance of each experiment and assemble the proposed pipeline from the components providing the best results. Finally, we show that the proposed pipeline only needs the video cameras, which makes the setup easier to implement in real-life situations.

Machine Learning Based Prediction of Mental Health Using Wearable-measured Time Series

Seyedeh Sharareh Mirzargar · Sept. 2022.

Depression is the second leading cause of years lived with disability and has a growing prevalence among adolescents. The recent Covid-19 pandemic has intensified the situation and limited in-person patient monitoring due to distancing measures. Recent advances in wearable devices have made it possible to record the rest/activity cycle remotely, with high precision and in real-world contexts. We aim to use machine learning methods to predict an individual's mental health based on wearable-measured sleep and physical activity. Predicting an impending mental health crisis of an adolescent allows for prompt intervention, detection of depression onset or its recurrence, and remote monitoring. To achieve this goal, we train three primary forecasting models (linear regression, random forest, and light gradient boosted machine (LightGBM)) and two deep learning models (block recurrent neural network (block RNN) and temporal convolutional network (TCN)) on Actigraph measurements to forecast mental health in terms of depression, anxiety, sleepiness, stress, sleep quality, and behavioral problems. Our models achieve high forecasting performance, with the random forest performing best and reaching 98% accuracy for forecasting trait anxiety. We perform extensive experiments to evaluate the models' performance in terms of accuracy, generalization, and feature utilization, using a naive forecaster as the baseline. Our analysis shows minimal mental health changes over two months, making the prediction task easily achievable. Due to these minimal changes, the models tend to rely primarily on the historical values of the mental health evaluations rather than on the Actigraph features. At the time of this master's thesis, the data acquisition step is still in progress. In future work, we plan to train the models on the complete dataset using a longer forecasting horizon to increase the level of mental health changes, and to use transfer learning to compensate for the small dataset size. This interdisciplinary project demonstrates the opportunities and challenges of machine learning based prediction of mental health, paving the way toward using the same techniques to forecast other disorders such as internalizing disorder, Parkinson's disease, and Alzheimer's disease, and toward improving the quality of life of individuals with mental disorders.
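
The comparison against a naive forecaster can be set up in a few lines; this is a hedged sketch on synthetic data (the real features come from Actigraph recordings), not the thesis pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical data: rows are subjects, columns are weekly mental-health scores.
scores = 5.0 + np.cumsum(rng.normal(0, 0.1, size=(200, 9)), axis=1)
X, y = scores[:, :-1], scores[:, -1]             # forecast the final week

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rf_mae = np.mean(np.abs(rf.predict(X_te) - y_te))
naive_mae = np.mean(np.abs(X_te[:, -1] - y_te))  # "next week equals last week"
print(f"random forest MAE: {rf_mae:.3f} vs naive MAE: {naive_mae:.3f}")
```

Because the scores drift slowly, the naive baseline is hard to beat, which mirrors the observation above that the models lean on historical mental-health values.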

CNN Spike Detector: Detection of Spikes in Intracranial EEG using Convolutional Neural Networks

Stefan Jonas · Oct. 2021.

The detection of interictal epileptiform discharges in the visual analysis of electroencephalography (EEG) is an important but very difficult, tedious, and time-consuming task. There have been decades of research on computer-assisted detection algorithms, most recently focused on Convolutional Neural Networks (CNNs). In this thesis, we present the CNN Spike Detector, a convolutional neural network to detect spikes in intracranial EEG. Our dataset of 70 intracranial EEG recordings from 26 subjects with epilepsy introduces new challenges to this research field. We report cross-validation results with a mean AUC of 0.926 (±0.04), an area under the precision-recall curve (AUPRC) of 0.652 (±0.10), and 12.3 (±7.47) false positive epochs per minute at a sensitivity of 80%. A visual examination of false positive segments is performed to understand the model behavior behind the relatively high false detection rate. We note issues with the evaluation measures and highlight a major limitation of the common approach of detecting spikes in short segments, namely that the network is not able to consider the greater context of a segment with regard to its origin. For this reason, we present the Context Model, an extension in which the CNN Spike Detector is supplied with additional information about the channel. Results show promising but limited performance improvements. This thesis provides important findings about the spike detection task for intracranial EEG and lays out promising future research directions toward a network capable of assisting experts in real-world clinical applications.
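
For illustration, a toy version of a segment-level spike classifier might look as follows in PyTorch; the layer sizes and the 256-sample segment length are assumptions, not the thesis architecture:

```python
import torch
import torch.nn as nn

class SpikeDetector(nn.Module):
    # Toy 1D CNN that classifies short single-channel EEG segments
    # as spike / no-spike. Sizes are illustrative only.
    def __init__(self, segment_len=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.classifier = nn.Linear(32 * (segment_len // 16), 2)

    def forward(self, x):              # x: (batch, 1, segment_len)
        return self.classifier(self.features(x).flatten(1))

logits = SpikeDetector()(torch.randn(8, 1, 256))   # (8, 2) spike scores
```

A Context Model in the spirit described above would additionally feed channel-level information into the network, e.g., by concatenating such features before the final linear layer.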

PolitBERT - Deepfake Detection of American Politicians using Natural Language Processing

Maurice Rupp · April 2021.

This thesis explores the application of modern Natural Language Processing techniques to the detection of artificially generated videos of popular American politicians. Instead of focusing on detecting anomalies and artifacts in images and sounds, this thesis focuses on detecting irregularities and inconsistencies in the words themselves, opening up a new way to detect fake content. A novel, domain-adapted, pre-trained version of the language model BERT, combined with several mechanisms to overcome severe dataset imbalances, yielded the best quantitative as well as qualitative results. In addition to creating the largest publicly available dataset of English-speaking politicians' speech, consisting of 1.5M sentences from over 1,000 persons, this thesis conducts various experiments with different kinds of text classification and sequence processing algorithms applied to the political domain. Furthermore, multiple ablations to manage severe data imbalance are presented and evaluated.
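
One standard mechanism for the dataset-imbalance problem mentioned above is inverse-frequency class weighting in the loss; a minimal sketch (the counts are made up, and the thesis evaluates several such ablations rather than exactly this one):

```python
import torch
import torch.nn as nn

# Hypothetical label counts for a heavily imbalanced real/fake corpus.
counts = torch.tensor([95_000.0, 5_000.0])         # [real, fake]
weights = counts.sum() / (len(counts) * counts)    # inverse-frequency weights
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 2)                # classifier outputs (e.g. from BERT)
labels = torch.tensor([0, 1, 1, 0])
loss = loss_fn(logits, labels)            # minority-class errors weigh more
```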

A Study on the Inversion of Generative Adversarial Networks

Ramona Beck · March 2021.

The desire to use generative adversarial networks (GANs) for real-world tasks such as object segmentation or image manipulation is increasing as synthesis quality improves, which has given rise to an emerging research area called GAN inversion that focuses on exploring methods for embedding real images into the latent space of a GAN. In this work, we investigate different GAN inversion approaches using an existing generative model architecture that takes a completely unsupervised approach to object segmentation and is based on StyleGAN2. In particular, we propose and analyze algorithms for embedding real images into the different latent spaces Z, W, and W+ of StyleGAN following an optimization-based inversion approach, while also investigating a novel approach that allows fine-tuning of the generator during the inversion process. Furthermore, we investigate a hybrid and a learning-based inversion approach, where in the former we train an encoder with embeddings optimized by our best optimization-based inversion approach, and in the latter we define an autoencoder, consisting of an encoder and the generator of our generative model as a decoder, and train it to map an image into the latent space. We demonstrate the effectiveness of our methods as well as their limitations through a quantitative comparison with existing inversion methods and by conducting extensive qualitative and quantitative experiments with synthetic data as well as real images from a complex image dataset. We show that we achieve qualitatively satisfying embeddings in the W and W+ spaces with our optimization-based algorithms, that fine-tuning the generator during the inversion process leads to qualitatively better embeddings in all latent spaces studied, and that the learning-based approach also benefits from a variable generator as well as a pre-training with our hybrid approach. Furthermore, we evaluate our approaches on the object segmentation task and show that both our optimization-based and our hybrid and learning-based methods are able to generate meaningful embeddings that achieve reasonable object segmentations. Overall, our proposed methods illustrate the potential that lies in the GAN inversion and its application to real-world tasks, especially in the relaxed version of the GAN inversion where the weights of the generator are allowed to vary.
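
The optimization-based inversion approach reduces to fitting a latent code by gradient descent; a minimal sketch, assuming any differentiable PyTorch generator (L2 loss only; perceptual terms are common in practice):

```python
import torch

def invert(generator, target, latent_dim=512, steps=500, lr=0.05):
    # Optimization-based GAN inversion: fit a latent code z so that
    # G(z) reproduces the target image.
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((generator(z) - target) ** 2)
        loss.backward()
        opt.step()
    return z.detach()
```

The relaxed variant that fine-tunes the generator during inversion amounts to additionally exposing `generator.parameters()` to the optimizer.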

Multi-scale Momentum Contrast for Self-supervised Image Classification

Zhao Xueqi · Dec. 2020.

As supervised learning technology has matured, research focus has gradually shifted to self-supervised learning. "Momentum Contrast" (MoCo) proposed a new self-supervised learning method that raised the accuracy of self-supervised learning to a new level. Inspired by the article "Representation Learning by Learning to Count", we hypothesize that dividing an image into four parts and passing them through a neural network can further improve the accuracy of MoCo. Unlike the original MoCo, this variant (Multi-scale MoCo) does not pass the augmented image directly through the encoder. Instead, Multi-scale MoCo crops and resizes the augmented images, and the four resulting parts are passed through the encoder separately and then summed (in the upsampled version, the input is not resized; the contrastive samples are resized instead). This cropping scheme is applied not only to the query q but also to the key queue k; otherwise, the weights of the key encoder might be damaged during the momentum update. This is discussed further in the experiments chapter, where the downsampled Multi-scale version is compared with the version in which both streams are downsampled. Human object recognition follows the same principle: when people see something familiar, they can still guess the object with high probability even if it is not fully visible. Multi-scale MoCo applies this concept to the pretext part of MoCo in the hope of obtaining better feature extraction. In this thesis, there are three versions of Multi-scale MoCo: a downsampled-input-samples version, a downsampled-input-and-contrast-samples version, and an upsampled-input-samples version; their differences are described in more detail later. The backbone architecture is ResNet50, and the evaluation dataset is STL-10. The weights obtained in the pretext stage are transferred to the downstream classification stage, where all layers except the final linear layer are frozen (these weights come from the pretext task).
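
The four-part aggregation can be sketched as follows; the quadrant cropping, bilinear resizing back to the input resolution, and summation of embeddings are assumptions consistent with the description above, applied identically to the query and key encoders:

```python
import torch
import torch.nn.functional as F

def four_part_embedding(encoder, images):
    # images: (batch, 3, H, W) with even H and W.
    h, w = images.shape[-2] // 2, images.shape[-1] // 2
    quads = [images[..., :h, :w], images[..., :h, w:],
             images[..., h:, :w], images[..., h:, w:]]
    # Resize each crop back to the full input size, encode, and sum.
    quads = [F.interpolate(q, size=images.shape[-2:], mode="bilinear",
                           align_corners=False) for q in quads]
    return sum(encoder(q) for q in quads)
```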

Self-Supervised Learning Using Siamese Networks and Binary Classifier

Dušan Mihajlov · March 2020.

In this thesis, we present several approaches for training a convolutional neural network using only unlabeled data. Our self-supervised learning algorithms are based on the relationship between an image patch, i.e., a zoomed crop, and its original image. Using a Siamese network architecture, we aim to recognize whether the image patch fed to the first branch of the network comes from the same image presented to the second branch. By applying transformations to both images, and different zoom sizes at different positions, we force the network to extract high-level features using its convolutional layers. On top of our Siamese architecture, a simple binary classifier measures the difference between the extracted feature maps and makes the decision. Thus, the only way the classifier can solve the task correctly is if our convolutional layers extract useful representations. These representations can then be used to solve many different tasks related to the data used for unsupervised training. As the main benchmark for all of our models, we use the STL-10 dataset, a widely used benchmark for unsupervised learning, where we train a linear classifier on top of our convolutional layers with a small amount of manually labeled images. We also combine our idea with recent work on the same topic, the network called RotNet, which makes use of image rotations and therefore forces the network to learn rotation-dependent features from the dataset. As a result of this combination, we create a new procedure that outperforms the original RotNet.
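
A minimal sketch of the Siamese matcher with a binary head, assuming a shared convolutional backbone that outputs 512-dimensional features (all sizes illustrative):

```python
import torch
import torch.nn as nn

class SiamesePatchMatcher(nn.Module):
    # Does the zoomed patch (branch 1) come from the same image
    # as the full view (branch 2)?
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone                  # shared convolutional trunk
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, patch, image):
        f1, f2 = self.backbone(patch), self.backbone(image)
        return self.head(torch.cat([f1, f2], dim=1))  # logit: same image?
```

Positive pairs take the patch from the same image, negatives from a different one; training then uses a standard binary cross-entropy loss on the logit.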

Learning Object Representations by Mixing Scenes

Lukas Zbinden · May 2019.

In the digital age of ever-increasing data amassment and accessibility, the demand for scalable machine learning models effective at refining the new oil is unprecedented. Unsupervised representation learning methods present a promising approach to exploiting this invaluable yet unlabeled digital resource at scale. However, a majority of these approaches focus on synthetic or simplified datasets of images. What if a method could learn directly from natural Internet-scale image data? In this thesis, we propose a novel approach for unsupervised learning of object representations by mixing natural image scenes. Without any human help, our method mixes visually similar images to synthesize new realistic scenes using adversarial training. In this process, the model learns to represent and understand the objects prevalent in natural image data and makes them available for downstream applications. For example, it enables the transfer of objects from one scene to another. Through qualitative experiments on complex image data, we show the effectiveness of our method along with its limitations. Moreover, we benchmark our approach quantitatively against state-of-the-art works on the STL-10 dataset. Our proposed method demonstrates the potential that lies in learning representations directly from natural image data and reinforces it as a promising avenue for future research.

Representation Learning using Semantic Distances

Markus Roth · May 2019.

Zero-Shot Learning using Generative Adversarial Networks

Hamed Hemati · Dec. 2018.

Dimensionality Reduction via CNNs - Learning the Distance Between Images

Ioannis Glampedakis · Sept. 2018.

Learning to Play Othello using Deep Reinforcement Learning and Self Play

Thomas Simon Steinmann · Sept. 2018.

ABA-J Interactive Multi-modality Tissue Section-to-Volume Alignment: a Brain Atlasing Toolkit for ImageJ

Felix Meyenhofer · March 2018.

Learning Visual Odometry with Recurrent Neural Networks

Adrian Wälchli · Feb. 2018.

In computer vision, Visual Odometry is the problem of recovering the camera motion from a video. It is related to Structure from Motion, the problem of reconstructing the 3D geometry from a collection of images. Decades of research in these areas have produced successful algorithms that are used in applications like autonomous navigation, motion capture, augmented reality and others. Despite the success of these prior works in real-world environments, their robustness is highly dependent on manual calibration and on the magnitude of noise present in the images in the form of, e.g., non-Lambertian surfaces, dynamic motion, and other forms of ambiguity. This thesis explores an alternative approach to the Visual Odometry problem via Deep Learning, that is, a specific form of machine learning with artificial neural networks. It describes and focuses on the implementation of a recent work that proposes the use of Recurrent Neural Networks to learn dependencies over time due to the sequential nature of the input. Together with a convolutional neural network that extracts motion features from the input stream, the recurrent part accumulates knowledge from the past to make camera pose estimations at each point in time. An analysis of the performance of this system is carried out on real and synthetic data. The evaluation covers several ways of training the network as well as the impact and limitations of the recurrent connection for Visual Odometry.
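
In the spirit of the recurrent system described above, a compact sketch: a CNN turns each pair of stacked consecutive frames into a motion feature, and an LSTM integrates these over time to regress relative 6-DoF poses (all sizes illustrative, not the referenced architecture):

```python
import torch
import torch.nn as nn

class RecurrentVO(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN on stacked frame pairs (2 x 3 channels) -> motion feature.
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.LSTM(input_size=32, hidden_size=128, batch_first=True)
        self.pose = nn.Linear(128, 6)     # translation + rotation per step

    def forward(self, frame_pairs):       # (batch, time, 6, H, W)
        b, t = frame_pairs.shape[:2]
        feats = self.cnn(frame_pairs.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.rnn(feats)       # accumulate knowledge over time
        return self.pose(hidden)          # (batch, time, 6) relative poses
```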

Crime location and timing prediction

Bernard Swart · Jan. 2018.

From Cartoons to Real Images: An Approach to Unsupervised Visual Representation Learning

Simon Jenni · Feb. 2017.

Automatic and Large-Scale Assessment of Fluid in Retinal OCT Volume

Nina Mujkanovic · Dec. 2016.

Segmentation in 3D using Eye-Tracking Technology

Michele Wyss · July 2016.

Accurate Scale Thresholding via Logarithmic Total Variation Prior

Remo Diethelm · Aug. 2014.

Novel Techniques for Robust and Generalizable Machine Learning

Abdelhak Lemkhenter · Sept. 2023.

Neural networks have transcended their status as powerful proofs of concept to become a highly disruptive technology that has revolutionized many quantitative fields such as drug discovery, autonomous vehicles, and machine translation. Today, it is nearly impossible to go a single day without interacting with a neural network-powered application. From search engines to on-device photo processing, neural networks have become the go-to solution thanks to recent advances in computational hardware and an unprecedented scale of training data. Larger and less curated datasets, typically obtained through web crawling, have greatly propelled the capabilities of neural networks forward. However, this increase in scale amplifies certain challenges associated with training such models. Beyond toy or carefully curated datasets, data in the wild is plagued with biases, imbalances, and various noisy components. Given the larger size of modern neural networks, such models run the risk of learning spurious correlations that fail to generalize beyond their training data. This thesis addresses the problem of training more robust and generalizable machine learning models across a wide range of learning paradigms for medical time series and computer vision tasks. The former is a typical example of a low signal-to-noise-ratio data modality with a high degree of variability between subjects and datasets. There, we tailor the training scheme to focus on robust patterns that generalize to new subjects and to ignore the noisier, subject-specific patterns. To achieve this, we first introduce a physiologically inspired unsupervised training task and then extend it by explicitly optimizing for cross-dataset generalization using meta-learning. In the context of image classification, we address the challenge of training semi-supervised models under class imbalance by designing a novel label refinement strategy with higher local sensitivity to minority class samples while preserving the global data distribution. Lastly, we introduce a new Generative Adversarial Networks training loss. Such generative models could be applied to improve the training of subsequent models in the low-data regime by augmenting the dataset with generated samples. Unfortunately, GAN training relies on a delicate balance between its components, making it prone to mode collapse. Our contribution consists of defining a more principled GAN loss whose gradients incentivize the generator model to seek out missing modes in its distribution. All in all, this thesis tackles the challenge of training more robust machine learning models that can generalize beyond their training data. This necessitates the development of methods specifically tailored to handle the diverse biases and spurious correlations inherent in the data. It is important to note that achieving greater generalizability in models goes beyond simply increasing the volume of data; it requires meticulous consideration of training objectives and model architecture. By tackling these challenges, this research contributes to advancing the field of machine learning and underscores the significance of thoughtful design in obtaining more resilient and versatile models.

Automated Sleep Scoring, Deep Learning and Physician Supervision

Luigi Fiorillo · Oct. 2022.

Sleep plays a crucial role in human well-being. Polysomnography is used in sleep medicine as a diagnostic tool to objectively analyze the quality of sleep. Sleep scoring is the procedure of extracting sleep cycle information from whole-night electrophysiological signals. Scoring is done worldwide by sleep physicians according to the official American Academy of Sleep Medicine (AASM) scoring manual. In the last decades, a wide variety of deep learning based algorithms have been proposed to automate the sleep scoring task. In this thesis, we study the reasons why these algorithms have failed to enter the daily clinical routine, with the aim of bridging the existing gap between automatic sleep scoring models and sleep physicians. In this light, the primary step is the design of a simplified sleep scoring architecture that also provides an estimate of the model uncertainty. Besides achieving results on par with the most up-to-date scoring systems, we demonstrate the efficiency of ensemble learning based algorithms, together with label smoothing techniques, in both enhancing the performance and calibrating the simplified scoring model. We introduce an uncertainty estimation procedure to identify the most challenging sleep stage predictions and to quantify the disagreement between the predictions given by the model and the annotations given by the physicians. In this thesis, we also propose a novel method to integrate the inter-scorer variability into the training procedure of a sleep scoring model. We clearly show that a deep learning model is able to encode this variability so as to better adapt to the consensus of a group of scoring physicians. We finally address the generalization ability of a deep learning based sleep scoring system, further studying its resilience to sleep complexity and to the AASM scoring rules. We show that there is no need to train the algorithm strictly following the AASM guidelines. Most importantly, using data from multiple data centers results in a better performing model compared with training on a single data cohort. The variability among different scorers and data centers needs to be taken into account, more than the variability among sleep disorders.
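
Label smoothing, one of the calibration techniques mentioned above, can be written in a few lines; a sketch (recent PyTorch also exposes this directly via `F.cross_entropy(..., label_smoothing=...)`):

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, target, eps=0.1):
    # Each one-hot target keeps probability 1 - eps and spreads eps
    # uniformly over the remaining stages, which improves calibration.
    n = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_p, eps / (n - 1))
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)
    return -(smooth * log_p).sum(-1).mean()
```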

Learning Representations for Controllable Image Restoration

Givi Meishvili · March 2022.

Deep Convolutional Neural Networks have sparked a renaissance in all the sub-fields of computer vision. Tremendous progress has been made in the area of image restoration. The research community has pushed the boundaries of image deblurring, super-resolution, and denoising. However, given a distorted image, most existing methods typically produce a single restored output. The tasks mentioned above are inherently ill-posed, leading to an infinite number of plausible solutions. This thesis focuses on designing image restoration techniques capable of producing multiple restored results and granting users more control over the restoration process. Towards this goal, we demonstrate how one could leverage the power of unsupervised representation learning. Image restoration is vital when applied to distorted images of human faces due to their social significance. Generative Adversarial Networks enable an unprecedented level of generated facial details combined with smooth latent space. We leverage the power of GANs towards the goal of learning controllable neural face representations. We demonstrate how to learn an inverse mapping from image space to these latent representations, tuning these representations towards a specific task, and finally manipulating latent codes in these spaces. For example, we show how GANs and their inverse mappings enable the restoration and editing of faces in the context of extreme face super-resolution and the generation of novel view sharp videos from a single motion-blurred image of a face. This thesis also addresses more general blind super-resolution, denoising, and scratch removal problems, where blur kernels and noise levels are unknown. We resort to contrastive representation learning and first learn the latent space of degradations. We demonstrate that the learned representation allows inference of ground-truth degradation parameters and can guide the restoration process. Moreover, it enables control over the amount of deblurring and denoising in the restoration via manipulation of latent degradation features.

Learning Generalizable Visual Patterns Without Human Supervision

Simon Jenni · Oct. 2021.

Owing to the existence of large labeled datasets, Deep Convolutional Neural Networks have ushered in a renaissance in computer vision. However, almost all of the visual data we generate daily - several human lives worth of it - remains unlabeled and thus out of reach of today’s dominant supervised learning paradigm. This thesis focuses on techniques that steer deep models towards learning generalizable visual patterns without human supervision. Our primary tool in this endeavor is the design of Self-Supervised Learning tasks, i.e., pretext-tasks for which labels do not involve human labor. Besides enabling the learning from large amounts of unlabeled data, we demonstrate how self-supervision can capture relevant patterns that supervised learning largely misses. For example, we design learning tasks that learn deep representations capturing shape from images, motion from video, and 3D pose features from multi-view data. Notably, these tasks’ design follows a common principle: The recognition of data transformations. The strong performance of the learned representations on downstream vision tasks such as classification, segmentation, action recognition, or pose estimation validate this pretext-task design. This thesis also explores the use of Generative Adversarial Networks (GANs) for unsupervised representation learning. Besides leveraging generative adversarial learning to define image transformation for self-supervised learning tasks, we also address training instabilities of GANs through the use of noise. While unsupervised techniques can significantly reduce the burden of supervision, in the end, we still rely on some annotated examples to fine-tune learned representations towards a target task. To improve the learning from scarce or noisy labels, we describe a supervised learning algorithm with improved generalization in these challenging settings.
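
The common design principle, recognition of data transformations, is easiest to see in rotation prediction; a minimal sketch of building such a pretext batch (illustrative; one of many transformation-recognition tasks):

```python
import torch

def rotation_pretext_batch(images):
    # Rotate each image by 0/90/180/270 degrees and use the rotation
    # index as a free label: no human annotation involved.
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x = torch.randn(8, 3, 96, 96)
inputs, targets = rotation_pretext_batch(x)   # train a classifier on these
```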

Learning Interpretable Representations of Images

Attila Szabó · June 2019.

Computers represent images with pixels, and each pixel contains three numbers for its red, green, and blue colour values. These numbers are meaningless for humans, and they are mostly useless when used directly with classical machine learning techniques like linear classifiers. Interpretable representations are the attributes that humans understand: the colour of the hair, the viewpoint of a car, or the 3D shape of the object in the scene. Many computer vision tasks can be viewed as learning interpretable representations; for example, a supervised classification algorithm directly learns to represent images with their class labels. In this work, we aim to learn interpretable representations (or features) indirectly, with lower levels of supervision. This approach has the advantage of cost savings on dataset annotations and the flexibility of using the features for multiple follow-up tasks. We make contributions in three main areas: weakly supervised learning, unsupervised learning, and 3D reconstruction. In the weakly supervised case, we use image pairs as supervision. Each pair shares a common attribute and differs in a varying attribute. We propose a training method that learns to separate the attributes into separate feature vectors. These features are then used for attribute transfer and classification. We also show theoretical results on the ambiguities of the learning task and on ways to avoid degenerate solutions. We present a method for unsupervised representation learning that separates semantically meaningful concepts. We explain how the components of our proposed method (a mixing autoencoder, a generative adversarial network, and a classifier) work together, and support this with ablation studies. We propose a method for learning single-image 3D reconstruction. It uses only images: no human annotation, stereo pairs, synthetic renderings, or ground-truth depth maps are needed. We train a generative model that learns the 3D shape distribution and an encoder that reconstructs the 3D shape. For that, we exploit the notion of image realism: the 3D reconstruction of the object has to look realistic when rendered from different random angles. We prove the efficacy of our method from first principles.

Learning Controllable Representations for Image Synthesis

Qiyang Hu · June 2019.

In this thesis, our focus is on learning a controllable representation and applying the learned controllable feature representation to image synthesis, video generation, and even 3D reconstruction. We propose different methods to disentangle the feature representation in neural networks and analyze the challenges in disentanglement, such as reference ambiguity and the shortcut problem, when using weak labels. We use the disentangled feature representation to transfer attributes between images, such as exchanging hairstyles between two face images. Furthermore, we study how another type of feature, the sketch, works in a neural network. A sketch can provide the shape and contour of an object, such as the silhouette of a side-view face. We leverage this silhouette constraint to improve 3D face reconstruction from 2D images. A sketch can also provide the moving direction of an object, so we investigate how one can manipulate an object to follow the trajectory provided by a user sketch. We propose a method to automatically generate video clips from a single image input, using the sketch as motion and trajectory guidance to animate the object in that image. We demonstrate the efficiency of our approaches on several synthetic and real datasets.

Beyond Supervised Representation Learning

Mehdi Noroozi · Jan. 2019.

The complexity of any information processing task is highly dependent on the space in which data is represented. Unfortunately, pixel space is not appropriate for computer vision tasks such as object classification. Traditional computer vision approaches involve a multi-stage pipeline in which images are first transformed into a feature space through a handcrafted function, and the task is then solved in that feature space. The challenge with this approach is the complexity of designing handcrafted functions that extract robust features. Deep learning based approaches address this issue by end-to-end training of a neural network on some task, letting the network discover the appropriate representation for the training task automatically. It turns out that the image classification task on large-scale annotated datasets yields a representation transferable to other computer vision tasks. However, supervised representation learning is limited by the need for annotations. In this thesis, we study self-supervised representation learning, where the goal is to alleviate these limitations by substituting the classification task with pseudo tasks whose labels come for free. We discuss self-supervised learning by solving jigsaw puzzles, which uses context as the supervisory signal. The rationale behind this task is that the network must extract features about object parts and their spatial configurations to solve the jigsaw puzzles. We also discuss a method for representation learning that uses an artificial supervisory signal based on counting visual primitives. This supervisory signal is obtained from an equivariance relation. We use two image transformations in the context of counting: scaling and tiling. The first transformation exploits the fact that the number of visual primitives should be invariant to scale. The second transformation allows us to equate the total number of visual primitives in each tile to that in the whole image. The most effective transfer strategy is fine-tuning, which restricts one to using the same model, or parts thereof, for both the pretext and target tasks. We discuss a novel framework for self-supervised learning that overcomes limitations in designing and comparing different tasks, models, and data domains. In particular, our framework decouples the structure of the self-supervised model from the final task-specific fine-tuned model. Finally, we study the problem of multi-task representation learning. A naive approach to enhancing the representation learned by a task is to train the task jointly with other tasks that capture orthogonal attributes. However, having a diverse set of auxiliary tasks imposes challenges on multi-task training from scratch. We propose a framework that allows us to combine arbitrarily different feature spaces into a single deep neural network: we reduce the auxiliary tasks to classification tasks and, consequently, the multi-task learning to a multi-label classification task. Nevertheless, combining multiple representation spaces without being aware of the target task might be suboptimal. As our second contribution, we show empirically that this is indeed the case and propose to combine multiple tasks after fine-tuning on the target task.
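
The counting supervisory signal comes from the equivariance relation described above: the primitive counts of the four tiles should add up to the count of the whole, downscaled image. A minimal sketch (the published method also adds a contrastive term against a second image to rule out the trivial all-zero solution):

```python
import torch
import torch.nn.functional as F

def counting_loss(counter, image):
    # counter: network mapping an image to a non-negative count vector.
    h, w = image.shape[-2] // 2, image.shape[-1] // 2
    tiles = [image[..., :h, :w], image[..., :h, w:],
             image[..., h:, :w], image[..., h:, w:]]
    whole = F.interpolate(image, size=(h, w), mode="bilinear",
                          align_corners=False)   # scale invariance of counts
    tile_sum = sum(counter(t) for t in tiles)
    return torch.mean((counter(whole) - tile_sum) ** 2)
```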

Motion Deblurring from a Single Image

Meiguang Jin · Dec. 2018.

With the information explosion, a tremendous number of photos are captured and shared via social media every day. Technically, a photo requires a finite exposure to accumulate light from the scene; objects moving during the exposure therefore generate motion blur. Motion blur is an image degradation that makes visual content less interpretable and is often seen as a nuisance. Although motion blur can be reduced by setting a short exposure time, the insufficient amount of light then has to be compensated by increasing the sensor's sensitivity, which inevitably introduces a large amount of sensor noise. This motivates the need to remove motion blur computationally. Motion deblurring is an important problem in computer vision, and it is challenging due to its ill-posed nature, meaning the solution is not well defined. Mathematically, a blurry image caused by uniform motion is formed by the convolution between a blur kernel and a latent sharp image. There are potentially infinitely many pairs of blur kernel and latent sharp image that can result in the same blurry image; hence, some prior knowledge or regularization is required to address this problem. Even if the blur kernel is known, restoring the latent sharp image is still difficult, as high-frequency information has been removed. Although the uniform motion deblurring problem can be modeled mathematically, it only covers in-plane translational camera motion. In practice, motion is more complicated and can be non-uniform; non-uniform motion blur can come from many sources, such as out-of-plane camera rotation, scene depth changes, and object motion. It is therefore more challenging to remove non-uniform motion blur. In this thesis, our focus is motion blur removal, and we address four challenging motion deblurring problems. We start from the noise-blind image deblurring scenario, where the blur kernel is known but the noise level is unknown, and introduce an efficient and robust solution based on a Bayesian framework using a smooth generalization of the 0-1 loss. Then we study the blind uniform motion deblurring scenario, where both the blur kernel and the latent sharp image are unknown, and explore the relative scale ambiguity between the latent sharp image and the blur kernel to address this issue. Moreover, we study the face deblurring problem and introduce a novel deep learning network architecture to solve it. Finally, we address the general motion deblurring problem, where in particular we aim at recovering a sequence of 7 frames, each depicting some instantaneous motion of the objects in the scene.
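
The uniform blur formation model described above can be demonstrated in a few lines of NumPy/SciPy; the horizontal-motion kernel is a toy choice, not a kernel from the thesis:

```python
import numpy as np
from scipy.signal import convolve2d

# Uniform motion blur: blurry = kernel * sharp (+ noise).
sharp = np.random.rand(64, 64)
kernel = np.zeros((9, 9))
kernel[4, :] = 1.0 / 9.0          # toy horizontal motion-blur kernel
blurry = convolve2d(sharp, kernel, mode="same", boundary="symm")
blurry += np.random.normal(0, 0.01, blurry.shape)   # sensor noise
```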

Towards a Novel Paradigm in Blind Deconvolution: From Natural to Cartooned Image Statistics

Daniele Perrone · July 2015.

In this thesis we study the blind deconvolution problem. Blind deconvolution consists in the estimation of a sharp image and a blur kernel from an observed blurry image. Because the blur model admits several solutions, it is necessary to devise an image prior that favors the true blur kernel and sharp image. Recently it has been shown that a class of blind deconvolution formulations and image priors has the no-blur solution as its global minimum. Despite this shortcoming, algorithms based on these formulations and priors can successfully solve blind deconvolution. In this thesis we show that a suitable initialization can exploit the non-convexity of the problem and yield the desired solution. Based on these conclusions, we propose a novel "vanilla" algorithm stripped of any enhancement typically used in the literature. Our algorithm, despite its simplicity, is able to compete with the top performers on several datasets. We have also investigated a remarkable behavior of a 1998 algorithm whose formulation has the no-blur solution as its global minimum: even when initialized at the no-blur solution, it converges to the correct solution. We show that this behavior is caused by an apparently insignificant implementation strategy that makes the algorithm no longer minimize the original cost functional. We also demonstrate that this strategy improves the results of our "vanilla" algorithm. Finally, we present a study of image priors for blind deconvolution. We provide experimental evidence supporting the recent belief that a good image prior is one that leads to a good blur estimate rather than one that is a good natural image statistical model. By focusing on blur estimation alone, we show that good blur estimates can be obtained even when using images quite different from the true sharp image. This allows using image priors, such as those leading to "cartooned" images, that avoid the no-blur solution. By using an image prior that produces "cartooned" images, we achieve state-of-the-art results on different publicly available datasets. We therefore suggest a shift of paradigm in blind deconvolution: from modeling natural image statistics to modeling cartooned image statistics.
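
For reference, the classical variational formulation this line of work builds on can be written as follows (a standard total-variation formulation, not necessarily the exact functional analyzed in the thesis):

```latex
\min_{x,\,k}\; \|k \ast x - y\|_2^2 + \lambda \,\|\nabla x\|_1
\quad \text{s.t.}\quad k \ge 0,\;\; \textstyle\sum_i k_i = 1,
```

where $y$ is the observed blurry image, $x$ the latent sharp image, $k$ the blur kernel, and the total-variation term the image prior; the degenerate no-blur solution is the pair $(y, \delta)$ with $\delta$ the Dirac kernel.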

New Perspectives on Uncalibrated Photometric Stereo

Thoma Papadhimitri · June 2014.

This thesis investigates the problem of 3D reconstruction of a scene from 2D images. In particular, we focus on photometric stereo, a technique that computes the 3D geometry from at least three images taken from the same viewpoint under different illumination conditions. When the illumination is unknown (uncalibrated photometric stereo), the problem is ambiguous: different combinations of geometry and illumination can generate the same images. First, we solve the ambiguity by exploiting Lambertian reflectance maxima; these are points on curved surfaces where the normals are parallel to the light direction. We then propose a solution that can be computed in closed form and thus very efficiently. Our algorithm is also very robust and always yields the same estimate regardless of the initial ambiguity. We validate our method on real-world experiments and achieve state-of-the-art results. In this thesis we also solve, for the first time, the uncalibrated photometric stereo problem under the perspective projection model. We show that, unlike in the orthographic case, one can uniquely reconstruct the normals of the object and the lights given only the input images and the camera calibration (focal length and image center). We also propose a very efficient algorithm, which we validate on synthetic and real-world experiments, and show that the proposed technique is a generalization of the orthographic case. Finally, we investigate the uncalibrated photometric stereo problem in the case where the lights are distributed near the scene. Here we propose an alternating minimization technique that converges quickly and overcomes the limitations of prior work assuming distant illumination. We show experimentally that adopting a near-light model for real-world scenes yields very accurate reconstructions.
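
For context, the calibrated Lambertian case that the thesis generalizes can be solved per pixel by least squares; a minimal NumPy sketch (the uncalibrated setting studied here is harder precisely because the light matrix L is unknown):

```python
import numpy as np

def photometric_stereo(I, L):
    # Classic calibrated Lambertian photometric stereo.
    # I: (k, n) intensities of n pixels under k >= 3 lights.
    # L: (k, 3) known light directions. Solves I = L @ (rho * N) per pixel.
    G, *_ = np.linalg.lstsq(L, I, rcond=None)   # (3, n) scaled normals
    rho = np.linalg.norm(G, axis=0)             # albedo per pixel
    N = G / np.clip(rho, 1e-8, None)            # unit surface normals
    return N, rho
```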

Towards Few-Annotation Learning in Computer Vision: Application to Image Classification and Object Detection Tasks

In this thesis, we develop theoretical, algorithmic, and experimental contributions for Machine Learning with limited labels, and more specifically for the tasks of Image Classification and Object Detection in Computer Vision. In a first contribution, we are interested in bridging the gap between theory and practice for popular Meta-Learning algorithms used in Few-Shot Classification. We make connections to Multi-Task Representation Learning, which benefits from solid theoretical foundations, to verify the best conditions for more efficient meta-learning. Then, to leverage unlabeled data when training object detectors based on the Transformer architecture, we propose an unsupervised pretraining method and a semi-supervised learning method in two further contributions. For pretraining, we improve Contrastive Learning for object detectors by introducing localization information. Finally, our semi-supervised method is the first tailored to transformer-based detectors.
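
The contrastive pretraining ingredient can be illustrated with a standard InfoNCE loss; a sketch under the assumption of paired global features (the detection-oriented variant described above would instead contrast localized, per-region features):

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.07):
    # Standard InfoNCE: the i-th query should match the i-th key
    # against all other keys in the batch.
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)
```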

Submission history

Access paper:.

  • Other Formats

References & Citations

  • Google Scholar
  • Semantic Scholar

BibTeX formatted citation

BibSonomy logo

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

  • Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

Subscribe to the PwC Newsletter

Join the community, computer vision, semantic segmentation.

computer vision thesis project

Tumor Segmentation

computer vision thesis project

Panoptic Segmentation

computer vision thesis project

3D Semantic Segmentation

computer vision thesis project

Weakly-Supervised Semantic Segmentation

Representation learning.

computer vision thesis project

Disentanglement

Graph representation learning, sentence embeddings.

computer vision thesis project

Network Embedding

Classification.

computer vision thesis project

Text Classification

computer vision thesis project

Graph Classification

computer vision thesis project

Audio Classification

computer vision thesis project

Medical Image Classification

Object detection.

computer vision thesis project

3D Object Detection

computer vision thesis project

Real-Time Object Detection

computer vision thesis project

RGB Salient Object Detection

computer vision thesis project

Few-Shot Object Detection

Image classification.

computer vision thesis project

Out of Distribution (OOD) Detection

computer vision thesis project

Few-Shot Image Classification

computer vision thesis project

Fine-Grained Image Classification

computer vision thesis project

Semi-Supervised Image Classification

2d object detection.

computer vision thesis project

Edge Detection

Thermal image segmentation.

computer vision thesis project

Open Vocabulary Object Detection

Reinforcement learning (rl), off-policy evaluation, multi-objective reinforcement learning, 3d point cloud reinforcement learning, deep hashing, table retrieval, domain adaptation.

computer vision thesis project

Unsupervised Domain Adaptation

computer vision thesis project

Domain Generalization

computer vision thesis project

Test-time Adaptation

Source-free domain adaptation, image generation.

computer vision thesis project

Image-to-Image Translation

computer vision thesis project

Text-to-Image Generation

computer vision thesis project

Image Inpainting

computer vision thesis project

Conditional Image Generation

Data augmentation.

computer vision thesis project

Image Augmentation

computer vision thesis project

Text Augmentation

Autonomous vehicles.

computer vision thesis project

Autonomous Driving

computer vision thesis project

Self-Driving Cars

computer vision thesis project

Simultaneous Localization and Mapping

computer vision thesis project

Autonomous Navigation

computer vision thesis project

Image Denoising

computer vision thesis project

Color Image Denoising

computer vision thesis project

Sar Image Despeckling

Grayscale image denoising, meta-learning.

computer vision thesis project

Few-Shot Learning

computer vision thesis project

Sample Probing

Universal meta-learning, contrastive learning.

computer vision thesis project

Super-Resolution

computer vision thesis project

Image Super-Resolution

computer vision thesis project

Video Super-Resolution

computer vision thesis project

Multi-Frame Super-Resolution

computer vision thesis project

Reference-based Super-Resolution

Pose estimation.

computer vision thesis project

3D Human Pose Estimation

computer vision thesis project

Keypoint Detection

computer vision thesis project

3D Pose Estimation

computer vision thesis project

6D Pose Estimation

Self-supervised learning.

computer vision thesis project

Point Cloud Pre-training

Unsupervised video clustering, 2d semantic segmentation, image segmentation, text style transfer.

computer vision thesis project

Scene Parsing

computer vision thesis project

Reflection Removal

Visual question answering (vqa).

computer vision thesis project

Visual Question Answering

computer vision thesis project

Machine Reading Comprehension

computer vision thesis project

Chart Question Answering

computer vision thesis project

Embodied Question Answering

computer vision thesis project

Depth Estimation

computer vision thesis project

3D Reconstruction

computer vision thesis project

Neural Rendering

computer vision thesis project

3D Face Reconstruction

Sentiment analysis.

computer vision thesis project

Aspect-Based Sentiment Analysis (ABSA)

computer vision thesis project

Multimodal Sentiment Analysis

computer vision thesis project

Aspect Sentiment Triplet Extraction

computer vision thesis project

Twitter Sentiment Analysis

Anomaly detection.

computer vision thesis project

Unsupervised Anomaly Detection

computer vision thesis project

One-Class Classification

Supervised anomaly detection, anomaly detection in surveillance videos.

computer vision thesis project

Temporal Action Localization

computer vision thesis project

Video Understanding

Video generation.

computer vision thesis project

Video Object Segmentation

computer vision thesis project

Action Classification

3d object super-resolution, activity recognition.

computer vision thesis project

Action Recognition

computer vision thesis project

Human Activity Recognition

Egocentric activity recognition.

computer vision thesis project

Group Activity Recognition

computer vision thesis project

One-Shot Learning

computer vision thesis project

Few-Shot Semantic Segmentation

Cross-domain few-shot.

computer vision thesis project

Unsupervised Few-Shot Learning

Medical image segmentation.

computer vision thesis project

Lesion Segmentation

computer vision thesis project

Brain Tumor Segmentation

computer vision thesis project

Cell Segmentation

Skin lesion segmentation, monocular depth estimation.

computer vision thesis project

Stereo Depth Estimation

Depth and camera motion.

computer vision thesis project

3D Depth Estimation

Exposure fairness, optical character recognition (ocr).

computer vision thesis project

Active Learning

computer vision thesis project

Handwriting Recognition

Handwritten digit recognition, irregular text recognition, instance segmentation.

computer vision thesis project

Referring Expression Segmentation

computer vision thesis project

3D Instance Segmentation

computer vision thesis project

Real-time Instance Segmentation

computer vision thesis project

Unsupervised Object Segmentation

Facial recognition and modelling.

computer vision thesis project

Face Recognition

computer vision thesis project

Face Swapping

computer vision thesis project

Face Detection

computer vision thesis project

Facial Expression Recognition (FER)

computer vision thesis project

Face Verification

Object tracking.

computer vision thesis project

Multi-Object Tracking

computer vision thesis project

Visual Object Tracking

computer vision thesis project

Multiple Object Tracking

computer vision thesis project

Cell Tracking

Zero-shot learning.

computer vision thesis project

Generalized Zero-Shot Learning

computer vision thesis project

Compositional Zero-Shot Learning

Multi-label zero-shot learning, quantization, data free quantization, unet quantization, continual learning.

computer vision thesis project

Class Incremental Learning

Continual named entity recognition, unsupervised class-incremental learning.

computer vision thesis project

Action Recognition In Videos

computer vision thesis project

3D Action Recognition

Self-supervised action recognition, few shot action recognition.

computer vision thesis project

Scene Understanding

computer vision thesis project

Scene Text Recognition

computer vision thesis project

Scene Graph Generation

computer vision thesis project

Scene Recognition

Adversarial attack.

computer vision thesis project

Backdoor Attack

computer vision thesis project

Adversarial Text

Adversarial attack detection, real-world adversarial attack, active object detection, image retrieval.

computer vision thesis project

Sketch-Based Image Retrieval

computer vision thesis project

Content-Based Image Retrieval

computer vision thesis project

Composed Image Retrieval (CoIR)

computer vision thesis project

Medical Image Retrieval

Dimensionality reduction.

computer vision thesis project

Supervised dimensionality reduction

Online nonnegative cp decomposition, emotion recognition.

computer vision thesis project

Speech Emotion Recognition

computer vision thesis project

Emotion Recognition in Conversation

computer vision thesis project

Multimodal Emotion Recognition

Emotion-cause pair extraction.

computer vision thesis project

Monocular 3D Object Detection

computer vision thesis project

3D Object Detection From Stereo Images

computer vision thesis project

Multiview Detection

Robust 3d object detection, image reconstruction.

computer vision thesis project

MRI Reconstruction

computer vision thesis project

Film Removal

Style transfer.

computer vision thesis project

Image Stylization

Font style transfer, style generalization, face transfer, optical flow estimation.

computer vision thesis project

Video Stabilization

Image captioning.

computer vision thesis project

3D dense captioning

Controllable image captioning, aesthetic image captioning.

computer vision thesis project

Relational Captioning

Action localization.

computer vision thesis project

Action Segmentation

Spatio-temporal action localization, person re-identification.

computer vision thesis project

Unsupervised Person Re-Identification

Video-based person re-identification, generalizable person re-identification, cloth-changing person re-identification, image restoration.

computer vision thesis project

Demosaicking

Spectral reconstruction, underwater image restoration.

computer vision thesis project

JPEG Artifact Correction

Visual relationship detection, lighting estimation.

computer vision thesis project

3D Room Layouts From A Single RGB Panorama

Road scene understanding, action detection.

computer vision thesis project

Skeleton Based Action Recognition

computer vision thesis project

Online Action Detection

Audio-visual active speaker detection, metric learning.

computer vision thesis project

Object Recognition

computer vision thesis project

3D Object Recognition

Continuous object recognition.

computer vision thesis project

Depiction Invariant Object Recognition

computer vision thesis project

Monocular 3D Human Pose Estimation

Pose prediction.

computer vision thesis project

3D Multi-Person Pose Estimation

3d human pose and shape estimation, image enhancement.

computer vision thesis project

Low-Light Image Enhancement

Image relighting, de-aliasing, multi-label classification.

computer vision thesis project

  • Missing Labels
  • Extreme multi-label classification, hierarchical multi-label classification, medical code prediction, continuous control
  • Steering Control
  • Drone controller
  • Semi-Supervised Video Object Segmentation
  • Unsupervised Video Object Segmentation
  • Referring Video Object Segmentation
  • Video Salient Object Detection
  • 3D face modelling
  • Trajectory Prediction
  • Trajectory Forecasting
  • Human motion prediction, out-of-sight trajectory prediction
  • Multivariate Time Series Imputation
  • Image quality assessment, no-reference image quality assessment, blind image quality assessment
  • Aesthetics Quality Assessment
  • Stereoscopic image quality assessment, object localization
  • Weakly-Supervised Object Localization
  • Image-based localization, unsupervised object localization, monocular 3D object localization, novel view synthesis
  • Novel LiDAR View Synthesis
  • Ground video synthesis from satellite image
  • Blind Image Deblurring
  • Single-image blind deblurring, out-of-distribution detection, video semantic segmentation
  • Camera shot segmentation
  • Cloud removal
  • Facial Inpainting
  • Fine-Grained Image Inpainting
  • Instruction following, visual instruction following, change detection
  • Semi-supervised Change Detection
  • Saliency detection
  • Saliency Prediction
  • Co-Salient Object Detection
  • Video saliency detection, unsupervised saliency detection, image compression
  • Feature Compression
  • JPEG compression artifact reduction
  • Lossy-Compression Artifact Reduction
  • Color image compression artifact reduction, explainable artificial intelligence, explainable models, explanation fidelity evaluation, FAD curve analysis, prompt engineering
  • Visual Prompting
  • Image registration
  • Unsupervised Image Registration
  • Ensemble learning, visual reasoning
  • Visual Commonsense Reasoning
  • Salient object detection, saliency ranking, visual tracking
  • Point Tracking
  • RGB-T tracking, real-time visual tracking
  • RF-based Visual Tracking
  • 3D point cloud classification
  • 3D Object Classification
  • Few-Shot 3D Point Cloud Classification
  • Supervised-only 3D point cloud classification, zero-shot transfer 3D point cloud classification, motion estimation, 2D classification
  • Neural Network Compression
  • Music Source Separation
  • Cell detection
  • Plant Phenotyping
  • Open-set classification, image manipulation detection
  • Zero Shot Skeletal Action Recognition
  • Generalized zero shot skeletal action recognition, whole slide images, activity prediction, motion prediction, cyber attack detection, sequential skip prediction, gesture recognition
  • Hand Gesture Recognition
  • Hand-Gesture Recognition
  • RF-based Gesture Recognition
  • Video captioning
  • Dense Video Captioning
  • Boundary captioning, visual text correction, audio-visual video captioning, video question answering
  • Zero-Shot Video Question Answer
  • Few-shot video question answering
  • Robust 3D Semantic Segmentation
  • Real-Time 3D Semantic Segmentation
  • Unsupervised 3D Semantic Segmentation
  • Furniture segmentation, point cloud registration
  • Image to Point Cloud Registration
  • Text detection, medical diagnosis
  • Alzheimer's Disease Detection
  • Retinal OCT Disease Classification
  • Blood cell count, thoracic disease classification, 3D point cloud interpolation, visual grounding
  • Person-centric Visual Grounding
  • Phrase Extraction and Grounding (PEG)
  • Visual odometry
  • Face Anti-Spoofing
  • Monocular visual odometry
  • Hand Pose Estimation
  • Hand Segmentation
  • Gesture-to-gesture translation, rain removal
  • Single Image Deraining
  • Image clustering
  • Online Clustering
  • Face Clustering
  • Multi-view subspace clustering, multi-modal subspace clustering
  • Image Dehazing
  • Single Image Dehazing
  • Colorization
  • Line Art Colorization
  • Point-interactive Image Colorization
  • Color Mismatch Correction
  • Robot navigation
  • PointGoal Navigation
  • Social navigation
  • Sequential Place Learning
  • Image manipulation, conformal prediction
  • Unsupervised Image-To-Image Translation
  • Synthetic-to-Real Translation
  • Multimodal Unsupervised Image-To-Image Translation
  • Cross-View Image-to-Image Translation
  • Fundus to Angiography Generation
  • Visual place recognition
  • Indoor Localization
  • 3D place recognition, image editing, rolling shutter correction, shadow removal, multimodal-guided image editing, joint deblur and frame interpolation, multimodal fashion image editing, visual localization
  • DeepFake Detection
  • Synthetic Speech Detection
  • Human detection of deepfakes, multimodal forgery detection, stereo matching, object reconstruction
  • 3D Object Reconstruction
  • Crowd Counting
  • Visual Crowd Analysis
  • Group detection in crowds, human-object interaction detection
  • Affordance Recognition
  • Image deblurring, low-light image deblurring and enhancement, earth observation, video quality assessment, video alignment, temporal sentence grounding, long-video activity recognition, point cloud classification, jet tagging, few-shot point cloud classification, image matching
  • Semantic correspondence
  • Patch matching, set matching
  • Matching Disparate Images
  • Hyperspectral
  • Hyperspectral Image Classification
  • Hyperspectral unmixing, hyperspectral image segmentation, classification of hyperspectral images, document text classification
  • Learning with noisy labels
  • Multi-label classification of biomedical texts, political salient issue orientation detection, 3D point cloud reconstruction
  • Weakly Supervised Action Localization
  • Weakly-supervised temporal action localization
  • Temporal Action Proposal Generation
  • Activity recognition in videos, scene classification
  • 2D Human Pose Estimation
  • Action anticipation
  • 3D Face Animation
  • Semi-supervised human pose estimation, point cloud generation, point cloud completion, referring expression, reconstruction, 3D human reconstruction
  • Single-View 3D Reconstruction
  • 4D reconstruction, single-image-based HDR reconstruction, compressive sensing, keyword spotting
  • Small-Footprint Keyword Spotting
  • Visual keyword spotting, scene text detection
  • Curved Text Detection
  • Multi-oriented scene text detection, boundary detection
  • Junction Detection
  • Camera calibration, image matting
  • Semantic Image Matting
  • Video retrieval, video-text retrieval, video grounding, video-adverb retrieval, replay grounding, composed video retrieval (CoVR), motion synthesis
  • Motion Style Transfer
  • Temporal human motion composition, emotion classification
  • Video Summarization
  • Unsupervised Video Summarization
  • Supervised video summarization, document AI, document understanding, sensor fusion, superpixels, point cloud segmentation, remote sensing
  • Remote Sensing Image Classification
  • Change detection for remote sensing images, building change detection for remote sensing images
  • Segmentation Of Remote Sensing Imagery
  • Semantic Segmentation Of Remote Sensing Imagery
  • Few-Shot Transfer Learning for Saliency Prediction
  • Aerial Video Saliency Prediction
  • Document layout analysis
  • 3D Anomaly Detection
  • Video anomaly detection, artifact detection
  • Point cloud reconstruction
  • 3D Semantic Scene Completion
  • 3D Semantic Scene Completion from a single RGB image
  • Garment reconstruction, face generation
  • Talking Head Generation
  • Talking face generation
  • Face Age Editing
  • Facial expression generation, kinship face generation, cross-modal retrieval, image-text matching, multilingual cross-modal retrieval
  • Zero-shot Composed Person Retrieval
  • Cross-modal retrieval on RSITMD, video instance segmentation
  • Privacy Preserving Deep Learning
  • Membership inference attack, human detection
  • Generalized Few-Shot Semantic Segmentation
  • Virtual try-on, scene flow estimation
  • Self-supervised Scene Flow Estimation
  • 3D classification, depth completion
  • Motion Forecasting
  • Multi-Person Pose Forecasting
  • Multiple Object Forecasting
  • Video editing, video temporal consistency, face reconstruction, object discovery, CARLA map leaderboard, dead-reckoning prediction
  • Generalized Referring Expression Segmentation
  • Gaze estimation
  • Texture Synthesis
  • Text-based Image Editing
  • Text-guided image editing
  • Zero-Shot Text-to-Image Generation
  • Concept alignment, conditional text-to-image synthesis, machine unlearning, continual forgetting, sign language recognition
  • Image Recognition
  • Fine-grained image recognition, license plate recognition, material recognition, multi-view learning, incomplete multi-view clustering
  • Breast Cancer Detection
  • Skin cancer classification
  • Breast Cancer Histology Image Classification
  • Lung cancer diagnosis, classification of breast cancer histology images, gait recognition
  • Multiview Gait Recognition
  • Gait recognition in the wild, human parsing
  • Multi-Human Parsing
  • Pose tracking
  • 3D Human Pose Tracking
  • Interactive segmentation, scene generation
  • 3D Multi-Person Pose Estimation (absolute)
  • 3D Multi-Person Pose Estimation (root-relative)
  • 3D Multi-Person Mesh Recovery
  • Event-based vision
  • Event-based Optical Flow
  • Event-Based Video Reconstruction
  • Event-based motion estimation, disease prediction, disease trajectory forecasting, object counting, training-free object counting, open-vocabulary object counting, interest point detection, homography estimation
  • 3D Hand Pose Estimation
  • Weakly supervised segmentation, facial landmark detection
  • Unsupervised Facial Landmark Detection
  • 3D Facial Landmark Localization
  • 3D character animation from a single photo, scene segmentation
  • Dichotomous Image Segmentation
  • Activity detection, inverse rendering, temporal localization
  • Language-Based Temporal Localization
  • Temporal defect localization, multi-label image classification
  • Multi-label Image Recognition with Partial Labels
  • 3D object tracking
  • 3D Single Object Tracking
  • Template matching, text-to-video generation, text-to-video editing, subject-driven video generation, camera localization
  • Camera Relocalization
  • LiDAR semantic segmentation, visual dialog
  • Motion Segmentation
  • Relation network, intelligent surveillance
  • Vehicle Re-Identification
  • Text spotting
  • Disparity Estimation
  • Few-Shot Class-Incremental Learning
  • Class-incremental semantic segmentation, non-exemplar-based class incremental learning, handwritten text recognition, handwritten document recognition, unsupervised text recognition, knowledge distillation
  • Data-free Knowledge Distillation
  • Self-knowledge distillation, moment retrieval
  • Zero-shot Moment Retrieval
  • Text to video retrieval, partially relevant video retrieval, person search, decision making under uncertainty
  • Uncertainty Visualization
  • Semi-supervised object detection
  • Shadow Detection
  • Shadow Detection And Removal
  • Unconstrained Lip-synchronization
  • Mixed reality, video inpainting
  • Cross-corpus
  • Micro-expression recognition, micro-expression spotting
  • 3D Facial Expression Recognition
  • Smile Recognition
  • Future prediction, human mesh recovery, video enhancement
  • Face Image Quality Assessment
  • Lightweight face recognition
  • Age-Invariant Face Recognition
  • Synthetic face recognition, face quality assessment
  • 3D Multi-Object Tracking
  • Real-time multi-object tracking, multi-animal tracking with identification, trajectory long-tail distribution for multi-object tracking, grounded multiple object tracking, image categorization, fine-grained visual categorization, overlapped 10-1, overlapped 15-1, overlapped 15-5, disjoint 10-1, disjoint 15-1
  • Burst Image Super-Resolution
  • Stereo image super-resolution, satellite image super-resolution, multispectral image super-resolution, color constancy
  • Few-Shot Camera-Adaptive Color Constancy
  • HDR reconstruction, multi-exposure image fusion, open vocabulary semantic segmentation, zero-guidance segmentation, physics-informed machine learning, soil moisture estimation, deep attention, line detection, video reconstruction
  • Zero Shot Segmentation
  • Visual recognition
  • Fine-Grained Visual Recognition
  • Image cropping, sign language translation
  • Stereo Matching Hand
  • 3D Absolute Human Pose Estimation
  • Text-to-Face Generation
  • Image forensics, tone mapping, zero-shot action recognition, natural language transduction, video restoration
  • Analog Video Restoration
  • Novel class discovery
  • Transparent Object Detection
  • Transparent objects, surface normals estimation
  • Hand-object pose
  • Grasp Generation
  • 3D Canonical Hand Pose Estimation
  • Breast cancer histology image classification (20% labels), cross-domain few-shot learning, texture classification, vision-language navigation
  • Abnormal Event Detection In Video
  • Semi-supervised Anomaly Detection
  • Infrared and visible image fusion
  • Image Animation
  • Image to 3D
  • Probabilistic deep learning, unsupervised few-shot image classification, generalized few-shot classification, pedestrian attribute recognition
  • Steganalysis
  • Sketch Recognition
  • Face Sketch Synthesis
  • Drawing pictures
  • Photo-To-Caricature Translation
  • Spoof detection, face presentation attack detection, detecting image manipulation, cross-domain iris presentation attack detection, finger dorsal image spoof detection, computer vision techniques adopted in 3D cryogenic electron microscopy, single particle analysis, cryogenic electron tomography, highlight detection, iris recognition, pupil dilation, action quality assessment
  • One-shot visual object segmentation
  • Unbiased Scene Graph Generation
  • Panoptic Scene Graph Generation
  • Image to video generation
  • Unconditional Video Generation
  • Automatic post-editing
  • Dense Captioning
  • Image stitching
  • Multi-View 3D Reconstruction
  • Universal domain adaptation, action understanding, blind face restoration
  • Document Image Classification
  • Face Reenactment
  • Geometric Matching
  • Human action generation
  • Action Generation
  • Object categorization, person retrieval, text based person retrieval, surgical phase recognition, online surgical phase recognition, offline surgical phase recognition, human dynamics
  • 3D Human Dynamics
  • Meme classification, hateful meme classification, severity prediction, intubation support prediction, cloud detection
  • Text-To-Image
  • Story visualization, complex scene breaking and synthesis, diffusion personalization
  • Diffusion Personalization Tuning Free
  • Efficient Diffusion Personalization
  • Image fusion, pansharpening, image deconvolution
  • Image Outpainting
  • Object Segmentation
  • Camouflaged Object Segmentation
  • Landslide segmentation, text-line extraction, point clouds, point cloud video understanding, point cloud representation learning
  • Semantic SLAM
  • Object SLAM
  • Intrinsic image decomposition, line segment detection, table recognition, situation recognition, grounded situation recognition, motion detection, multi-target domain adaptation, sports analytics
  • Robot Pose Estimation
  • Camouflaged Object Segmentation with a Single Task-generic Prompt
  • Image morphing, image shadow removal, person identification, visual prompt tuning, weakly-supervised instance segmentation, image smoothing, fake image detection
  • GAN image forensics
  • Fake Image Attribution
  • Image steganography, rotated MNIST, contour detection
  • Face Image Quality
  • Lane detection
  • 3D Lane Detection
  • Layout design, license plate detection
  • Video Panoptic Segmentation
  • Viewpoint estimation
  • Drone navigation
  • Drone-view target localization, value prediction, body mass index (BMI) prediction, multi-object tracking and segmentation
  • Occlusion Handling
  • Zero-shot transfer image classification
  • 3D Object Reconstruction From A Single Image
  • CAD Reconstruction
  • 3D point cloud linear classification, crop classification, crop yield prediction, photo retouching, motion retargeting, shape representation of 3D point clouds, bird's-eye view semantic segmentation
  • Dense Pixel Correspondence Estimation
  • Human part segmentation
  • Multiview Learning
  • Person recognition
  • Document Shadow Removal
  • Symmetry detection, traffic sign detection, video style transfer, referring image matting
  • Referring Image Matting (Expression-based)
  • Referring Image Matting (Keyword-based)
  • Referring Image Matting (RefMatte-RW100)
  • Referring image matting (prompt-based), human interaction recognition, one-shot 3D action recognition, mutual gaze, affordance detection
  • Gaze Prediction
  • Image forgery detection, image instance retrieval, amodal instance segmentation, image quality estimation
  • Image Similarity Search
  • Precipitation Forecasting
  • Referring expression generation, road damage detection
  • Space-time Video Super-resolution
  • Video matting
  • Open-World Semi-Supervised Learning
  • Semi-supervised image classification (cold start), hand detection, material classification
  • Open Vocabulary Attribute Detection
  • Inverse tone mapping, image/document clustering, self-organized clustering, instance search
  • Audio Fingerprint
  • 3D shape modeling
  • Action Analysis
  • Facial editing
  • Food Recognition
  • Holdout Set
  • Motion magnification, semi-supervised instance segmentation, binary classification, LLM-generated text detection, cancer-no cancer per breast classification, cancer-no cancer per image classification, suspicious (BIRADS 4,5)-no suspicious (BIRADS 1,2,3) per image classification, cancer-no cancer per view classification, video segmentation, camera shot boundary detection, open-vocabulary video segmentation, open-world video segmentation, lung nodule classification, lung nodule 3D classification, lung nodule detection, lung nodule 3D detection, 3D scene reconstruction, art analysis
  • Zero-Shot Composed Image Retrieval (ZS-CIR)
  • Event segmentation, generic event boundary detection, image retouching, image-variation, JPEG artifact removal, multispectral object detection, point cloud super resolution, skills assessment
  • Sensor Modeling
  • Video prediction, earth surface forecasting, predict future video frames, ad-hoc video search, audio-visual synchronization, handwriting generation, pose retrieval, scanpath prediction, scene change detection
  • Sketch-to-Image Translation
  • Skills evaluation, synthetic image detection, highlight removal, 3D shape reconstruction from a single 2D image
  • Shape from Texture
  • Deception detection, deception detection in videos, handwriting verification, Bangla spelling error correction, 3D open-vocabulary instance segmentation
  • 3D Shape Representation
  • 3D Dense Shape Correspondence
  • Birds eye view object detection
  • Multiple People Tracking
  • Network Interpretation
  • RGB-D reconstruction, seeing beyond the visible, semi-supervised domain generalization, unsupervised semantic segmentation
  • Unsupervised Semantic Segmentation with Language-image Pre-training
  • Multiple object tracking with transformer
  • Multiple Object Track and Segmentation
  • Constrained lip-synchronization, face dubbing, Vietnamese visual question answering, explanatory visual question answering
  • Video Visual Relation Detection
  • Human-object relationship detection, 3D shape reconstruction, defocus blur detection, event data classification, image comprehension, image manipulation localization, instance shadow detection, kinship verification, medical image enhancement, open vocabulary panoptic segmentation, single-object discovery, training-free 3D point cloud classification, video forensics
  • Sequential Place Recognition
  • Autonomous flight (dense forest), autonomous web navigation
  • Generative 3D Object Classification
  • Cube engraving classification, multimodal machine translation
  • Face to Face Translation
  • Multimodal lexical translation, 10-shot image generation, 2D semantic segmentation task 3 (25 classes), document enhancement, 4D panoptic segmentation, action assessment, bokeh effect rendering, drivable area detection, face anonymization, font recognition, horizon line estimation, image imputation
  • Long Video Retrieval (Background Removed)
  • Medical image denoising
  • Occlusion Estimation
  • Physiological computing
  • Lake Ice Monitoring
  • Short-term object interaction anticipation, spatio-temporal video grounding, unsupervised 3D point cloud linear evaluation, wireframe parsing, single-image generation, unsupervised anomaly detection with specified settings -- 30% anomaly, root cause ranking, anomaly detection at 30% anomaly, anomaly detection at various anomaly percentages
  • Unsupervised Contextual Anomaly Detection
  • 2D pose estimation, category-agnostic pose estimation, overlapping pose estimation, facial expression recognition, cross-domain facial expression recognition, zero-shot facial expression recognition, landmark tracking, muscle tendon junction identification, 3D object captioning, animated GIF generation, generalized referring expression comprehension, image deblocking, infrared image super-resolution, motion disentanglement, persuasion strategies, scene text editing, traffic accident detection, accident anticipation, unsupervised landmark detection, visual speech recognition, lip to speech synthesis, continual anomaly detection, gaze redirection, weakly supervised action segmentation (transcript), weakly supervised action segmentation (action set), calving front delineation in synthetic aperture radar imagery, calving front delineation in synthetic aperture radar imagery with fixed training amount
  • Handwritten Line Segmentation
  • Handwritten word segmentation
  • General Action Video Anomaly Detection
  • Physical video anomaly detection, monocular cross-view road scene parsing (road), monocular cross-view road scene parsing (vehicle)
  • Transparent Object Depth Estimation
  • 3D semantic occupancy prediction, 3D scene editing, age and gender estimation, data ablation
  • Occluded Face Detection
  • Gait identification, historical color image dating, stochastic human motion prediction, image retargeting, image and video forgery detection, motion captioning, personality trait recognition, personalized segmentation, scene-aware dialogue, spatial relation recognition, spatial token mixer, steganographics, story continuation
  • Unsupervised Anomaly Detection with Specified Settings -- 0.1% anomaly
  • Unsupervised anomaly detection with specified settings -- 1% anomaly, unsupervised anomaly detection with specified settings -- 10% anomaly, unsupervised anomaly detection with specified settings -- 20% anomaly, vehicle speed estimation, visual analogies, visual social relationship recognition, zero-shot text-to-video generation, text-guided generation, video frame interpolation, 3D video frame interpolation, unsupervised video frame interpolation
  • eXtreme-Video-Frame-Interpolation
  • Continual semantic segmentation, overlapped 5-3, overlapped 25-25, evolving domain generalization, source-free domain generalization, micro-expression generation, micro-expression generation (MEGC2021), mistake detection, online mistake detection, period estimation, art period estimation (544 artists), unsupervised panoptic segmentation, unsupervised zero-shot panoptic segmentation, 3D rotation estimation, camera auto-calibration, defocus estimation, derendering, fingertip detection, hierarchical text segmentation, human-object interaction concept discovery
  • One-Shot Face Stylization
  • Speaker-specific lip to speech synthesis, multi-person pose estimation, neural stylization
  • Part-aware Panoptic Segmentation
  • Population Mapping
  • Pornography detection, prediction of occupancy grid maps, raw reconstruction, repetitive action counting, SVBRDF estimation, semi-supervised video classification, spectrum cartography, supervised image retrieval, synthetic image attribution, training-free 3D part segmentation, unsupervised image decomposition, video propagation, Vietnamese multimodal learning, weakly supervised 3D point cloud segmentation, weakly-supervised panoptic segmentation, drone-based object tracking, brain visual reconstruction, brain visual reconstruction from fMRI
  • Human-Object Interaction Generation
  • Image-guided composition, fashion understanding, semi-supervised fashion compatibility
  • Intensity image denoising
  • Lifetime image denoising, observation completion, active observation completion, boundary grounding
  • Video Narrative Grounding
  • 3D inpainting, 3D scene graph alignment, 4D spatio-temporal semantic segmentation
  • Age Estimation
  • Few-shot Age Estimation
  • BRDF estimation, camouflage segmentation, clothing attribute recognition, damaged building detection, depth image estimation, detecting shadows, dynamic texture recognition
  • Disguised Face Verification
  • Few-shot open set object detection, gaze target estimation, generalized zero-shot learning - unseen, HD semantic map learning, human-object interaction anticipation, image deep networks, keypoint detection and image matching, manufacturing quality control, materials imaging, micro-gesture recognition, multi-person pose estimation and tracking
  • Multi-modal image segmentation
  • Multi-object discovery, neural radiance caching
  • Parking Space Occupancy
  • Partial Video Copy Detection
  • Multimodal Patch Matching
  • Perpetual view generation, procedure learning, prompt-driven zero-shot domain adaptation, single-shot HDR reconstruction, on-the-fly sketch based image retrieval, thermal image denoising, trademark retrieval, unsupervised instance segmentation, unsupervised zero-shot instance segmentation, vehicle key-point and orientation estimation
  • Video Individual Counting
  • Video-adverb retrieval (unseen compositions), video-to-image affordance grounding
  • Vietnamese Scene Text
  • Visual sentiment prediction, human-scene contact detection, localization in video forgery, 3D canonicalization, 3D surface generation
  • Visibility Estimation from Point Cloud
  • Amodal layout estimation, blink estimation, camera absolute pose regression, change data generation, constrained diffeomorphic image registration, continuous affect estimation, deep feature inversion, document image skew estimation, earthquake prediction, fashion compatibility learning
  • Displaced People Recognition
  • Finger vein recognition, flooded building segmentation
  • Future Hand Prediction
  • Generative temporal nursing, grounded multimodal named entity recognition, house generation, human fMRI response prediction, hurricane forecasting, IFC entity classification, image declipping, image similarity detection
  • Image Text Removal
  • Image-to-GPS verification
  • Image-based Automatic Meter Reading
  • Dial meter reading, indoor scene reconstruction, JPEG decompression
  • Kiss Detection
  • Laminar-turbulent flow localisation
  • Landmark Recognition
  • Brain landmark detection, corpus video moment retrieval, MLLM evaluation: aesthetics, medical image deblurring, mental workload estimation, meter reading, motion expressions guided video segmentation, natural image orientation angle detection, multi-object colocalization, multilingual text-to-image generation, video emotion detection, NWP post-processing, occluded 3D object symmetry detection, open set video captioning, PSO-ConvNets dynamics 1, PSO-ConvNets dynamics 2, partial point cloud matching
  • Partially View-aligned Multi-view Learning
  • Pedestrian Detection
  • Thermal Infrared Pedestrian Detection
  • Personality trait recognition by face, physical attribute prediction, point cloud semantic completion, point cloud classification dataset, point-of-no-return (PNR) temporal localization, pose contrastive learning, portrait generation, prostate zones segmentation, pulmonary vessel segmentation, pulmonary artery-vein classification, reference expression generation, safety perception recognition, jersey number recognition, interspecies facial keypoint transfer, specular reflection mitigation, specular segmentation, state change object detection, surface normals estimation from point clouds, train ego-path detection
  • Transform A Video Into A Comics
  • Transparency separation, typeface completion
  • Unbalanced Segmentation
  • Unsupervised Long Term Person Re-Identification
  • Video correspondence flow
  • Key-Frame-based Video Super-Resolution (K = 15)
  • Zero-shot single object tracking, yield mapping in apple orchards, LiDAR absolute pose regression, OPD: single-view 3D openable part detection, self-supervised scene text recognition, spatial-aware image editing, video narration captioning, spectral estimation, spectral estimation from a single RGB image, 3D prostate segmentation, aggregate xView3 metric, atomic action recognition, composite action recognition, calving front delineation from synthetic aperture radar imagery, computer vision transduction, crosslingual text-to-image generation, zero-shot dense video captioning, document to image conversion, frame duplication detection, geometrical view, HYPERVIEW challenge
  • Image Operation Chain Detection
  • Kinematic based workflow recognition, logo recognition
  • MLLM Aesthetic Evaluation
  • Motion detection in non-stationary scenes, open-set video tagging, satellite orbit determination
  • Segmentation Based Workflow Recognition
  • 2D particle picking, small object detection
  • Rice Grain Disease Detection
  • Sperm morphology classification, video & kinematic based workflow recognition, video based workflow recognition, video, kinematic & segmentation based workflow recognition, animal pose estimation


University of New Hampshire Scholars' Repository

Master's Theses and Capstones

The Application of Computer Vision, Machine and Deep Learning Algorithms Utilizing MATLAB

Andrea Linda Murphy, University of New Hampshire, Durham

Date of Award: Spring 2020
Project Type
Program or Major: Information Technology
Degree Name: Master of Science
First Advisor: Mihaela Sabin
Second Advisor
Third Advisor: Jeremiah Johnson

MATLAB is a multi-paradigm proprietary programming language and numerical computing environment developed by MathWorks. Within MATLAB Integrated Development Environment (IDE) you can perform Computer-aided design (CAD), different matrix manipulations, plotting of functions and data, implementation algorithms, creation of user interfaces, and has the ability to interface with programs written in other languages1. Since, its launch in 1984 MATLAB software has not particularly been associated within the field of data science. In 2013, that changed with the launch of their new data science concentrated toolboxes that included Deep Learning, Image Processing, Computer Vision, and then a year later Statistics and Machine Learning.

The main objective of my thesis was to research and explore the field of data science, more specifically the development of an object recognition application that could be built entirely in the MATLAB IDE and have a positive social impact on the deaf community, and, in doing so, to answer the question: could MATLAB be utilized to develop this type of application? To answer this question while addressing my main objectives, I constructed two different object recognition protocols using MATLAB_R2019 with the add-on data science tool packages. I named the protocols ASLtranslate (I) and (II). This allowed me to experiment with all of MATLAB's data science toolboxes while learning the differences, benefits, and disadvantages of using multiple approaches to the same problem.

The methods and approaches for the design of both versions were very similar. ASLtranslate takes a 2D image of American Sign Language (ASL) hand gestures as input, classifies the image, and then outputs its corresponding alphabet character. ASLtranslate (I) was an implementation of image category classification using machine learning methods. ASLtranslate (II) was implemented using a deep learning method called transfer learning, done by fine-tuning a pre-trained convolutional neural network (CNN), AlexNet, to perform classification on a new collection of images.
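
The transfer-learning recipe described above is straightforward to reproduce outside MATLAB as well. Below is a minimal PyTorch sketch of the same idea, fine-tuning a pre-trained AlexNet on a folder of ASL alphabet images; the directory layout (asl/train with one subfolder per letter), the hyperparameters, and the class count are illustrative assumptions, not details taken from the thesis.

    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    # Standard ImageNet preprocessing; AlexNet expects 224x224 RGB input.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Hypothetical dataset layout: asl/train/<letter>/*.png (one folder per class).
    train_set = datasets.ImageFolder("asl/train", transform=preprocess)
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    # Load AlexNet pre-trained on ImageNet and freeze the convolutional features.
    model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
    for p in model.features.parameters():
        p.requires_grad = False

    # Replace the final fully connected layer with one sized for the ASL classes.
    num_classes = len(train_set.classes)
    model.classifier[6] = nn.Linear(4096, num_classes)

    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(5):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

Freezing the feature extractor and training only the replaced classifier head is the smallest-change version of fine-tuning; unfreezing later layers is a common next step when more labeled data is available.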

Recommended Citation

Murphy, Andrea Linda, "THE APPLICATION OF COMPUTER VISION, MACHINE AND DEEP LEARNING ALGORITHMS UTILIZING MATLAB" (2020). Master's Theses and Capstones, 1346. https://scholars.unh.edu/thesis/1346



If you are a student at UHH interested in a thesis with our group, you can contact Dr. Christian Wilms (B.Sc. theses) or Dr. Ehsan Yaghoubi (M.Sc. theses). Below is a list of selected titles of B.Sc. and M.Sc. theses completed in our group, to orient you towards potential topics. You can also look at our recent papers for possible thesis directions.

Selection of titles of completed Master's theses:

  • A Deep Learning Approach for Top-down Attention with Attribute Preference
  • Salient object detection with AttentionMask
  • 3D Segmentation in the Context of Inscriptions
  • Active Visual Object Search Using Reinforcement Learning
  • Saliency-Guided Sign Language Recognition
  • Object Discovery in 3D Scenes via Shape Analysis using Adapted PCLV
  • Learning Efficient Deep Feature Representations for Indoor Visual Positioning

Selection of titles of completed Bachelor's theses:

  • Weakly Supervised Object Detection in RoboCup Scenarios
  • Localization of Aircraft Tail Assemblies
  • IoU Predictions for Segmentation Mask Proposals
  • Classification of Malignant Melanomas Using Conditioned ConvNets
  • Segmentation of Rail Images for Automated Maintenance
  • Image Classification of Aircraft Types
  • Segmentation of numerical weather prediction data for characterization of atmospheric airmasses
  • Object Detection in Remote Sensing Image Data Using AttentionMask
  • TileAttention: Detection of Very Small Objects
  • Superpixel Pooling for Instance Segmentation

Necessary requirements: You should have some prior knowledge of computer vision before starting a thesis. For a B.Sc. thesis you should at least have attended the lecture "Einführung in die Bildverarbeitung" or the lab course "Praktikum Computer Vision", or have equivalent knowledge. For an M.Sc. thesis, you should have attended the lectures "Computer Vision 1" and "Computer Vision 2", and ideally also the master project, or have equivalent knowledge. Prior knowledge of machine learning is also helpful.


Dissertations / Theses on the topic 'Computer Science. Computer vision'


Consult the top 50 dissertations / theses for your research on the topic 'Computer Science. Computer vision.'


Purdy, Eric. "Grammatical methods in computer vision." Thesis, The University of Chicago, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3557428.

In computer vision, grammatical models are models that represent objects hierarchically as compositions of sub-objects. This allows us to specify rich object models in a standard Bayesian probabilistic framework. In this thesis, we formulate shape grammars, a probabilistic model of curve formation that allows for both continuous variation and structural variation. We derive an EM-based training algorithm for shape grammars. We demonstrate the effectiveness of shape grammars for modeling human silhouettes, and also demonstrate their effectiveness in classifying curves by shape. We also give a general method for heuristically speeding up a large class of dynamic programming algorithms. We provide a general framework for discussing coarse-to-fine search strategies, and provide proofs of correctness. Our method can also be used with inadmissible heuristics.

Finally, we give an algorithm for doing approximate context-free parsing of long strings in linear time. We define a notion of approximate parsing in terms of restricted families of decompositions, and construct small families which can approximate arbitrary parses.

Viloria, John A. (John Alexander) 1978. "Optimizing clustering algorithms for computer vision." Thesis, Massachusetts Institute of Technology, 2001. http://hdl.handle.net/1721.1/86847.

Zhan, Beibei. "Learning crowd dynamics using computer vision." Thesis, Kingston University, 2008. http://eprints.kingston.ac.uk/20302/.

Lee, Stefan. "Data-driven computer vision for science and the humanities." Thesis, Indiana University, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10153534.

The rate at which humanity is producing visual data from both large-scale scientific imaging and consumer photography has been greatly accelerating in the past decade. This thesis is motivated by the hypothesis that this trend will necessarily change the face of observational science and the humanities, requiring the development of automated methods capable of distilling vast image collections to produce meaningful analyses. Such methods are needed to empower novel science both by improving throughput in traditionally quantitative disciplines and by developing new techniques to study culture through large scale image datasets.

When computer vision or machine learning in general is leveraged to aid academic inquiry, it is important to consider the impact of erroneous solutions produced by implicit ambiguity or model approximations. To that end, we argue for the importance of algorithms that are capable of generating multiple solutions and producing measures of confidence. In addition to providing solutions to a number of multi-disciplinary problems, this thesis develops techniques to address these overarching themes of confidence estimation and solution diversity.

This thesis investigates a diverse set of problems across a broad range of studies including glaciology, developmental psychology, architectural history, and demography to develop and adapt computer vision algorithms to solve these domain-specific applications. We begin by proposing vision techniques for automatically analyzing aerial radar imagery of polar ice sheets while simultaneously providing glaciologists with point-wise estimates of solution confidence. We then move to psychology, introducing novel recognition techniques to produce robust hand localizations and segmentations in egocentric video to empower psychologists studying child development with automated annotations of grasping behaviors integral to learning. We then investigate novel large-scale analysis for architectural history, leveraging tens of thousands of publicly available images to identify and track distinctive architectural elements. Finally, we show how rich estimates of demographic and geographic properties can be predicted from a single photograph.
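
As a concrete illustration of the "multiple solutions with confidence" theme, the sketch below trains a small bagged ensemble and reports, for each test input, the top-ranked labels together with an agreement-based confidence score. It is a generic sketch on synthetic data, not the models used in the thesis; the dataset, ensemble size, and ranking depth are all illustrative choices.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in data; the thesis problems (radar layers, hand masks, ...)
    # would plug in their own features and labels here.
    X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                               n_informative=6, random_state=0)
    X_train, X_test = X[:500], X[500:]
    y_train = y[:500]

    # A bagged ensemble: each member sees a different bootstrap sample, so
    # disagreement between members signals ambiguity in the input.
    ensemble = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                                 n_estimators=25,
                                 random_state=0).fit(X_train, y_train)

    # Average the member probabilities, then emit ranked candidate solutions
    # with the mean probability serving as a confidence measure.
    probs = ensemble.predict_proba(X_test)
    for i in range(3):  # first few test points
        ranked = np.argsort(probs[i])[::-1]
        candidates = [(int(c), float(probs[i][c])) for c in ranked[:2]]
        print(f"input {i}: candidate labels and confidences: {candidates}")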

Wakefield, Jonathan P. "A framework for generic computer vision." Thesis, University of Huddersfield, 1994. http://eprints.hud.ac.uk/id/eprint/4003/.

Herrera, Acuna Raul. "Advanced computer vision-based human computer interaction for entertainment and software development." Thesis, Kingston University, 2014. http://eprints.kingston.ac.uk/29884/.

Matuszewski, Damian Janusz. "Computer vision for continuous plankton monitoring." Universidade de São Paulo, 2014. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-24042014-150825/.

Raufdeen, Ramzi A. "SE4S toolkit extension project: vision diagramming tool: build your vision." Thesis, California State University, Long Beach, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10147325.

Sustainability is an important topic when developing software because it helps develop eco-friendly programs. Software can contribute towards sustainability by supporting sustainable goals, which can be supported efficiently if requirements engineers consider them early on in a project. This project helps requirements engineers make that sustainable contribution through the development of the SE4S toolkit extension project, a vision diagramming tool that contributes towards sustainability. This interactive tool is developed using HTML, SVG, and the JointJS library. The vision diagramming tool is an open source project that can be used in any browser, which allows requirements engineers to bring their visions to life while keeping sustainability in mind. Requirements engineers, with help from this tool, would be able to easily demonstrate their sustainability vision to their stakeholders and pass it on to the rest of the development team.

Wheeler, Kim M. (Kim Margaret). "A computer vision system for tracking proliferating cells." Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=61316.

Chiu, Kevin (Kevin Geeyoung). "Vision on tap : an online computer vision toolkit." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/67714.

Chang, Jason Ph D. Massachusetts Institute of Technology. "Sampling in computer vision and Bayesian nonparametric mixtures." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/91042.

Stoddart, Evan. "Computer Vision Techniques for Automotive Perception Systems." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1555357244145006.

Dandois, Jonathan P. "Remote sensing of vegetation structure using computer vision." Thesis, University of Maryland, Baltimore County, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3637314.

High-spatial resolution measurements of vegetation structure are needed for improving understanding of ecosystem carbon, water and nutrient dynamics, the response of ecosystems to a changing climate, and for biodiversity mapping and conservation, among many research areas. Our ability to make such measurements has been greatly enhanced by continuing developments in remote sensing technology—allowing researchers the ability to measure numerous forest traits at varying spatial and temporal scales and over large spatial extents with minimal to no field work, which is costly for large spatial areas or logistically difficult in some locations. Despite these advances, there remain several research challenges related to the methods by which three-dimensional (3D) and spectral datasets are joined (remote sensing fusion) and the availability and portability of systems for frequent data collections at small scale sampling locations. Recent advances in the areas of computer vision structure from motion (SFM) and consumer unmanned aerial systems (UAS) offer the potential to address these challenges by enabling repeatable measurements of vegetation structural and spectral traits at the scale of individual trees. However, the potential advances offered by computer vision remote sensing also present unique challenges and questions that need to be addressed before this approach can be used to improve understanding of forest ecosystems. For computer vision remote sensing to be a valuable tool for studying forests, bounding information about the characteristics of the data produced by the system will help researchers understand and interpret results in the context of the forest being studied and of other remote sensing techniques. This research advances understanding of how forest canopy and tree 3D structure and color are accurately measured by a relatively low-cost and portable computer vision personal remote sensing system: 'Ecosynth'. Recommendations are made for optimal conditions under which forest structure measurements should be obtained with UAS-SFM remote sensing. Ultimately remote sensing of vegetation by computer vision offers the potential to provide an 'ecologist's eye view', capturing not only canopy 3D and spectral properties, but also seeing the trees in the forest and the leaves on the trees.
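
The core of such computer vision SfM pipelines is recovering camera geometry and 3D structure from overlapping photographs. The sketch below shows the two-view building block with OpenCV: matching features between two UAV images, estimating the essential matrix, and triangulating sparse 3D points. The file names and the intrinsic matrix K are placeholders, and production systems (including the one described above) rely on multi-view reconstruction with bundle adjustment rather than this two-view sketch.

    import cv2
    import numpy as np

    # Placeholder inputs: two overlapping aerial photos and assumed intrinsics K.
    img1 = cv2.imread("uav_0001.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("uav_0002.jpg", cv2.IMREAD_GRAYSCALE)
    K = np.array([[1200.0, 0, 960.0], [0, 1200.0, 540.0], [0, 0, 1.0]])

    # Detect and match local features (ORB keeps the example dependency-free).
    orb = cv2.ORB_create(4000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Estimate relative camera motion from the essential matrix (RANSAC).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Triangulate inlier correspondences into a sparse 3D point cloud.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    inl = mask.ravel() > 0
    pts4d = cv2.triangulatePoints(P1, P2, pts1[inl].T, pts2[inl].T)
    cloud = (pts4d[:3] / pts4d[3]).T
    print(cloud.shape, "sparse 3D points recovered")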

Hopkins, David. "A computer vision approach to classification of circulating tumor cells." Thesis, Colorado State University, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=1539638.

Current research into the detection and characterization of circulating tumor cells (CTCs) in the bloodstream can be used to assess the threat to a potential cancer victim. We have determined specific goals to further the understanding of these cells: 1) full automation of an algorithm to overcome the labor-intensive and time-consuming nature of current methods, 2) detection of single CTCs amongst several million white blood cells given digital imagery of a panel of blood, and 3) objective classification of white blood cells, CTCs, and potential sub-types.

We demonstrate in this paper the theory, code, and implementation developed for addressing these goals using mathematics and computer vision techniques. These include: 1) formation of a completely data-driven methodology, 2) use of the Bag of Features computer vision technique coupled with custom-built pixel-centric feature descriptors, and 3) use of clustering techniques such as K-means and hierarchical clustering as a robust classification method to glean insights into cell characteristics.

To objectively determine the adequacy of our approach, we test our algorithm against three benchmarks: sensitivity/specificity in classification, nontrivial event detection, and rotational invariance. The algorithm performed well on the first two, and we provide possible modifications to improve performance on the third. The results of the sensitivity and specificity benchmark are important. The unfiltered data we used to test our algorithm were images of blood panels containing 44,914 WBCs and 39 CTCs. The algorithm classified 67.5 percent of CTCs into an outlier cluster containing only 300 cells. A simple modification brought the classification rate up to 80 percent of total CTCs. This modification brings the cluster count to only 400 cells. This is a significant reduction in the number of cells a pathologist would sort through, as it is only 0.9 percent of the total data.
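
For readers unfamiliar with the bag-of-features pipeline the thesis builds on, here is a compact sketch: raw pixel patches stand in for the custom pixel-centric descriptors, a K-means vocabulary turns each cell image into a histogram of visual words, and a second clustering step groups the histograms so that rare cell types collect in small clusters. Patch size, vocabulary size, and cluster counts are illustrative choices, not the thesis's settings.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    def patch_descriptors(image, size=8, step=4):
        """Slide a window over the image and return flattened, normalized patches
        (a simple stand-in for pixel-centric feature descriptors)."""
        h, w = image.shape
        patches = [image[r:r + size, c:c + size].ravel()
                   for r in range(0, h - size, step)
                   for c in range(0, w - size, step)]
        patches = np.asarray(patches, dtype=float)
        return (patches - patches.mean(axis=1, keepdims=True)) / 255.0

    # Synthetic stand-ins for segmented cell images (e.g., 32x32 grayscale crops).
    cells = [rng.integers(0, 256, (32, 32)).astype(float) for _ in range(200)]

    # Step 1: build a visual vocabulary by clustering descriptors from all cells.
    all_desc = np.vstack([patch_descriptors(c) for c in cells])
    vocab = KMeans(n_clusters=50, n_init=10, random_state=0).fit(all_desc)

    # Step 2: represent each cell as a histogram of visual-word occurrences.
    def bof_histogram(image):
        words = vocab.predict(patch_descriptors(image))
        return np.bincount(words, minlength=50) / len(words)

    histograms = np.array([bof_histogram(c) for c in cells])

    # Step 3: cluster the histograms; small clusters flag unusual cells,
    # mirroring how CTCs concentrated in a small outlier cluster in the study.
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(histograms)
    sizes = np.bincount(labels)
    print("cluster sizes:", sizes, "-> smallest cluster:", sizes.argmin())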

Simek, Kyle. "Branching Gaussian Process Models for Computer Vision." Diss., The University of Arizona, 2016. http://hdl.handle.net/10150/612094.

Stewart, Kendall Lee. "The Performance of Random Prototypes in Hierarchical Models of Vision." Thesis, Portland State University, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1605894.

I investigate properties of HMAX, a computational model of hierarchical processing in the primate visual cortex. High-level cortical neurons have been shown to respond highly to particular natural shapes, such as faces. HMAX models this property with a dictionary of natural shapes, called prototypes, that respond to the presence of those shapes. The resulting set of similarity measurements is an effective descriptor for classifying images. Curiously, prior work has shown that replacing the dictionary of natural shapes with entirely random prototypes has little impact on classification performance. This work explores that phenomenon by studying the performance of random prototypes on natural scenes, and by comparing their performance to that of sparse random projections of low-level image features.
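
The prototype-similarity stage that this work studies can be stated concretely: each prototype is compared against every patch of the image, and the maximum similarity per prototype becomes one entry of the image descriptor. A minimal NumPy sketch of that S2/C2-style computation with random prototypes follows; the sizes and the Gaussian width are arbitrary illustrative choices, not parameters from the thesis.

    import numpy as np

    rng = np.random.default_rng(0)
    PATCH, N_PROTO, SIGMA = 8, 64, 4.0

    # Random prototypes: in standard HMAX these would be patches sampled from
    # natural images, but random entries are used here, as the text discusses.
    prototypes = rng.normal(size=(N_PROTO, PATCH * PATCH))

    def c2_descriptor(image):
        """Max response of each prototype over all image positions (C2 pooling)."""
        h, w = image.shape
        patches = np.array([image[r:r + PATCH, c:c + PATCH].ravel()
                            for r in range(h - PATCH)
                            for c in range(w - PATCH)])
        # Gaussian radial basis similarity between every patch and every prototype.
        d2 = ((patches[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        responses = np.exp(-d2 / (2 * SIGMA ** 2))
        return responses.max(axis=0)  # one max-pooled value per prototype

    descriptor = c2_descriptor(rng.normal(size=(32, 32)))
    print(descriptor.shape)  # (64,) -> feature vector for a downstream classifier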

Chen, Bikui 1973. "An application of nonlinear resistive networks in computer vision." Thesis, Massachusetts Institute of Technology, 2000. http://hdl.handle.net/1721.1/88346.

Chen, Ke. "Latent dependency mining for solving regression problems in computer vision." Thesis, Queen Mary, University of London, 2013. http://qmro.qmul.ac.uk/xmlui/handle/123456789/8402.

Lin, Xinze. "Vision-based Tracking for Intuitive Interaction on Webpages." Thesis, University of Gävle, Department of Industrial Development, IT and Land Management, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-7088.

Vision-based tracking techniques provide an intuitive experience for users interacting with their computers. However, such techniques seldom appear in web applications that can be widely accessed by average users. The purpose of this paper is to suggest a technique that will enable more users to engage in intuitive interaction on the Web. In order to achieve this, a vision-based tracking system is built. The tracking system can be used by different types of web applications, so that a wide range of users can access this technique.

Biswas, Aritro. "Using computer vision to gain health insights from social media." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/119746.

Yang, Woodward. "The architecture and design of CCD processors for computer vision." Thesis, Massachusetts Institute of Technology, 1990. http://hdl.handle.net/1721.1/13581.

Liu, Tony J. "A real-time computer vision library for heterogeneous processing environments." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/66439.

Jaroensri, Ronnachai. "Learning to solve problems in computer vision with synthetic data." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/122560.

Vondrick, Carl (Carl Martin). "Predictive vision." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/112001.

Stich, Melanie Katherine. "Computer vision for dual spacecraft proximity operations -- A feasibility study." Thesis, University of California, Davis, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=1604072.

A computer vision-based navigation feasibility study consisting of two navigation algorithms is presented to determine whether computer vision can be used to safely navigate a small semi-autonomous inspection satellite in proximity to the International Space Station. Using stereoscopic image sensors and computer vision, the relative attitude determination and relative distance determination algorithms estimate the inspection satellite's position relative to its host spacecraft. An algorithm needed to calibrate the stereo camera system is presented, and this calibration method is discussed. These relative navigation algorithms are tested in NASA Johnson Space Center's simulation software, Engineering DOUG (Dynamic Onboard Ubiquitous Graphics) Graphics for Exploration (EDGE), using a rendered model of the International Space Station to serve as the host spacecraft. Both vision-based algorithms attained successful results, and recommended future work is discussed.
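
The stereo calibration step mentioned above is commonly done with a checkerboard target; the sketch below outlines that standard OpenCV procedure (corner detection per camera, then cv2.stereoCalibrate to recover the rotation and translation between the two sensors). The board geometry and file lists are placeholders, and this is a generic recipe rather than the calibration method developed in the thesis.

    import cv2
    import numpy as np
    import glob

    # Placeholder checkerboard geometry: 9x6 inner corners, 25 mm squares.
    CORNERS, SQUARE = (9, 6), 0.025
    objp = np.zeros((CORNERS[0] * CORNERS[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:CORNERS[0], 0:CORNERS[1]].T.reshape(-1, 2) * SQUARE

    obj_pts, left_pts, right_pts = [], [], []
    pairs = zip(sorted(glob.glob("left_*.png")), sorted(glob.glob("right_*.png")))
    for lf, rf in pairs:
        left = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
        right = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
        okl, cl = cv2.findChessboardCorners(left, CORNERS)
        okr, cr = cv2.findChessboardCorners(right, CORNERS)
        if okl and okr:  # keep only frames where both cameras see the board
            obj_pts.append(objp)
            left_pts.append(cl)
            right_pts.append(cr)

    # Calibrate each camera, then solve for the rotation R and translation T
    # between the two sensors (the quantities a stereo rig needs to triangulate).
    size = left.shape[::-1]
    _, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
    _, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
    err, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
        obj_pts, left_pts, right_pts, K1, d1, K2, d2, size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    print("stereo reprojection error:", err)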

Rondahl, Thomas. "Face Detection in Digital Imagery Using Computer Vision and Image Processing." Thesis, Umeå universitet, Institutionen för datavetenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-51406.

Bång, Filip. "Computer vision as a tool for forestry." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-85214.

Lam, Yiu Man. "Self-organized cortical map formation by guiding connections." View abstract or full-text, 2004. http://library.ust.hk/cgi/db/thesis.pl?ELEC%202005%20LAM.

Cielniak, Grzegorz. "People Tracking by Mobile Robots using Thermal and Colour Vision." Doctoral thesis, Örebro : Örebro universitetsbibliotek, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-1111.

Genc, Serkan. "Vision-based Hand Interface Systems In Human Computer Interaction." Phd thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12611700/index.pdf.

Thomure, Michael David. "The Role of Prototype Learning in Hierarchical Models of Vision." PDXScholar, 2014. https://pdxscholar.library.pdx.edu/open_access_etds/1665.

Javadi, Mohammad Saleh. "Computer Vision Algorithms for Intelligent Transportation Systems Applications." Licentiate thesis, Blekinge Tekniska Högskola, Institutionen för matematik och naturvetenskap, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-17166.

Kraft, Adam Davis. "Vision by alignment." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/115632.

Grech, Raphael. "Multi-robot vision." Thesis, Kingston University, 2013. http://eprints.kingston.ac.uk/27790/.

Hyman, Jacob A. (Jacob Andrew) 1980. "Computer vision based people tracking for motivating behavior in public spaces." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/28465.

Gorantla, Lakshmi Anjana Devi. "Development of a Computer Vision and Image Processing Toolbox for Matlab." Thesis, Southern Illinois University at Edwardsville, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10844337.

Computer Vision and Image Processing (CVIP) applications can be developed and analyzed using the CVIPtools software, developed at Southern Illinois University Edwardsville in the CVIP Laboratory under the guidance of Dr. Scott E. Umbaugh. The CVIPtools software is written in the C/C++/C# programming languages. Due to the popularity of Matlab in engineering applications, it was decided to port the CVIPtools library functions to Matlab M-files and create a CVIP Toolbox for Matlab.

This work consists of developing, testing, packaging, documenting, and releasing the first version of the Matlab Computer Vision and Image Processing Toolbox; the several steps involved are described in this research work. The primary aim of this thesis work is to create a toolbox that is independent of any other Matlab toolboxes. CVIPtools has over 200 functions written in C, but due to the growing demand for Matlab we decided to make the functions available in Matlab as well. Once the toolbox is installed, the user can call its functions like Matlab built-in functions, which makes it easy to understand and experiment with different CVIP algorithms.

Initially the toolbox was created by writing wrapper functions, via MEX functions, around the programs written in C. Later, due to problems during testing, it was determined [5] that it would be more suitable to write separate Matlab code (M-files) for all the functions and create a new toolbox.

The CVIP Toolbox for Matlab is an open-source project and is independent of any other toolboxes. Thus, the user can install the toolbox and use all the functions like Matlab built-in functions without purchasing any other Matlab toolboxes, which comparable toolboxes of this type require. This first version contains 206 functions, the primary functions for CVIP applications, arranged by category so that it is easy for the user to understand and search them.

The CVIP Toolbox is organized into several folders, including CVIP Lab, which allows the user to create any algorithm with the help of the functions available in the toolbox. The user can explore different functions in the toolbox and vary parameters experimentally to achieve the desired results. The skeleton program for the lab is cviplab.m, which has a sample function implemented so that the user can see how the sample is executed and can call other functions using the same method.

White, Raymond Gordon. "Effect of point-spread functions on geometric measurements in computer vision." Diss., The University of Arizona, 1991. http://hdl.handle.net/10150/185463.

Lai, Bing-Chang. "Combining generic programming with vector processing for machine vision." Access electronically, 2005. http://www.library.uow.edu.au/adt-NWU/public/adt-NWU20060221.095043/index.html.

Acevedo, Feliz Daniel. "A framework for the perceptual optimization of multivalued multilayered two-dimensional scientific visualization methods." View abstract/electronic edition; access limited to Brown University users, 2008. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3318287.

Wang, Xiaomei. "Definition and utilization of spatial relations in high level computer vision /." free to MU campus, to others for purchase, 1999. http://wwwlib.umi.com/cr/mo/fullcit?p9951133.

Naik, Nikhil (Nikhil Deepak). "Visual urban sensing : understanding cities through computer vision." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/109656.

Roth, Daniel R. (Daniel Risner) 1979. "Vision based robot navigation." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/17978.

Chapman, David 1961. "Vision, instruction, and action." Thesis, Massachusetts Institute of Technology, 1990. http://hdl.handle.net/1721.1/17253.

Balasuriya, Sumitha. "A computational model of space-variant vision based on a self-organised artificial retina tessellation." Thesis, University of Glasgow, 2006. http://theses.gla.ac.uk/4934/.

Abusaleh, Sumaya. "A Novel Computer Vision-Based Framework for Supervised Classification of Energy Outbreak Phenomena." Thesis, University of Bridgeport, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10746723.

Today, there is a need for a properly designed surveillance system that detects and categorizes explosion phenomena in order to identify explosion risk and reduce its impact through mitigation and preparedness. This dissertation introduces state-of-the-art classification of explosion phenomena through pattern recognition techniques on color images. Consequently, we present a novel taxonomy for explosion phenomena. In particular, we demonstrate different aspects of the volcanic eruptions and nuclear explosions in the proposed taxonomy, including scientific formation, real examples, existing monitoring methodologies, and their limitations. In addition, we propose a novel framework designed to categorize explosion phenomena against non-explosion phenomena. Moreover, a new dataset, Volcanic and Nuclear Explosions (VNEX), was collected. VNEX totals 10,654 samples and includes the following patterns: pyroclastic density currents, lava fountains, lava and tephra fallout, nuclear explosions, wildfires, fireworks, and sky clouds.

In order to achieve high reliability in the proposed explosion classification framework, we employ various feature extraction approaches. We calculate intensity levels to extract the texture features, utilize the YCbCr color model to calculate the amplitude features, employ the radix-2 Fast Fourier Transform to compute the frequency features, and use the uniform local binary patterns technique to compute the histogram features. These discriminative features are combined into a single input vector that provides valuable insight into the images, and then fed into the following classification techniques: Euclidean distance, correlation, k-nearest neighbors, one-against-one multiclass support vector machines with different kernels, and the multilayer perceptron model. Evaluation results show the design of the proposed framework is effective and robust, and a trade-off between computation time and classification rate was achieved.
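
As a rough illustration of the feature-combination step, the sketch below concatenates FFT-based frequency features with a uniform-LBP histogram and feeds the result to a k-nearest-neighbors classifier. The pooling size, bin count, and classifier settings are invented for the example and are not taken from the dissertation:

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.neighbors import KNeighborsClassifier

def describe(gray):
    """Two of the feature families mentioned above, concatenated:
    radix-2 FFT frequency features and a uniform-LBP histogram."""
    spec = np.abs(np.fft.fft2(gray.astype(float)))  # FFT is radix-2 for 2^k sizes
    freq = np.log1p(spec)[:8, :8].ravel()           # keep a low-frequency block
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([freq, hist])

# Toy usage on random 64x64 8-bit "images" with binary labels.
rng = np.random.default_rng(0)
imgs = rng.integers(0, 256, size=(20, 64, 64), dtype=np.uint8)
X = np.stack([describe(im) for im in imgs])
y = rng.integers(0, 2, size=20)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict(X[:2]))
```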

Vemulapalli, Raviteja. "Geometric representations and deep Gaussian conditional random field networks for computer vision." Thesis, University of Maryland, College Park, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10192530.

Representation and context modeling are two important factors that are critical in the design of computer vision algorithms. For example, in applications such as skeleton-based human action recognition, representations that capture the 3D skeletal geometry are crucial for achieving good action recognition accuracy. However, most of the existing approaches focus mainly on the temporal modeling and classification steps of the action recognition pipeline instead of representations. Similarly, in applications such as image enhancement and semantic image segmentation, modeling the spatial context is important for achieving good performance. However, the standard deep network architectures used for these applications do not explicitly model the spatial context. In this dissertation, we focus on the representation and context modeling issues for some computer vision problems and make novel contributions by proposing new 3D geometry-based representations for recognizing human actions from skeletal sequences, and introducing Gaussian conditional random field model-based deep network architectures that explicitly model the spatial context by considering the interactions among the output variables. In addition, we also propose a kernel learning-based framework for the classification of manifold features such as linear subspaces and covariance matrices which are widely used for image set-based recognition tasks.

This dissertation has been divided into five parts. In the first part, we introduce various 3D geometry-based representations for the problem of skeleton-based human action recognition. The proposed representations, referred to as R3DG features, capture the relative 3D geometry between various body parts using 3D rigid body transformations. We model human actions as curves in these R3DG feature spaces, and perform action recognition using a combination of dynamic time warping, Fourier temporal pyramid representation and support vector machines. Experiments on several action recognition datasets show that the proposed representations perform better than many existing skeletal representations.

In the second part, we represent 3D skeletons using only the relative 3D rotations between various body parts instead of full 3D rigid body transformations. This skeletal representation is scale-invariant and belongs to a Lie group based on the special orthogonal group. We model human actions as curves in this Lie group and map these curves to the corresponding Lie algebra by combining the logarithm map with rolling maps. Using rolling maps reduces the distortions introduced in the action curves while mapping to the Lie algebra. Finally, we perform action recognition by classifying the Lie algebra curves using Fourier temporal pyramid representation and a support vector machine classifier. Experimental results show that by combining the logarithm map with rolling maps, we can get improved performance when compared to using the logarithm map alone.
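
The logarithm map used here sends a relative rotation in SO(3) to its axis-angle vector in the Lie algebra so(3). Below is a minimal sketch of that single step using SciPy, with the rolling maps omitted; the helper name and toy rotations are illustrative, not the thesis's code:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_rotation_feature(rot_a, rot_b):
    """Relative rotation between two body parts, mapped to so(3).

    rot_a, rot_b : 3x3 rotation matrices of the two parts.
    Returns the axis-angle vector log(rot_a^T rot_b), i.e. the image
    of the relative rotation under the SO(3) logarithm map.
    """
    rel = R.from_matrix(rot_a.T @ rot_b)
    return rel.as_rotvec()  # rotation vector = axis * angle

# Toy example: parts rotated 30 and 75 degrees about the z-axis;
# the relative rotation is 45 degrees about z.
ra = R.from_euler("z", 30, degrees=True).as_matrix()
rb = R.from_euler("z", 75, degrees=True).as_matrix()
print(np.degrees(relative_rotation_feature(ra, rb)))  # -> [0, 0, 45]
```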

In the third part, we focus on classification of manifold features such as linear subspaces and covariance matrices. We present a kernel-based extrinsic framework for the classification of manifold features and address the issue of kernel selection using multiple kernel learning. We introduce two criteria for jointly learning the kernel and the classifier by solving a single optimization problem. In the case of support vector machine classifier, we formulate the problem of learning a good kernel-classifier combination as a convex optimization problem. The proposed approach performs better than many existing methods for the classification of manifold features when applied to image set-based classification task.

In the fourth part, we propose a novel end-to-end trainable deep network architecture for image denoising based on a Gaussian Conditional Random Field (CRF) model. Contrary to existing discriminative denoising approaches, the proposed network explicitly models the input noise variance and hence is capable of handling a range of noise levels. This network consists of two sub-networks: (i) a parameter generation network that generates the Gaussian CRF pairwise potential parameters based on the input image, and (ii) an inference network whose layers perform the computations involved in an iterative Gaussian CRF inference procedure. Experiments on several images show that the proposed approach produces results on par with the state-of-the-art without training a separate network for each noise level.
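
For intuition, when both the data term and the pairwise potentials are quadratic, Gaussian CRF MAP inference has a closed form: minimizing ||x - y||^2 / (2*sigma^2) + (lambda/2) * x^T L x gives x* = (I + sigma^2 * lambda * L)^{-1} y. The toy 1-D sketch below hand-fixes a chain-Laplacian pairwise term, whereas the proposed network instead generates the potentials from the input and unrolls an iterative inference procedure:

```python
import numpy as np

def gaussian_crf_denoise(y, sigma2, lam):
    """MAP inference sketch for a Gaussian CRF denoiser on a 1-D signal.

    Energy: ||x - y||^2 / (2*sigma2) + (lam/2) * sum_i (x[i+1] - x[i])^2.
    The pairwise term is a chain-graph Laplacian L, so the minimizer is
    x* = (I + sigma2 * lam * L)^{-1} y.
    """
    n = len(y)
    L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # chain Laplacian
    L[0, 0] = L[-1, -1] = 1                               # free boundaries
    return np.linalg.solve(np.eye(n) + sigma2 * lam * L, y)

# Noisy step signal: smoothing strength scales with the noise variance,
# mirroring the point above that the model conditions on the noise level.
rng = np.random.default_rng(1)
clean = np.repeat([0.0, 1.0], 32)
noisy = clean + 0.1 * rng.standard_normal(64)
print(np.abs(gaussian_crf_denoise(noisy, sigma2=0.01, lam=50.0) - clean).mean())
```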

In the final part of this dissertation, we propose a Gaussian CRF model-based deep network architecture for the task of semantic image segmentation. This network explicitly models the interactions between output variables, which is important for structured prediction tasks such as semantic segmentation. The proposed network is composed of three sub-networks: (i) a Convolutional Neural Network (CNN) based unary network for generating the unary potentials, (ii) a CNN-based pairwise network for generating the pairwise potentials, and (iii) a Gaussian mean field inference network for performing Gaussian CRF inference. When trained end-to-end in a discriminative fashion, the proposed network outperforms various CNN-based semantic segmentation approaches.

Harwath, David F. (David Frank). "Learning spoken language through vision." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/118081.

Chin, Toshio M. "Dynamic estimation in computational vision." Thesis, Massachusetts Institute of Technology, 1992. http://hdl.handle.net/1721.1/13072.

Park, Allen S. M. (Allen S.), Massachusetts Institute of Technology. "Machine-vision assisted 3D printing." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/113162.

Vicenti, Jasper Fourways. "Aural imaging from 3D vision." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/37085.

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis


Columbia University, Stanford University, Toyota Research Institute

We present GCD (for Generative Camera Dolly), a framework for synthesizing large-angle novel viewpoints of dynamic scenes from a single monocular video. Specifically, given any color video, along with precise instructions on how to rotate and/or translate the camera, our model can imagine what that same scene would look like from another perspective. Much like a camera dolly in film-making, our approach essentially conceives a virtual camera that can move around freely, reveal portions of the environment that are otherwise unseen, and reconstruct hidden objects behind occlusions, all within complex dynamic scenes, even when the contents are moving. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.

We train a neural network to predict all frames corresponding to the target viewpoint, conditioned on the input video plus relative camera pose parameters that describe the spatial relationship between the source and target extrinsics. The camera transformation is simply calculated as \( \Delta \mathcal{E} = \mathcal{E}_{src}^{-1} \cdot \mathcal{E}_{dst} \). In practice, we encode these parameters as a rotation (azimuth, elevation) and translation (radius) vector. We teach Stable Video Diffusion, a state-of-the-art diffusion model for image-to-video generation, to accept and utilize these new controls by means of finetuning.
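
A minimal sketch of that camera transformation, assuming 4x4 homogeneous extrinsic matrices; the helper and example poses are illustrative, and the conversion of the relative rotation to the azimuth/elevation/radius encoding is omitted:

```python
import numpy as np

def relative_pose(E_src, E_dst):
    """Relative camera transformation between two 4x4 extrinsics,
    following the formula above: Delta_E = E_src^{-1} @ E_dst."""
    delta = np.linalg.inv(E_src) @ E_dst
    return delta[:3, :3], delta[:3, 3]  # relative rotation, translation

# Hypothetical extrinsics: identity source, target translated 2 m along x.
E_src = np.eye(4)
E_dst = np.eye(4)
E_dst[0, 3] = 2.0
R_rel, t_rel = relative_pose(E_src, E_dst)
print(t_rel)  # -> [2. 0. 0.]
```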

Representative Results

Despite being trained on synthetic multi-view video data only, experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We showcase a mixture of in-domain as well as out-of-distribution (real-world) results. While zero-shot generalization is highly challenging and not the focus of our work, we demonstrate that our model can successfully tackle some of these videos.

Amodal Completion and Object Permanence

Partial and total occlusions are very common in everyday dynamic scenes. Our network is capable of inpainting the occluded parts of objects and scenes. In the two examples below, the input camera resides at a low elevation angle, such that the higher output viewpoint requires correctly reconstructing the objects lying farther back. Note the paper towel roll and the brown bucket in particular.

A more advanced spatiotemporal reasoning ability is needed for objects that become completely occluded throughout the video. Our model successfully persists them in the next two examples, a skill known as object permanence. In the first video, the blue duck and the red duck disappear behind a hand and a teabox, respectively.

In this second video, the brown shoe falling to the left is temporarily hidden by the purple bag, but the output reflects an accurate continuation of its dynamics, shape, and appearance before it reappears in the observation.

Driving Scene Completion (Color + Semantic)

In embodied AI, including for autonomous vehicles, situational awareness is paramount. In this environment, we trained our model to synthesize a top-down-and-forward perspective that can give the ego car (on which only a single RGB sensor has to be mounted) a much more complete, detailed overview of its surroundings. Note how the white car on the left and the two pedestrians on the right are still visible in the generated video, despite going out-of-frame with respect to the input camera.

Our framework is in principle capable of running any dense predictive computer vision task as long as training annotations are available. In this example, we classify every pixel from the novel viewpoint into its corresponding semantic category.

ParallelDomain-4D Category Legend

As before, we can also control the camera viewpoint here in a fine-grained fashion. The angles are chosen randomly for demonstration purposes.

The above driving scenarios are in fact synthetic (from the ParallelDomain engine) -- next, we qualitatively visualize a couple of real-world results (from the TRI-DDAD dataset, which was unseen during training).

Gradual vs. Jumpy Trajectories

  • Gradual (top row): Linearly interpolates the camera path between the source and target viewpoints (see the sketch after this list).
  • Jumpy (bottom row): Direct camera displacement, which synthesizes the entire video from the desired viewpoint.
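
A minimal sketch of how a gradual trajectory can be generated, interpolating rotation spherically and translation linearly between the source and target extrinsics; the exact parameterization used by the project may differ:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R, Slerp

def gradual_path(E_src, E_dst, num_frames):
    """Per-frame 4x4 extrinsics for a 'gradual' trajectory: rotation is
    spherically interpolated, translation linearly interpolated."""
    rots = R.from_matrix(np.stack([E_src[:3, :3], E_dst[:3, :3]]))
    slerp = Slerp([0.0, 1.0], rots)
    poses = []
    for t in np.linspace(0.0, 1.0, num_frames):
        E = np.eye(4)
        E[:3, :3] = slerp(t).as_matrix()
        E[:3, 3] = (1 - t) * E_src[:3, 3] + t * E_dst[:3, 3]
        poses.append(E)
    return poses  # a 'jumpy' variant would instead use E_dst for every frame

E_src, E_dst = np.eye(4), np.eye(4)
E_dst[:3, :3] = R.from_euler("y", 90, degrees=True).as_matrix()
print(len(gradual_path(E_src, E_dst, num_frames=14)))  # -> 14
```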

Real-World Success Cases

Fine-grained camera control and extreme view synthesis.

We contribute two new multi-view video datasets for training and evaluation: Kubric-4D and ParallelDomain-4D. More details coming soon!


Acknowledgements

This research is based on work partially supported by the NSF CAREER Award #2046910 and the NSF Center for Smart Streetscapes (CS3) under NSF Cooperative Agreement No. EEC-2133516. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors. The webpage template was inspired by this project page.


Palantir Wins $480 Million AI Computer Vision Deal With US Army

By Lizette Chapman

The US Army awarded a $480 million contract to Palantir Technologies Inc. for work on a project called the Maven Smart System through 2029.

The Maven deal was disclosed Wednesday as part of the Defense Department’s daily contract announcements. Palantir declined to comment on the contract.

The deal further extends Palantir’s relationship with the military, and makes use of the company’s growing suite of artificial intelligence offerings. The Maven project uses AI and computer vision to help soldiers more quickly and accurately identify targets.

Read More: AI Warfare Becomes Real for US Military With Project Maven


Thesis Proposal - Mingjie Sun

May 28, 2024, 1:00pm.

Location: In Person - Traffic21 Classroom, Gates Hillman 6501

Speaker: MINGJIE SUN, Ph.D. Student, Computer Science Department, Carnegie Mellon University, https://eric-mingjie.github.io/

Understanding and Leveraging the Activation Landscape in Transformers

The Transformer is a neural network architecture centered on the self-attention mechanism. In recent years, it has become the de facto architecture for deep learning, e.g., Large Language Models (LLMs) and Vision Transformers (ViTs). However, these models, with millions to billions of parameters, remain largely opaque and their mechanisms are difficult to interpret. As their real-world applications grow, a deep understanding of their internal representations is essential for effectively utilizing and improving these models.

In this work, we closely examine the activation landscape in Transformers. We demonstrate that understanding the intriguing activation phenomena in Transformers can have practical and meaningful implications. First, we identify a fundamental limitation of the well-established magnitude pruning method, where it fails to consider the existence of features with large activations in large-scale Transformers. Leveraging this key insight, we develop a simple and effective pruning approach. Second, we discover and study the presence of very few activations with extremely large magnitudes, which we call massive activations. We investigate the role of massive activations in Transformers and show how they are fundamentally connected to the self-attention mechanism. Last, we discuss our proposed extensions of this work, primarily focusing on developing a unified framework for LLM compression, through a principled investigation of existing works.
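
As a rough sketch of the kind of activation-aware pruning score this describes (the function, calibration setup, and per-row thresholding below are illustrative assumptions, not necessarily the thesis's exact method), weights are scored by |weight| times the norm of the corresponding input activations rather than by magnitude alone:

```python
import numpy as np

def activation_aware_prune(W, X, sparsity=0.5):
    """Prune weights by |weight| * input-activation norm.

    W : (out_features, in_features) weight matrix
    X : (num_tokens, in_features) calibration activations
    """
    score = np.abs(W) * np.linalg.norm(X, axis=0)  # broadcast per input channel
    k = int(W.shape[1] * sparsity)
    # Zero the k lowest-scoring weights within each output row.
    idx = np.argsort(score, axis=1)[:, :k]
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, idx, 0.0, axis=1)
    return W_pruned

# Magnitude pruning would drop a small weight even when its input channel
# carries a massive activation; the activation-aware score keeps it.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
X = rng.standard_normal((16, 8))
X[:, 0] *= 100.0  # one huge-activation channel
print(np.count_nonzero(activation_aware_prune(W, X)))  # half the weights remain
```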

Thesis Committee:

J. Zico Kolter (Chair), Graham Neubig, Aditi Raghunathan, and Kaiming He (Massachusetts Institute of Technology)


Earth Remote Sensing and Geographic Information Systems

  • Scientific School of the Image Processing Systems Institute of the Russian Academy of Sciences, Branch of the Federal Scientific Research Center “Crystallography and Photonics” of the Russian Academy of Sciences, Samara, Russian Federation
  • V.A. Soifer’s Scientific School
  • Published: 20 March 2024
  • Volume 33, pages 1129–1141 (2023)



V. A. Soifer, V. V. Sergeev, V. N. Kopenkov & A. V. Chernov


The article examines the role and place of Earth remote sensing (ERS) in geographic information systems. The stages of development of remote sensing and geoinformatics are given, as well as a brief overview of Russian means of obtaining, receiving, and processing satellite images. The specifics and tasks of processing remote sensing data, including hyperspectral data, as well as the experience of using remote sensing data and geoinformation to solve practical problems of managing the territory of the Samara oblast are considered.




The work was supported by the Russian Foundation for Basic Research, projects no. 16-29-09494-ofi-m, “Methods of Computer Processing of Multispectral Earth Remote Sensing Data to Determine Plant Habitats in Special Forensic Examinations,” and no. 16-29-11683-ofi-m, “High-Contrast Diffraction Gratings Integrated on a Crystal for Optical Information Processing Systems.” The authors also express gratitude to the Russian Foundation for Basic Research for grant support for more than 20 projects carried out by our team since 1993.

Author information

Authors and Affiliations

Image Processing Systems Institute, Federal Scientific Research Center “Crystallography and Photonics” of the Russian Academy of Sciences, 443001, Samara, Russian Federation

V. A. Soifer, V. V. Sergeev, V. N. Kopenkov & A. V. Chernov

Samara University, 443080, Samara, Russian Federation


Corresponding authors

Correspondence to V. A. Soifer, V. V. Sergeev, V. N. Kopenkov or A. V. Chernov.

Ethics declarations

The authors of this work declare that they have no conflicts of interest.

Additional information


Viktor Soifer, Doctor of Technical Sciences, Professor, Academician of the Russian Academy of Sciences, born in 1945. In 1968 he graduated from the Kuibyshev Aviation Institute (now Samara National Research University). In 1970 he defended his Candidate's dissertation, and in 1979 his Doctoral dissertation. He has been a corresponding member of the Russian Academy of Sciences since 2000 and an Academician since 2016. He is currently President of Samara National Research University and scientific director of the Image Processing Systems Institute of the Russian Academy of Sciences, a Branch of the Federal Scientific Research Center Crystallography and Photonics. He has published more than 700 scientific papers, including 11 monographs. He is a member of the International Optical Society (SPIE) and of the board of the International Association for Pattern Recognition (IAPR); a laureate of the State Prize in the field of science and technology, of the Government of the Russian Federation prizes in the fields of science and technology and of education, and of the Yu. A. Gagarin Prize of the Government of the Russian Federation in the field of space activities; and a member of the scientific councils of the Russian Academy of Sciences on Cybernetics, Optical Memory and Neural Systems, and Holography. He is Chairman of the Public Chamber of the Samara Region and Deputy Chairman of the Council of Rectors of Universities of the Samara oblast.


Vladislav Sergeev , born in 1951. In 1974 he graduated from the Kuibyshev Aviation Institute (now Samara National Research University). In 1978 he defended his Candidate’s dissertation, and in 1993, his Doctoral dissertation. Currently, he is the Director of the Institute of Informatics, Mathematics and Electronics, head of the Department of Geoinformatics and Information Security at Samara University, and part-time head of the laboratory of mathematical methods of image processing at the Institute of Image Processing Systems of the Russian Academy of Sciences, a Branch of the Federal Research Center Crystallography and Photonics.

Author and co-author of about 300 scientific and educational works, including four monographs, five Russian Federation patents for inventions.

Specialist in the field of digital signal processing, image analysis and pattern recognition, geoinformatics and information security. Current research interests: mathematical methods, fast computational algorithms and new information technologies of digital signal processing, image and video analysis, feature extraction for detecting and recognizing objects in images.

He is deputy chairman of the dissertation council, chairman of the Scientific and Technical Council for preliminary examination of dissertations, an expert of the Russian Foundation for Basic Research, Corresponding Member of the Russian Ecological Academy, Corresponding Member of the Academy of Engineering Sciences of the Russian Federation, member of SPIE, member of the IAPR, chairman of the Volga region branch of the Russian public organization Association of Pattern Recognition and Image Analysis, and a member of the editorial boards of the journals Pattern Recognition and Image Analysis and Computer Optics.


Vasiliy Kopenkov, born in 1978. In 2001 he graduated from Samara State Aerospace University (now Samara National Research University). In 2011 he defended his Cand. Sci. thesis. Currently he is an associate professor at the Department of Geoinformatics and Information Security at Samara University and has more than 60 scientific publications and 4 scientific and methodological works. Area of scientific interests: remote sensing data, reception and processing of satellite images, image analysis, geoinformatics.


Andrey Chernov, born in 1975. In 1999 he graduated from Samara State Aerospace University (now Korolev Samara National Research University). In 2004 he defended his Candidate's thesis. Currently he is the director of the Volga Center for Space Geoinformatics and Deputy Director of JSC Samara-Informsputnik. He has about 100 scientific publications, 2 monographs, and 12 copyright certificates, and has led or participated in 10 projects of the Russian Foundation for Basic Research and the Russian Science Foundation. Area of scientific interests: geoinformatics, digital modeling of urban environment development, urban studies. The team led by Chernov successfully implemented more than 100 projects at the regional level in 1998–2021, introducing them into the practice of state, municipal, and corporate management. Social activity: one of the initiators of the Tom Sawyer Fest festival for the renovation of the historical environment, and a participant in the implementation of the Samara-2025 Development Strategy.

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Soifer, V.A., Sergeev, V.V., Kopenkov, V.N. et al. Earth Remote Sensing and Geographic Information Systems. Pattern Recognit. Image Anal. 33, 1129–1141 (2023). https://doi.org/10.1134/S1054661823040454


Received: 23 June 2022

Revised: 23 June 2022

Accepted: 23 June 2022

Published: 20 March 2024

Issue Date: December 2023

DOI: https://doi.org/10.1134/S1054661823040454


Keywords:

  • remote sensing
  • geographic information systems
  • image processing
  • big geodata
  • hyperspectral equipment



A Bronze Age Landscape in the Russian Steppes: The Samara Valley Project



David W. Anthony, Dorcas R. Brown, Aleksandr A. Khokhlov, Pavel F. Kuznetsov, and Oleg D. Mochalov

“In all it is a thorough piece of reporting, efficiently compartmentalized, with all the evidence needed to support the discussion and conclusions clearly laid out. The editors and authors are to be congratulated in presenting such a range of complex data in so accessible a way.” — Barry Cunliffe, European Journal of Archaeology, 2017


Table of Contents

Part I  Introduction and Overview of the Samara Valley Project 1995–2002

  • Ch. 1 The Samara Valley Project and the Evolution of Pastoral Economies in the Western Eurasian Steppes by David W. Anthony
  • Ch. 2 Archaeological Field Operations in the Lower Samara Valley, 1995–2001, with Observations on Srubnaya Pastoralism by David W. Anthony, Dorcas R. Brown, and Pavel F. Kuznetsov

Part II History, Ecology, and Settlement Patterns in the Samara Oblast

  • Ch. 3 Historic Records of the Economy and Ethnic History of the Samara Region by Oleg D. Mochalov, Dmitriy V. Romanov, and David W. Anthony
  • Ch. 4 The Samara Valley in the Bronze Age: A Review of Archaeological Discoveries by Pavel F. Kuznetsov and Oleg D. Mochalov (translated from Russian by David W. Anthony)
  • Ch. 5 Paleoecological Evidence for Vegetation, Climate, and Land-Use Change in the Lower Samara River Valley by Laura M. Popova

Part III Human Skeletal Studies

  • Ch. 6 Demographic and Cranial Characteristics of the Volga-Ural Population in the Eneolithic and Bronze Age by Aleksandr A. Khokhlov
  • Ch. 7 Stable Isotope Analysis of Neolithic to Late Bronze Age Populations in the Samara Valley by Rick J. Schulting and Michael P. Richards
  • Ch. 8 A Bioarchaeological Study of Prehistoric Populations from the Volga Region by Eileen M. Murphy and Aleksandr A. Khokhlov

Part IV Excavation and Specialist Reports for the Krasnosamarskoe Kurgan Cemetery and Settlement and the Herding Camps in Peschanyi Dol

  • Ch. 9 The Geoarchaeology of the Krasnosamarskoe Sites by Arlene Miller Rosen
  • Ch. 10 Excavations at the LBA Settlement at Krasnosamarskoe by David W. Anthony, Dorcas R. Brown, Pavel F. Kuznetsov, and Oleg D. Mochalov
  • Ch. 11 Bronze Age Metallurgy in the Middle Volga by David L. Peterson, Peter Northover, Chris Salter, Blanca Maldonado, and David W. Anthony
  • Ch. 12 Floral Data Analysis: Report on the Pollen and Macrobotanical Remains from the Krasnosamarskoe Settlement by Laura M. Popova
  • Ch. 13 Phytoliths from the Krasnosamarskoe Settlement and Its Environment by Alison Weisskopf and Arlene Miller Rosen
  • Ch. 14 Dog Days of Winter: Seasonal Activities in a Srubnaya Landscape by Anne Pike-Tay and David W. Anthony
  • Ch. 15 Archaeozoological Report on the Animal Bones from the Krasnosamarskoe Settlement by Pavel A. Kosintsev
  • Ch. 16 Human-Animal Relations at Krasnosamarskoe by Nerissa Russell, Audrey Brown, and Emmett Brown
  • Ch. 17 The Bronze Age Kurgan Cemetery at Krasnosamarskoe IV by Pavel F. Kuznetsov, Oleg D. Mochalov, and David W. Anthony
  • Ch. 18 Bronze Age Herding Camps: Survey and Excavations in Peschanyi Dol by David W. Anthony, Dorcas R. Brown, Pavel F. Kuznetsov, and Oleg D. Mochalov

