Ziad Al-Halah
Assistant Professor
Kahlert School of Computing
The University of Utah
50 Central Campus Dr.
Salt Lake City, UT 84112
ziad at cs.utah.edu
cs.utah.edu
Google Scholar
ResearchGate
DBLP
ORCID
Welcome
I am an Assistant Professor of Computer Science in the Kahlert School of Computing at the University of Utah. Before that, I was a Postdoctoral Fellow at the University of Texas at Austin, working with Prof. Kristen Grauman in the Computer Vision Group.
I was a resident researcher with the Artificial Intelligence group at NAVER in South Korea, and a visiting researcher with the Cortex / Computer Vision group at Twitter in the UK and with the Vision group at Disney Research in Pittsburgh.
I received my PhD with distinction (summa cum laude) from the Karlsruhe Institute of Technology in Germany, working in the Computer Vision for Human Computer Interaction Laboratory.
My research interests are in the areas of computer vision and artificial intelligence. I'm particularly interested in transfer learning, zero-shot learning, multimodal learning, and embodied AI.
Latest News
ICCV 2025
Two papers accepted: one on automatic view selection in videos and one on material-controlled RIR generation.
CVPR 2025
[Highlight] Paper on weakly supervised view selection in videos.
I'm an Area Chair of CVPR 2025 and NeurIPS 2025.
CVPR 2024
Paper on learning spatial features from audio-visual correspondence.
Outstanding Reviewer in NeurIPS 2024.
I'm an organizer of the Computer Vision for Fashion, Art, and Design (CVFAD) workshop in CVPR 2024.
I'm an organizer of the Ethical Considerations in Creative applications of Computer Vision (EC3V) workshop in CVPR 2024.
ICML 2023
Paper on efficient video search for episodic memory.
Outstanding Reviewer in CVPR 2023.
CVPR 2023
Paper on leveraging linguistic narrations as queries to supervise episodic memory search.
I'm an Area Chair of ICCV 2023 and AAAI 2023.
NeurIPS 2022
Paper on arbitrary RIR prediction from few-shot multi-modal observations.
I'm an Area Chair of BMVC 2022 and ACCV 2022.
CVPR 2022
Two papers on zero-shot experience learning and object-goal navigation.
ICLR 2022
Paper on environment predictive coding for visual navigation.
ICCV 2021
Paper on active audio-visual source separation.
Outstanding Reviewer in CVPR 2021.
I'm an Area Chair of the International Conference on Machine Vision Applications, MVA 2021.
CVPR 2021
Two papers on semantic audio-visual navigation and dialog-based fashion retrieval.
ICLR 2021
Paper on navigation using audio-visual waypoints.
Publications
How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes
Mahnoor Fatima Saad, Ziad Al-Halah
International Conference on Computer Vision (ICCV), Oct. 2025.
PDF
Project
arXiv
Supp
@inproceedings{saad2025materialrir,
  author = {Mahnoor Fatima Saad and Ziad Al-Halah},
  title = {{How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes}},
  year = {2025},
  booktitle = {International Conference on Computer Vision (ICCV)},
  month = {Oct.},
  arxivId = {2508.02905},
}
How would the sound in a studio change with a carpeted floor and acoustic tiles on the walls? We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. We address this task with a novel encoder-decoder approach that encodes the scene's key properties from an audio-visual observation and generates the target Room Impulse Response (RIR) conditioned on the material specifications provided by the user. Our model enables the generation of diverse RIRs based on various material configurations defined dynamically at inference time. To support this task, we create a new benchmark, the Acoustic Wonderland Dataset, designed for developing and evaluating material-aware RIR prediction methods under diverse and challenging settings. Our results demonstrate that the proposed model effectively encodes material information and generates high-fidelity RIRs, outperforming several baselines and state-of-the-art methods.
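To make the mechanism concrete, here is a minimal sketch of a material-conditioned encoder-decoder in PyTorch. Everything here (module names, feature sizes, the flat per-surface material-id encoding) is an illustrative assumption, not the paper's implementation.

```python
# Illustrative sketch of a material-conditioned RIR generator; all names,
# shapes, and architecture choices here are assumptions, not the paper's code.
import torch
import torch.nn as nn

class MaterialConditionedRIRGenerator(nn.Module):
    def __init__(self, av_dim=512, n_materials=10, n_surfaces=4, rir_len=16000):
        super().__init__()
        # Encode the scene's audio-visual observation into a latent scene code.
        self.scene_encoder = nn.Sequential(
            nn.Linear(av_dim, 256), nn.ReLU(), nn.Linear(256, 256))
        # Embed the user-defined material configuration (one material id per surface).
        self.material_embed = nn.Embedding(n_materials, 32)
        # Decode scene code + material condition into a time-domain RIR.
        self.decoder = nn.Sequential(
            nn.Linear(256 + n_surfaces * 32, 512), nn.ReLU(),
            nn.Linear(512, rir_len))

    def forward(self, av_feat, material_ids):
        # av_feat: (B, av_dim) fused audio-visual features
        # material_ids: (B, n_surfaces) material choice per surface (walls, floor, ...)
        scene = self.scene_encoder(av_feat)
        mats = self.material_embed(material_ids).flatten(1)
        return self.decoder(torch.cat([scene, mats], dim=-1))

gen = MaterialConditionedRIRGenerator()
rir = gen(torch.randn(2, 512), torch.randint(0, 10, (2, 4)))  # (2, 16000)
```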
Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
International Conference on Computer Vision (ICCV), Oct. 2025.
PDF
Project
arXiv
Supp
@inproceedings{majumder2025switchview,
  author = {Sagnik Majumder and Tushar Nagarajan and Ziad Al-Halah and Kristen Grauman},
  title = {{Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos}},
  year = {2025},
  booktitle = {International Conference on Computer Vision (ICCV)},
  month = {Oct.},
  arxivId = {2412.18386},
}
We introduce Switch-a-View, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled--but human-edited--video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand. Armed with this predictor, our model can be applied to new multi-view video settings for orchestrating which viewpoint should be displayed when, even when such settings come with limited labels. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D, and rigorously validate its advantages.
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2025.
Highlight.
PDF
Project
arXiv
Supp
@inproceedings{majumder2025whichview,
  author = {Sagnik Majumder and Tushar Nagarajan and Ziad Al-Halah and Reina Pradhan and Kristen Grauman},
  title = {{Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos}},
  year = {2025},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {Jun.},
  arxivId = {2411.08753},
}
Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive 'best-view' supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best-view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video--no language or camera poses--and returns the best viewpoint to watch at each timestep. On two challenging datasets comprising diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation.
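A minimal sketch of the pseudo-labeling step, under the assumption that captions and summaries are compared as embedding vectors; the names and shapes are illustrative, not the authors' code:

```python
# Illustrative sketch (assumptions, not the authors' code): score each camera
# view by how well its predicted caption matches a view-agnostic text summary,
# and use the per-timestep argmax as a "best view" pseudo-label.
import numpy as np

def best_view_pseudo_labels(view_caption_embs, summary_embs):
    """view_caption_embs: (T, V, D) embeddings of the caption predicted from each
    of V views at each timestep; summary_embs: (T, D) view-agnostic summaries."""
    v = view_caption_embs / np.linalg.norm(view_caption_embs, axis=-1, keepdims=True)
    s = summary_embs / np.linalg.norm(summary_embs, axis=-1, keepdims=True)
    sims = np.einsum('tvd,td->tv', v, s)   # cosine similarity per view
    return sims.argmax(axis=1)             # pseudo-label: most predictive view

labels = best_view_pseudo_labels(np.random.randn(8, 4, 128), np.random.randn(8, 128))
```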
Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2024.
PDF
Project
DOI
arXiv
Supp
@inproceedings{majumder2024egoav,
  author = {Sagnik Majumder and Ziad Al-Halah and Kristen Grauman},
  title = {{Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos}},
  year = {2024},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {Jun.},
  doi = {10.1109/cvpr52733.2024.02555},
  arxivId = {2307.04760},
}
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom.
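The masked-prediction objective can be sketched roughly as follows; the toy transformer, patching, and masking details are assumptions for illustration, not the paper's model:

```python
# Minimal sketch of masked binaural audio prediction (assumed shapes/names; not
# the paper's implementation): mask patches of the binaural spectrogram and
# train a model to reconstruct them from audio context plus video features.
import torch
import torch.nn as nn

def mask_patches(spec, mask_ratio=0.75):
    # spec: (B, N, D) binaural spectrogram patches; returns masked input + mask
    B, N, D = spec.shape
    mask = torch.rand(B, N) < mask_ratio
    masked = spec.clone()
    masked[mask] = 0.0
    return masked, mask

class AVMaskedAutoencoder(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2)
        self.head = nn.Linear(d, d)

    def forward(self, audio_patches, video_feats):
        # Concatenate video tokens so reconstruction can exploit vision.
        x = torch.cat([audio_patches, video_feats], dim=1)
        x = self.enc(x)
        return self.head(x[:, :audio_patches.size(1)])  # predict audio patches

model = AVMaskedAutoencoder()
audio = torch.randn(2, 32, 256)
masked, mask = mask_patches(audio)
recon = model(masked, torch.randn(2, 8, 256))
loss = ((recon - audio) ** 2)[mask].mean()   # loss only on masked patches
```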
SpotEM: Efficient Video Search for Episodic Memory
Santhosh K. Ramakrishnan, Ziad Al-Halah, Kristen Grauman
International Conference on Machine Learning (ICML), Jul. 2023.
PDF
Project
arXiv
@inproceedings{ramakrishnan2023soptem,
  author = {Santhosh K. Ramakrishnan and Ziad Al-Halah and Kristen Grauman},
  title = {{SpotEM: Efficient Video Search for Episodic Memory}},
  year = {2023},
  booktitle = {International Conference on Machine Learning (ICML)},
  month = {Jul.},
  arxivId = {2306.15850},
}
The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., ``where did I leave my purse?''). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: a novel clip selector that learns to identify promising video regions to search conditioned on the language query; a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and distillation losses that address optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10%-25% of the clip features, we preserve 84%-95%+ of the original EM model's accuracy.
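The selection idea, in sketch form; the scoring function and interfaces below are simplifying assumptions, not SpotEM's actual components:

```python
# Sketch of query-conditioned clip selection (assumed interfaces, not SpotEM's
# code): score clips with cheap features, then run the expensive episodic
# memory model only on the top-scoring fraction of the video.
import numpy as np

def spot_then_search(cheap_feats, query_emb, expensive_model, budget=0.2):
    """cheap_feats: (N, D) low-cost per-clip features for N clips;
    query_emb: (D,) language query embedding; budget: fraction of clips to keep."""
    scores = cheap_feats @ query_emb                  # cheap relevance scores
    k = max(1, int(budget * len(cheap_feats)))
    keep = np.argsort(scores)[-k:]                    # most promising clips
    # Expensive features are computed only for the selected clips.
    return expensive_model(sorted(keep))

# Example with a stand-in "expensive model" that just reports what it was given:
result = spot_then_search(np.random.randn(100, 64), np.random.randn(64),
                          expensive_model=lambda idx: idx, budget=0.1)
```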
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Santhosh K. Ramakrishnan, Ziad Al-Halah, Kristen Grauman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
PDF
Project
DOI
arXiv
@inproceedings{ramakrishnan2023naq,
  author = {Santhosh K. Ramakrishnan and Ziad Al-Halah and Kristen Grauman},
  title = {{NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory}},
  year = {2023},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  doi = {10.1109/cvpr52729.2023.00647},
  arxivId = {2301.00746},
}
Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature makes it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
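The augmentation itself is simple to sketch. The fixed-size window below is a hypothetical simplification, but it conveys how timestamped narrations become NLQ-style training pairs:

```python
# Sketch of the narrations-as-queries augmentation (assumed data layout, not
# the authors' code): each timestamped narration becomes a (query text,
# temporal window) training pair for the NLQ localization model.
def narrations_to_queries(narrations, window=4.0, video_len=None):
    """narrations: list of (timestamp_sec, text); returns NLQ-style samples with
    a fixed-size window centered on the narration timestamp (a simplification --
    NaQ derives the windows more carefully)."""
    samples = []
    for t, text in narrations:
        start = max(0.0, t - window / 2)
        end = t + window / 2 if video_len is None else min(video_len, t + window / 2)
        samples.append({"query": text, "start": start, "end": end})
    return samples

samples = narrations_to_queries([(12.3, "I put the purse on the table"),
                                 (40.8, "I open the fridge")], video_len=60.0)
```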
A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems
Megan M Baker, Alexander New, Mario Aguilar-Simon, Ziad Al-Halah, Sébastien MR Arnold, and 42 more.
Neural Networks (NN), March 2023.
PDF
DOI
arXiv
@article{llsysNN2023,
  author = {Megan M Baker and Alexander New and Mario Aguilar-Simon and Ziad Al-Halah and Sébastien MR Arnold and others},
  title = {{A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems}},
  year = {2023},
  journal = {Neural Networks (NN)},
  month = {March},
  doi = {10.1016/j.neunet.2023.01.007},
  arxivId = {2301.07799},
}
Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to “real world” events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of “Lifelong Learning” systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development - both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future.
Few-Shot Audio-Visual Learning of Environment Acoustics
Sagnik Majumder, Changan Chen*, Ziad Al-Halah* and Kristen Grauman
Conference on Neural Information Processing Systems (NeurIPS), Nov. 2022.
PDF
Project
arXiv
@inproceedings{majumder2022fsrir,
  author = {Sagnik Majumder and Changan Chen* and Ziad Al-Halah* and Kristen Grauman},
  title = {{Few-Shot Audio-Visual Learning of Environment Acoustics}},
  year = {2022},
  booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
  month = {Nov.},
  arxivId = {2206.04006},
}
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and -- in a major departure from traditional methods -- generalizing to novel environments in a few-shot manner.
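A rough sketch of the attention structure, with invented module names and sizes (the paper's architecture and training objective differ in detail):

```python
# Sketch of arbitrary-query RIR prediction (assumed shapes/names, not the
# paper's model): self-attention builds an acoustic context from a few
# audio-visual observations; cross-attention decodes the RIR for an arbitrary
# query source-receiver pose.
import torch
import torch.nn as nn

class FewShotRIRPredictor(nn.Module):
    def __init__(self, d=256, rir_len=4000):
        super().__init__()
        self.context = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True), 2)
        self.pose_embed = nn.Linear(6, d)      # query (source xyz, receiver xyz)
        self.head = nn.Linear(d, rir_len)

    def forward(self, obs_tokens, query_pose):
        # obs_tokens: (B, K, d) few-shot audio-visual observation embeddings
        # query_pose: (B, 6) arbitrary source-receiver location query
        ctx = self.context(obs_tokens)                 # self-attention context
        q = self.pose_embed(query_pose).unsqueeze(1)   # (B, 1, d)
        out = self.decoder(q, ctx)                     # cross-attend to context
        return self.head(out.squeeze(1))               # predicted RIR

model = FewShotRIRPredictor()
rir = model(torch.randn(2, 8, 256), torch.randn(2, 6))  # (2, 4000)
```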
Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation
Ziad Al-Halah, Santhosh K. Ramakrishnan and Kristen Grauman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
PDF
Project
DOI
arXiv
@inproceedings{al-halah2022zsel,
  author = {Ziad Al-Halah and Santhosh K. Ramakrishnan and Kristen Grauman},
  title = {{Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation}},
  year = {2022},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  doi = {10.1109/cvpr52688.2022.01652},
  arxivId = {2202.02440},
}
In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated whenever there is a change in the task type or the goal modality. We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks (e.g., ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch, audio, label). Furthermore, our model enables zero-shot experience learning, whereby it can solve the target tasks without receiving any task-specific interactive training. Our experiments on multiple photorealistic datasets and challenging tasks show that our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning
Santhosh K. Ramakrishnan, Devendra S. Chaplot, Ziad Al-Halah, Jitendra Malik, Kristen Grauman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
Oral.
PDF
Project
DOI
arXiv
@inproceedings{ramakrishnan2022poni,
  author = {Santhosh K. Ramakrishnan and Devendra S. Chaplot and Ziad Al-Halah and Jitendra Malik and Kristen Grauman},
  title = {{PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning}},
  year = {2022},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  doi = {10.1109/cvpr52688.2022.01832},
  arxivId = {2201.10029},
}
State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of `where to look?' for an object and `how to navigate to (x, y)?'. Our key insight is that `where to look?' can be treated purely as a perception problem, and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform ObjectGoal navigation. Experiments on Gibson and Matterport3D demonstrate that our method achieves the state-of-the-art for ObjectGoal navigation while incurring up to 1,600x less computational cost for training.
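A sketch of how predicted potential functions could drive goal selection on a top-down map; the combination rule and frontier restriction below are illustrative assumptions, not PONI's exact procedure:

```python
# Sketch of using predicted potential functions for ObjectGoal navigation
# (assumed interfaces, not PONI's code): combine an area potential (unexplored
# space) and an object potential (likelihood the goal is nearby), then pick
# the frontier cell with the highest combined potential as the long-term goal.
import numpy as np

def select_long_term_goal(area_potential, object_potential, frontier_mask, alpha=0.5):
    """All inputs are (H, W) maps; frontier_mask marks reachable frontier cells."""
    potential = alpha * object_potential + (1 - alpha) * area_potential
    potential = np.where(frontier_mask, potential, -np.inf)  # restrict to frontier
    return np.unravel_index(np.argmax(potential), potential.shape)

H = W = 64
goal_xy = select_long_term_goal(np.random.rand(H, W), np.random.rand(H, W),
                                np.random.rand(H, W) > 0.9)
```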
Environment Predictive Coding for Visual Navigation
Santhosh K. Ramakrishnan, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
International Conference on Learning Representations (ICLR), April 2022.
PDF
Project
arXiv
@inproceedings{ramakrishnan2022epc,
  author = {Santhosh K. Ramakrishnan and Tushar Nagarajan and Ziad Al-Halah and Kristen Grauman},
  title = {{Environment Predictive Coding for Visual Navigation}},
  year = {2022},
  booktitle = {International Conference on Learning Representations (ICLR)},
  month = {April},
  arxivId = {2102.02337},
}
We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for individual images, we aim to encode a 3D environment using a series of images observed by an agent moving in it. We learn these representations via a masked-zone prediction task, which segments an agent’s trajectory into zones and then predicts features of randomly masked zones, conditioned on the agent’s camera poses. This explicit spatial conditioning encourages learning representations that capture the geometric and semantic regularities of 3D environments. We learn such representations on a collection of video walkthroughs and demonstrate successful transfer to multiple downstream navigation tasks. Our experiments on the real-world scanned 3D environments of Gibson and Matterport3D show that our method obtains 2 - 6× higher sample-efficiency and up to 57% higher performance over standard image-representation learning.
Move2Hear: Active Audio-Visual Source Separation
Sagnik Majumder, Ziad Al-Halah and Kristen Grauman
IEEE International Conference on Computer Vision (ICCV), Oct. 2021.
PDF
Project
DOI
arXiv
@inproceedings{majumder2021move2hear,
  author = {Sagnik Majumder and Ziad Al-Halah and Kristen Grauman},
  title = {{Move2Hear: Active Audio-Visual Source Separation}},
  year = {2021},
  booktitle = {IEEE International Conference on Computer Vision (ICCV)},
  month = {Oct.},
  doi = {10.1109/iccv48922.2021.00034},
  arxivId = {2105.07142},
}
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and must use its eyes and ears to automatically separate out the sounds originating from the target object within a limited time budget. Towards this goal, we introduce a reinforcement learning approach that trains movement policies controlling the agent’s camera and microphone placement over time, guided by the improvement in predicted audio separation quality. We demonstrate our approach in scenarios motivated by both augmented reality (system is already co-located with the target object) and mobile robotics (agent begins arbitrarily far from the target object). Using state-of-the-art realistic audio-visual simulations in 3D environments, we demonstrate our model’s ability to find minimal movement sequences with maximal payoff for audio source separation.
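The reward structure can be sketched as the step-wise improvement of a separation-quality measure; the quality function below is a stand-in (negative MSE), not the paper's metric:

```python
# Sketch of the reward idea in active audio-visual source separation (assumed
# functions, not the paper's implementation): the agent is rewarded by the
# step-to-step improvement in the quality of its separated target audio.
import numpy as np

def separation_quality(pred, target):
    # Stand-in quality measure (negative MSE); the paper's measure differs.
    return -float(np.mean((pred - target) ** 2))

def step_reward(prev_pred, curr_pred, target):
    """Positive when the new observation improved target-source separation."""
    return separation_quality(curr_pred, target) - separation_quality(prev_pred, target)

t = np.random.randn(1000)
r = step_reward(t + np.random.randn(1000), t + 0.5 * np.random.randn(1000), t)
```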
Semantic Audio-Visual Navigation
Changan Chen, Ziad Al-Halah and Kristen Grauman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021.
PDF
Project
DOI
arXiv
Supp
@inproceedings{chen2021savi,
  author = {Changan Chen and Ziad Al-Halah and Kristen Grauman},
  title = {{Semantic Audio-Visual Navigation}},
  year = {2021},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {Jun.},
  doi = {10.1109/cvpr46437.2021.01526},
  arxivId = {2012.11583},
}
Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target’s position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based model to tackle this new semantic AudioGoal task, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target. Our model’s persistent multimodal memory enables it to reach the goal even long after the acoustic event stops. In support of the new task, we also expand the SoundSpaces audio simulations to provide semantically grounded sounds for an array of objects in Matterport3D. Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman and Rogerio Feris
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021.
PDF
Project
DOI
arXiv
Supp
@inproceedings{fashioniq2021,
  author = {Hui Wu and Yupeng Gao and Xiaoxiao Guo and Ziad Al-Halah and Steven Rennie and Kristen Grauman and Rogerio Feris},
  title = {{Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback}},
  year = {2021},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {Jun.},
  doi = {10.1109/cvpr46437.2021.01115},
  arxivId = {1905.12794},
}
Conversational interfaces for the detail-oriented retail fashion domain are more natural, expressive, and user friendly than classical keyword-based search interfaces. In this paper, we introduce the Fashion IQ dataset to support and advance research on interactive fashion image retrieval. Fashion IQ is the first fashion dataset to provide human-generated captions that distinguish similar pairs of garment images together with side-information consisting of real-world product descriptions and derived visual attribute labels for these images. We provide a detailed analysis of the characteristics of the Fashion IQ data, and present a transformer-based user simulator and interactive image retriever that can seamlessly integrate visual attributes with image features, user feedback, and dialog history, leading to improved performance over the state of the art in dialog-based image retrieval. We believe that our dataset will encourage further work on developing more natural and real-world applicable conversational shopping assistants.
Learning to Set Waypoints for Audio-Visual Navigation
Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh K. Ramakrishnan and Kristen Grauman
International Conference on Learning Representations (ICLR), May 2021.
PDF
Project
arXiv
@inproceedings{chen2021avwan,
  author = {Changan Chen and Sagnik Majumder and Ziad Al-Halah and Ruohan Gao and Santhosh K. Ramakrishnan and Kristen Grauman},
  title = {{Learning to Set Waypoints for Audio-Visual Navigation}},
  year = {2021},
  booktitle = {International Conference on Learning Representations (ICLR)},
  month = {May},
  arxivId = {2008.09622},
}
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation.
Modeling Fashion Influence from Photos
Ziad Al-Halah and Kristen Grauman
IEEE Transactions on Multimedia (TMM), 2020.
PDF
Project
DOI
@article{al-halah2020b,
  author = {Ziad Al-Halah and Kristen Grauman},
  title = {{Modeling Fashion Influence from Photos}},
  year = {2020},
  journal = {IEEE Transactions on Multimedia (TMM)},
  doi = {10.1109/TMM.2020.3037459},
}
The evolution of clothing styles and their migration across the world is intriguing, yet difficult to describe quantitatively. We propose to discover and quantify fashion influences from catalog and social media photos. We explore fashion influence along two channels: geolocation and fashion brands. We introduce an approach that detects which of these entities influence which other entities in terms of propagating their styles. We then leverage the discovered influence patterns to inform a novel forecasting model that predicts the future popularity of any given style within any given city or brand. To demonstrate our idea, we leverage public large-scale datasets of 7.7M Instagram photos from 44 major world cities (where styles are worn with variable frequency) as well as 41K Amazon product photos (where styles are purchased with variable frequency). Our model learns directly from the image data how styles move between locations and how certain brands affect each other’s designs in a predictable way. The discovered influence relationships reveal how both cities and brands exert and receive fashion influence for an array of visual styles inferred from the images. Furthermore, the proposed forecasting model achieves state-of-the-art results for challenging style forecasting tasks. Our results indicate the advantage of grounding visual style evolution both spatially and temporally, and for the first time, they quantify the propagation of inter-brand and inter-city influences.
Occupancy Anticipation for Efficient Exploration and Navigation
Santhosh K. Ramakrishnan, Ziad Al-Halah and Kristen Grauman
European Conference on Computer Vision (ECCV), Aug. 2020.
Spotlight.
Winner of the 2020 Habitat Challenge (PointNav)
PDF
Project
DOI
arXiv
Supp
@inproceedings{ramakrishnan2020occant,
  author = {Santhosh K. Ramakrishnan and Ziad Al-Halah and Kristen Grauman},
  title = {{Occupancy Anticipation for Efficient Exploration and Navigation}},
  year = {2020},
  booktitle = {European Conference on Computer Vision (ECCV)},
  month = {Aug.},
  doi = {10.1007/978-3-030-58558-7_24},
  arxivId = {2008.09285},
}
State-of-the-art navigation methods leverage a spatial memory to generalize to new environments, but their occupancy maps are limited to capturing the geometric structures directly observed by the agent. We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions. In doing so, the agent builds its spatial awareness more rapidly, which facilitates efficient exploration and navigation in 3D environments. By exploiting context in both the egocentric views and top-down maps our model successfully anticipates a broader map of the environment, with performance significantly better than strong baselines. Furthermore, when deployed for the sequential decision-making tasks of exploration and navigation, our model outperforms state-of-the-art methods on the Gibson and Matterport3D datasets. Our approach is the winning entry in the 2020 Habitat PointNav Challenge.
SoundSpaces: Audio-Visual Navigation in 3D Environments
Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Ithapu, Philip W Robinson and Kristen Grauman
European Conference on Computer Vision (ECCV), Aug. 2020.
Spotlight.
PDF
Project
DOI
arXiv
Supp
@inproceedings{chen2020audionav,
  author = {Changan Chen and Unnat Jain and Carl Schissler and Sebastia Vicenc Amengual Gari and Ziad Al-Halah and Vamsi Ithapu and Philip W Robinson and Kristen Grauman},
  title = {{SoundSpaces: Audio-Visual Navigation in 3D Environments}},
  year = {2020},
  booktitle = {European Conference on Computer Vision (ECCV)},
  month = {Aug.},
  doi = {10.1007/978-3-030-58539-6_2},
  arxivId = {1912.11474},
}
Moving around in the world is naturally a multisensory experience, but today’s embodied agents are deaf—restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce SoundSpaces: a first-of-its-kind dataset of audio renderings based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica), and we instrument Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of real-world scanned environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces, and our work lays groundwork for new research in embodied AI with audio-visual perception.
VisualEchoes: Spatial Image Representation Learning through Echolocation
Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler and Kristen Grauman
European Conference on Computer Vision (ECCV), Aug. 2020.
PDF
Project
DOI
arXiv
Supp
@inproceedings{gao2020visualechoes,
  author = {Ruohan Gao and Changan Chen and Ziad Al-Halah and Carl Schissler and Kristen Grauman},
  title = {{VisualEchoes: Spatial Image Representation Learning through Echolocation}},
  year = {2020},
  booktitle = {European Conference on Computer Vision (ECCV)},
  month = {Aug.},
  doi = {10.1007/978-3-030-58545-7_38},
  arxivId = {2005.01616},
}
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning—monocular depth estimation, surface normal estimation, and visual navigation—with results comparable or even better than heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
From Paris to Berlin: Discovering Fashion Style Influences Around the World
Ziad Al-Halah and Kristen Grauman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, June 2020.
PDF
Project
DOI
arXiv
@inproceedings{al-halah2020,
  author = {Ziad Al-Halah and Kristen Grauman},
  title = {{From Paris to Berlin: Discovering Fashion Style Influences Around the World}},
  year = {2020},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  doi = {10.1109/cvpr42600.2020.01015},
  arxivId = {2004.01316},
}
The evolution of clothing styles and their migration across the world is intriguing, yet difficult to describe quantitatively. We propose to discover and quantify fashion influences from everyday images of people wearing clothes. We introduce an approach that detects which cities influence which other cities in terms of propagating their styles. We then leverage the discovered influence patterns to inform a forecasting model that predicts the popularity of any given style at any given city into the future. Demonstrating our idea with GeoStyle---a large-scale dataset of 7.7M images covering 44 major world cities, we present the discovered influence relationships, revealing how cities exert and receive fashion influence for an array of 50 observed visual styles. Furthermore, the proposed forecasting model achieves state-of-the-art results for a challenging style forecasting task, showing the advantage of grounding visual style evolution both spatially and temporally.
Smile, Be Happy :) Emoji Embedding for Visual Sentiment Analysis
Ziad Al-Halah, Andrew Aitken, Wenzhe Shi and Jose Caballero
IEEE International Conference on Computer Vision Workshops (ICCV), Seoul, Korea, Oct. 2019.
PDF
Project
DOI
arXiv
Dataset
@inproceedings{al-halah2019,
  author = {Ziad Al-Halah and Andrew Aitken and Wenzhe Shi and Jose Caballero},
  title = {{Smile, Be Happy :) Emoji Embedding for Visual Sentiment Analysis}},
  year = {2019},
  booktitle = {IEEE International Conference on Computer Vision Workshops (ICCV)},
  month = {Oct.},
  doi = {10.1109/iccvw.2019.00550},
  arxivId = {1907.06160},
}
Due to the lack of large-scale datasets, the prevailing approach in visual sentiment analysis is to leverage models trained for object classification in large datasets like ImageNet. However, objects are sentiment neutral, which hinders the expected gain of transfer learning for such tasks. In this work, we propose to overcome this problem by learning a novel sentiment-aligned image embedding that is better suited for subsequent visual sentiment analysis. Our embedding leverages the intricate relation between emojis and images in large-scale and readily available data from social media. Emojis are language-agnostic, consistent, and carry a clear sentiment signal, which makes them an excellent proxy to learn a sentiment-aligned embedding. Hence, we construct a novel dataset of 4 million images collected from Twitter with their associated emojis. We train a deep neural model for image embedding using the emoji prediction task as a proxy. Our evaluation demonstrates that the proposed embedding outperforms the popular object-based counterpart consistently across several sentiment analysis benchmarks. Furthermore, without bells and whistles, our compact, effective and simple embedding outperforms the more elaborate and customized state-of-the-art deep models on these public benchmarks. Additionally, we introduce a novel emoji representation based on their visual emotional response, which supports a deeper understanding of the emoji modality and their usage on social media.
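A toy version of the proxy-task setup; the backbone and sizes are placeholders, not the paper's network:

```python
# Sketch of the proxy-task idea (assumed architecture, not the paper's model):
# train an image encoder to predict which emojis accompany an image; the
# penultimate features then serve as a sentiment-aligned embedding.
import torch
import torch.nn as nn

class EmojiEmbeddingNet(nn.Module):
    def __init__(self, n_emojis=64, d=128):
        super().__init__()
        self.backbone = nn.Sequential(            # toy CNN standing in for a deep net
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, d))
        self.classifier = nn.Linear(d, n_emojis)  # multi-label emoji prediction

    def forward(self, img):
        emb = self.backbone(img)                  # sentiment-aligned embedding
        return emb, self.classifier(emb)

model = EmojiEmbeddingNet()
img = torch.randn(4, 3, 64, 64)
emoji_targets = (torch.rand(4, 64) > 0.9).float()  # emojis seen with each image
emb, logits = model(img)
loss = nn.BCEWithLogitsLoss()(logits, emoji_targets)  # proxy training signal
```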
Traversing the Continuous Spectrum of Image Retrieval with Deep Dynamic Models
Ziad Al-Halah, Andreas M. Lehrmann and Leonid Sigal
arXiv, 2019.
PDF
arXiv
@misc{al-halah2019b,
  author = {Ziad Al-Halah and Andreas M. Lehrmann and Leonid Sigal},
  title = {{Traversing the Continuous Spectrum of Image Retrieval with Deep Dynamic Models}},
  year = {2019},
  arxivId = {1812.00202},
}
We introduce the first work to tackle the image retrieval problem as a continuous operation. While the proposed approaches in the literature can be roughly categorized into two main groups: category- and instance-based retrieval, in this work we show that the retrieval task is much richer and more complex. Image similarity goes beyond this discrete vantage point and spans a continuous spectrum among the classical operating points of category and instance similarity. However, current retrieval models are static and incapable of exploring this rich structure of the retrieval space since they are trained and evaluated with a single operating point as a target objective. Hence, we introduce a novel retrieval model that for a given query is capable of producing a dynamic embedding that can target an arbitrary point along the continuous retrieval spectrum. Our model disentangles the visual signal of a query image into its basic components of categorical and attribute information. Furthermore, using a continuous control parameter our model learns to reconstruct a dynamic embedding of the query by mixing these components with different proportions to target a specific point along the retrieval simplex. We demonstrate our idea in a comprehensive evaluation of the proposed model and highlight the advantages of our approach against a set of well-established discrete retrieval models.
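The control-parameter idea reduces to mixing two disentangled embeddings; a sketch under the assumption that category and attribute components are plain vectors:

```python
# Sketch of the continuous retrieval idea (assumed interface, not the paper's
# model): disentangle a query into category and attribute components, then mix
# them with a control parameter alpha to target a point on the retrieval
# spectrum between category-level and instance-like similarity.
import numpy as np

def dynamic_query_embedding(cat_emb, attr_emb, alpha):
    """alpha=0 -> pure category retrieval; alpha=1 -> attribute/instance-like."""
    mix = (1 - alpha) * cat_emb + alpha * attr_emb
    return mix / np.linalg.norm(mix)

def retrieve(query_mix, gallery):
    return np.argsort(gallery @ query_mix)[::-1]     # rank by cosine similarity

gallery = np.random.randn(1000, 128)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
ranks = retrieve(dynamic_query_embedding(np.random.randn(128),
                                         np.random.randn(128), alpha=0.3), gallery)
```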
SPaSe – Multi-Label Page Segmentation for Presentation Slides
Monica Haurilet, Ziad Al-Halah and Rainer Stiefelhagen
IEEE Winter Conference on Applications of Computer Vision (WACV), Hawaii, USA, Jan. 2019.
PDF
Project
DOI
Supp
Slides
@inproceedings{haurilet2019,
  author = {Monica Haurilet and Ziad Al-Halah and Rainer Stiefelhagen},
  title = {{SPaSe – Multi-Label Page Segmentation for Presentation Slides}},
  year = {2019},
  booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
  month = {Jan.},
  doi = {10.1109/wacv.2019.00082},
}
We introduce the first benchmark dataset for slide-page segmentation. Presentation slides are one of the most prominent document types used to exchange ideas across the web, educational institutes and businesses. This document format is marked with a complex layout which contains a rich variety of graphical (e.g. diagram, logo), textual (e.g. heading, affiliation) and structural components (e.g. enumeration, legend). This vast and popular knowledge source is still unattainable by modern machine learning techniques due to the lack of annotated data. To tackle this issue, we introduce SPaSe (Slide Page Segmentation), a novel dataset containing dense, pixel-wise annotations of 25 classes for 2000 slides. We show that slide segmentation reveals some interesting properties that characterize this task. Unlike the common image segmentation problem, disjoint classes tend to have a high overlap of regions, thus posing this segmentation task as a multi-label problem. Furthermore, many of the frequently encountered classes in slides are location sensitive (e.g. title, footnote). Hence, we believe our dataset represents a challenging and interesting benchmark for novel segmentation models. Finally, we evaluate state-of-the-art segmentation networks on our dataset and show that they are suitable for developing deep learning models without any need of pre-training. The annotations will be released to the public to foster further research on this interesting task.
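Because a pixel can carry several labels at once, training uses independent per-class probabilities rather than a softmax; a toy sketch (not the paper's networks):

```python
# Sketch of multi-label pixel-wise segmentation training (assumed toy model,
# not the paper's networks): each pixel gets an independent per-class
# probability via a sigmoid, so overlapping classes can co-occur.
import torch
import torch.nn as nn

n_classes = 25
model = nn.Sequential(                      # toy fully-convolutional predictor
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, n_classes, 1))            # one logit map per class

images = torch.randn(2, 3, 128, 128)
# Multi-label targets: a pixel may belong to several classes at once.
targets = (torch.rand(2, n_classes, 128, 128) > 0.95).float()
loss = nn.BCEWithLogitsLoss()(model(images), targets)
```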
Informed Democracy: Voting-based Novelty Detection for Action Recognition
Alina Roitberg*, Ziad Al-Halah* and Rainer Stiefelhagen
The British Machine Vision Conference (BMVC), Newcastle, UK, Sept. 2018.
(* = equal contribution).
PDF
arXiv
Poster
@inproceedings{roitberg2018,
  author = {Alina Roitberg* and Ziad Al-Halah* and Rainer Stiefelhagen},
  title = {{Informed Democracy: Voting-based Novelty Detection for Action Recognition}},
  year = {2018},
  booktitle = {The British Machine Vision Conference (BMVC)},
  month = {Sept.},
  arxivId = {1810.12819},
}
Novelty detection is crucial for real-life applications. While it is common in activity recognition to assume a closed-set setting, i.e. test samples are always of training categories, this assumption is impractical in a real-world scenario. Test samples can be of various categories including those never seen before during training. Thus, being able to know what we know and what we do not know is decisive for the model to avoid what can be catastrophic consequences. We present in this work a novel approach for identifying samples of activity classes that are not previously seen by the classifier. Our model employs a voting-based scheme that leverages the estimated uncertainty of the individual classifiers in their predictions to measure the novelty of a new input sample. Furthermore, the voting is privileged to a subset of informed classifiers that can best estimate whether a sample is novel or not when it is classified to a certain known category. In a thorough evaluation on UCF-101 and HMDB-51, we show that our model consistently outperforms state-of-the-art in novelty detection. Additionally, by combining our model with off-the-shelf zero-shot learning (ZSL) approaches, our model leads to a significant improvement in action classification accuracy for the generalized ZSL setting.
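A schematic of the voting scheme, with an invented notion of "informed" classifier subsets and a toy threshold; the paper's uncertainty estimates and voting rule are more involved:

```python
# Sketch of voting-based novelty detection (assumed scheme details, not the
# paper's exact model): an ensemble votes a sample "novel" when the classifiers
# best informed about the predicted class are too uncertain about it.
import numpy as np

def is_novel(probs, informed_idx, tau=0.5):
    """probs: (M, C) class probabilities from M ensemble members;
    informed_idx: per-class indices of members trusted for that class (assumed)."""
    pred = int(np.mean(probs, axis=0).argmax())        # ensemble prediction
    voters = probs[informed_idx[pred]]                 # informed subset
    confidence = voters[:, pred]                       # their belief in pred
    votes_novel = confidence < tau                     # uncertain -> novelty vote
    return votes_novel.mean() > 0.5, pred

M, C = 10, 5
probs = np.random.dirichlet(np.ones(C), size=M)
informed = {c: np.arange(M)[: M // 2] for c in range(C)}  # toy "informed" sets
novel, pred = is_novel(probs, informed)
```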
MoQA – A Multi-Modal Question Answering Architecture
Monica Haurilet, Ziad Al-Halah and Rainer Stiefelhagen
European Conference on Computer Vision Workshops (ECCV), Munich, Germany, Sept. 2018.
Winner of the TQA challenge at CVPR 2017
PDF
DOI
Slides
Poster
@inproceedings{haurilet2018,
  author = {Monica Haurilet and Ziad Al-Halah and Rainer Stiefelhagen},
  title = {{MoQA – A Multi-Modal Question Answering Architecture}},
  year = {2018},
  booktitle = {European Conference on Computer Vision Workshops (ECCV)},
  month = {Sept.},
  doi = {10.1007/978-3-030-11018-5_9},
}
Multi-Modal Machine Comprehension (M3C) deals with extracting knowledge from multiple modalities such as figures, diagrams and text. Particularly, Textbook Question Answering (TQA) focuses on questions based on the school curricula, where the text and diagrams are extracted from textbooks. A subset of questions cannot be answered solely based on diagrams, but requires external knowledge of the surrounding text. In this work, we propose a novel deep model that is able to handle different knowledge modalities in the context of the question answering task. We compare three different information representations encountered in TQA: a visual representation learned from images, a graph representation of diagrams and a language-based representation learned from accompanying text. We evaluate our model on the TQA dataset that contains text and diagrams from the sixth grade material. Even though our model obtains competitive results compared to the state-of-the-art, we still witness a significant gap in performance compared to humans. We discuss in this work the shortcomings of the model and show the reason behind the large gap to human performance, by exploring the distribution of the multiple classes of mistakes that the model makes.
Fashion Forward: Forecasting Visual Style in Fashion
Ziad Al-Halah, Rainer Stiefelhagen and Kristen Grauman
IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct. 2017.
PDF
Project
DOI
arXiv
Supp
@inproceedings{Al-Halah2017b,
  author = {Ziad Al-Halah and Rainer Stiefelhagen and Kristen Grauman},
  title = {{Fashion Forward: Forecasting Visual Style in Fashion}},
  year = {2017},
  booktitle = {IEEE International Conference on Computer Vision (ICCV)},
  month = {Oct.},
  doi = {10.1109/iccv.2017.50},
  arxivId = {1705.06394},
}
What is the future of fashion? Tackling this question from a data-driven vision perspective, we propose to forecast visual style trends before they occur. We introduce the first approach to predict the future popularity of styles discovered from fashion images in an unsupervised manner. Using these styles as a basis, we train a forecasting model to represent their trends over time. The resulting model can hypothesize new mixtures of styles that will become popular in the future, discover style dynamics (trendy vs. classic), and name the key visual attributes that will dominate tomorrow’s fashion. We demonstrate our idea applied to three datasets encapsulating 80,000 fashion products sold across six years on Amazon. Results indicate that fashion forecasting benefits greatly from visual analysis, much more than textual or meta-data cues surrounding products.
Automatic Discovery, Association Estimation and Learning of Semantic Attributes for a Thousand Categories
Ziad Al-Halah and Rainer Stiefelhagen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, USA, Jul. 2017.
PDF
DOI
arXiv
Discovered Attributes on ILSVRC2012
ImageNet Zero-Shot Splits
Supp
@inproceedings{Al-Halah2017,
  author = {Ziad Al-Halah and Rainer Stiefelhagen},
  title = {{Automatic Discovery, Association Estimation and Learning of Semantic Attributes for a Thousand Categories}},
  year = {2017},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {Jul.},
  doi = {10.1109/cvpr.2017.543},
  arxivId = {1704.03607},
}
Attribute-based recognition models, due to their impressive performance and their ability to generalize well on novel categories, have been widely adopted for many computer vision applications. However, usually both the attribute vocabulary and the class-attribute associations have to be provided manually by domain experts or a large number of annotators. This is very costly and not necessarily optimal regarding recognition performance, and most importantly, it limits the applicability of attribute-based models to large scale data sets. To tackle this problem, we propose an end-to-end unsupervised attribute learning approach. We utilize online text corpora to automatically discover a salient and discriminative vocabulary that correlates well with the human concept of semantic attributes. Moreover, we propose a deep convolutional model to optimize class-attribute associations with a linguistic prior that accounts for noise and missing data in text. In a thorough evaluation on ImageNet, we demonstrate that our model is able to efficiently discover and learn semantic attributes at a large scale. Furthermore, we demonstrate that our model outperforms the state-of-the-art in zero-shot learning on three data sets: ImageNet, Animals with Attributes and aPascal/aYahoo. Finally, we enable attribute-based learning on ImageNet and will share the attributes and associations for future research.
Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning
Ziad Al-Halah, Makarand Tapaswi and Rainer Stiefelhagen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, Jun. 2016.
PDF
DOI
arXiv
aPaY: GoogLeNet_Feat
AwA: GoogLeNet_Feat
Supp
@inproceedings{Al-Halah2016,
  author = {Ziad Al-Halah and Makarand Tapaswi and Rainer Stiefelhagen},
  title = {{Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning}},
  year = {2016},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {Jun.},
  doi = {10.1109/CVPR.2016.643},
  arxivId = {1610.04787},
}
Collecting training images for all visual categories is not only expensive but also impractical. Zero-shot learning (ZSL), especially using attributes, offers a pragmatic solution to this problem. However, at test time most attribute-based methods require a full description of attribute associations for each unseen class. Providing these associations is time consuming and often requires domain specific knowledge. In this work, we aim to carry out attribute-based zero-shot classification in an unsupervised manner. We propose an approach to learn relations that couples class embeddings with their corresponding attributes. Given only the name of an unseen class, the learned relationship model is used to automatically predict the class-attribute associations. Furthermore, our model facilitates transferring attributes across data sets without additional effort. Integrating knowledge from multiple sources results in a significant additional improvement in performance. We evaluate on two public data sets: Animals with Attributes and aPascal/aYahoo. Our approach outperforms state-of-the-art methods in both predicting class-attribute associations and unsupervised ZSL by a large margin.
Naming TV Characters by Watching and Analyzing Dialogs
Monica-Laura Haurilet, Makarand Tapaswi, Ziad Al-Halah and Rainer Stiefelhagen
IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, March 2016.
PDF
DOI
Slides
Poster
@inproceedings{Haurilet2016,
  author = {Monica-Laura Haurilet and Makarand Tapaswi and Ziad Al-Halah and Rainer Stiefelhagen},
  title = {{Naming TV Characters by Watching and Analyzing Dialogs}},
  year = {2016},
  booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
  month = {March},
  doi = {10.1109/WACV.2016.7477560},
}
Person identification in TV series has been a popular research topic over the last decade. In this area, most approaches either use manually annotated data or extract character supervision from a combination of subtitles and transcripts. However, both approaches have key drawbacks that hinder application of these methods at a large scale – manual annotation is expensive and transcripts are often hard to obtain. We investigate the topic of automatically labeling all character appearances in TV series using information obtained solely from subtitles. This task is extremely difficult as the dialogs between characters provide very sparse and weakly supervised data. We address these challenges by exploiting recent advances in face descriptors and Multiple Instance Learning methods. We propose methods to create MIL bags and evaluate and discuss several MIL techniques. The best combination achieves an average precision over 80% on three diverse TV series. We demonstrate that using only subtitles provides good results on identifying characters in TV series, and we wish to encourage the community to pursue this problem further.
Transfer Metric Learning for Action Similarity using High-Level Semantics
Ziad Al-Halah, Lukas Rybok and Rainer Stiefelhagen
Pattern Recognition Letters (PRL), Jul. 2015.
PDF
DOI
@article{Al-Halah2015d,
  author = {Ziad Al-Halah and Lukas Rybok and Rainer Stiefelhagen},
  title = {{Transfer Metric Learning for Action Similarity using High-Level Semantics}},
  year = {2015},
  journal = {Pattern Recognition Letters (PRL)},
  month = {Jul.},
  doi = {10.1016/j.patrec.2015.07.005},
}
The goal of transfer learning is to exploit previous experiences and knowledge in order to improve learning in a novel domain. This is especially beneficial for the challenging task of learning classifiers that generalize well when only few training examples are available. In such a case, knowledge transfer methods can help to compensate for the lack of data. The performance and robustness against negative transfer of these approaches is influenced by the interdependence between knowledge representation and transfer type. However, this important point is usually neglected in the literature; instead the focus lies on either of the two aspects. In contrast, we study in this work the effect of various high-level semantic knowledge representations on different transfer types in a novel generic transfer metric learning framework. Furthermore, we introduce a hierarchical knowledge representation model based on the embedded structure in the semantic attribute space. The evaluation of the framework on challenging transfer settings in the context of action similarity demonstrates the effectiveness of our approach compared to state-of-the-art.
Hierarchical Transfer of Semantic Attributes
Ziad Al-Halah and Rainer Stiefelhagen
IEEE Computer Vision and Pattern Recognition Workshop on Fine-Grained Visual Categorization (CVPR), Boston, USA, Jun. 2015.
PDF
arXiv
aPaY: CNN-M2k_Feat
aPaY: GoogLeNet_Feat
AwA: CNN-M2k_Feat
AwA: GoogLeNet_Feat
CUB: CNN-M2k_Feat
CUB: GoogLeNet_Feat
CUB: ZSL Test Split
@inproceedings{Al-Halah2015c,
  author = {Ziad Al-Halah and Rainer Stiefelhagen},
  title = {{Hierarchical Transfer of Semantic Attributes}},
  year = {2015},
  booktitle = {IEEE Computer Vision and Pattern Recognition Workshop on Fine-Grained Visual Categorization (CVPR)},
  month = {Jun.},
  arxivId = {1604.00326},
}
Accio: A Data Set for Face Track Retrieval in Movies Across Age
Esam Ghaleb, Makarand Tapaswi, Ziad Al-Halah, Hazım Kemal Ekenel and Rainer Stiefelhagen
ACM International Conference on Multimedia Retrieval (ICMR), Shanghai, China, Jun. 2015.
Short paper, Poster.
PDF
DOI
@inproceedings{Ghaleb2015_Accio,
  author = {Esam Ghaleb and Makarand Tapaswi and Ziad Al-Halah and Hazım Kemal Ekenel and Rainer Stiefelhagen},
  title = {{Accio: A Data Set for Face Track Retrieval in Movies Across Age}},
  year = {2015},
  booktitle = {ACM International Conference on Multimedia Retrieval (ICMR)},
  month = {Jun.},
  doi = {10.1145/2671188.2749296},
}
Video face recognition is a very popular task and has come a long way. The primary challenges such as illumination, resolution and pose are well studied through multiple data sets. However there are no video-based data sets dedicated to study the effects of aging on facial appearance. We present a challenging face track data set, Harry Potter Movies Aging Data set (Accio), to study and develop age invariant face recognition methods for videos. Our data set not only has strong challenges of pose, illumination and distractors, but also spans a period of ten years providing substantial variation in facial appearance. We propose two primary tasks: within and across movie face track retrieval; and two protocols which differ in their freedom to use external data. We present baseline results for the retrieval performance using a state-of-the-art face track descriptor. Our experiments show clear trends of reduction in performance as the age gap between the query and database increases. We will make the data set publicly available for further exploration in age-invariant video face recognition.
Action Unit Intensity Estimation using Hierarchical Partial Least Squares
Tobias Gehrig*, Ziad Al-Halah*, Hazım Kemal Ekenel and Rainer Stiefelhagen
IEEE International Conference on Automatic Face and Gesture Recognition
FG
),
Ljubljana, Slovenia,
May 2015.
(* = equal contribution)
Oral
, Acceptance rate=12.2%.
PDF
DOI
@inproceedings{Gehrig2015_hPLS, author = {Tobias Gehrig* and Ziad Al-Halah* and Hazım Kemal Ekenel and Rainer Stiefelhagen},
title = {{Action Unit Intensity Estimation using Hierarchical Partial Least Squares}},
year = {2015},
booktitle = {IEEE International Conference on Automatic Face and Gesture Recognition (FG)},
month = {May},
doi = {10.1109/FG.2015.7163152},
Estimating action unit (AU) intensities is a challenging problem: AUs exhibit high variation across subjects due to differences in facial plasticity and morphology. In this paper, we propose a novel framework that models individual AUs with a hierarchical regression model. Our approach can be seen as a combination of locally linear Partial Least Squares (PLS) models, each of which learns the relation between visual features and AU intensity labels at a different level of detail. It automatically adapts to the non-linearity in the source domain by adjusting the learned hierarchical structure. We evaluate our approach on the Bosphorus benchmark dataset and show that it outperforms both the 2D state of the art and the plain PLS baseline. Generalization to other datasets is evaluated on the extended Cohn-Kanade dataset (CK+), where our hierarchical model outperforms linear and Gaussian-kernel PLS.
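A two-level toy version of the "locally linear PLS" idea can make the mechanism concrete: one global (root) PLS model plus local PLS models fitted on regions of the feature space. This sketch uses fixed k-means regions and synthetic data purely for illustration; the paper grows its hierarchy adaptively rather than with a fixed clustering.

# A two-level hierarchical PLS sketch (toy data; not the paper's algorithm).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))                     # visual features per face
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)   # non-linear "AU intensity"

root = PLSRegression(n_components=5).fit(X, y)

# Partition the source domain and fit one locally linear PLS per region.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
local = {c: PLSRegression(n_components=5).fit(X[km.labels_ == c],
                                              y[km.labels_ == c])
         for c in range(4)}

def predict(x):
    """Route a sample to its region's local model; the root is the fallback."""
    c = km.predict(x.reshape(1, -1))[0]
    return local[c].predict(x.reshape(1, -1)).ravel()[0]

x_test = rng.normal(size=40)
print("root:", root.predict(x_test.reshape(1, -1)).ravel()[0],
      "local:", predict(x_test))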
How to Transfer? Zero-Shot Object Recognition via Hierarchical Transfer of Semantic Attributes
Ziad Al-Halah and Rainer Stiefelhagen
IEEE Winter Conference on Applications of Computer Vision
WACV
),
Waikoloa Beach, HI, USA,
Jan. 2015.
Oral+Poster, Acceptance rate=36.7%,
Winner of ICVSS 2015 Best Presentation Award
PDF
DOI
arXiv
aPaY: CNN-M2k_Feat
aPaY: GoogLeNet_Feat
AwA: CNN-M2k_Feat
AwA: GoogLeNet_Feat
CUB: CNN-M2k_Feat
CUB: GoogLeNet_Feat
CUB: ZSL Test Split
@inproceedings{Al-Halah2015_how2transfer, author = {Ziad Al-Halah and Rainer Stiefelhagen},
title = {{How to Transfer? Zero-Shot Object Recognition via Hierarchical Transfer of Semantic Attributes}},
year = {2015},
booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
month = {Jan.},
doi = {10.1109/WACV.2015.116},
arxivId = {1604.00326},
Attribute-based knowledge transfer has proven very successful in visual object analysis and in learning previously unseen classes. However, the common approach learns and transfers attributes without taking into account the structure embedded among the categories in the source set, even though this information provides important cues about intra-attribute variations. We propose to capture these variations in a hierarchical model that expands the knowledge source with additional abstraction levels of attributes. We also provide a novel transfer approach that can choose the appropriate attributes to share with an unseen class. We evaluate our approach on three public datasets: aPascal, Animals with Attributes, and CUB-200-2011 Birds. The experiments demonstrate the effectiveness of our model, with significant improvement over the state of the art.
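To see what "the common approach" refers to, here is a minimal direct-attribute-prediction style sketch: per-attribute classifiers trained on seen classes score a test image, and the unseen class whose attribute signature matches best wins. This is the flat baseline the paper improves upon; its hierarchical transfer of attributes is not shown, and all signatures and data below are hypothetical.

# A minimal flat attribute-based zero-shot recognition sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 50

# Binary attribute signatures of the seen (source) classes; every attribute
# column contains both 0s and 1s so each attribute classifier is trainable.
seen_sigs = np.array([[1, 0, 1, 0, 1, 0],
                      [0, 1, 0, 1, 0, 1],
                      [1, 1, 0, 0, 1, 1],
                      [0, 0, 1, 1, 0, 0]], dtype=float)
n_attr = seen_sigs.shape[1]

# Toy images: class signature pushed through a random projection plus noise.
labels = rng.integers(0, len(seen_sigs), size=400)
W = rng.normal(size=(n_attr, dim))
X = seen_sigs[labels] @ W + 0.5 * rng.normal(size=(400, dim))

# One binary classifier per attribute, trained on seen-class images only.
clfs = [LogisticRegression(max_iter=1000).fit(X, seen_sigs[labels][:, j].astype(int))
        for j in range(n_attr)]

# Unseen classes are described only by their attribute signatures.
unseen_sigs = np.array([[1, 0, 0, 1, 1, 0],
                        [0, 1, 1, 0, 0, 1]], dtype=float)

def zero_shot_predict(x):
    # Predicted attribute probabilities for the test image.
    p = np.array([c.predict_proba(x.reshape(1, -1))[0, 1] for c in clfs])
    # Log-likelihood of each unseen signature under the predictions.
    return int(np.argmax(unseen_sigs @ np.log(p + 1e-9)
                         + (1 - unseen_sigs) @ np.log(1 - p + 1e-9)))

x_test = unseen_sigs[1] @ W
print("predicted unseen class:", zero_shot_predict(x_test), "(expected: 1)")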
What to Transfer? High-Level Semantics in Transfer Metric Learning for Action Similarity
Ziad Al-Halah, Lukas Rybok and Rainer Stiefelhagen
International Conference on Pattern Recognition
ICPR
),
Stockholm, Sweden,
Aug. 2014.
Oral
, Acceptance rate=14.1%,
Best Student Paper Award
PDF
DOI
@inproceedings{Al-Halah2014_what2transfer, author = {Ziad Al-Halah and Lukas Rybok and Rainer Stiefelhagen},
title = {{What to Transfer? High-Level Semantics in Transfer Metric Learning for Action Similarity}},
year = {2014},
booktitle = {International Conference on Pattern Recognition (ICPR)},
month = {Aug.},
doi = {10.1109/ICPR.2014.478},
Learning from few examples is a challenging task for which transfer learning has proven beneficial: such a framework exploits previous experience and knowledge to compensate for the lack of training data in a novel domain. Knowledge representation plays a vital role in the type and performance of transfer learning approaches, as well as in their robustness against the negative transfer effect. This aspect is usually overlooked in proposed transfer learning methodologies, where the focus is on either the transfer type or the representation alone. In this work, we study the use of various high-level semantics in transfer metric learning. We propose a generic transfer metric learning framework and analyze the effect of different semantic similarity spaces on the transfer type and on robustness against negative transfer. Furthermore, we introduce a hierarchical knowledge representation model based on the structure embedded in the semantic attribute space. The evaluation of the framework on challenging transfer settings in the context of action similarity demonstrates the effectiveness of our approach.
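The transfer-metric-learning setting itself can be illustrated with a simple shrinkage scheme: when target samples are few, pull the target Mahalanobis metric toward a metric learned on a related source domain. The interpolation weight below plays the role of "how much to transfer"; this is a hypothetical toy construction, not the paper's framework.

# A minimal transfer metric learning sketch via metric shrinkage.
import numpy as np

rng = np.random.default_rng(0)

def mahalanobis_metric(X, eps=1e-2):
    """Regularized inverse covariance as a simple learned metric."""
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    return np.linalg.inv(cov)

M_source = mahalanobis_metric(rng.normal(size=(1000, 10)))  # abundant source data
X_target = rng.normal(size=(8, 10))                         # few target examples
M_target = mahalanobis_metric(X_target)                     # unreliable on its own

lam = 0.8   # trust the source more when target data is scarce
M = (1 - lam) * M_target + lam * M_source

def dist(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

a, b = rng.normal(size=10), rng.normal(size=10)
print("transferred-metric distance:", dist(a, b, M))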
Important Stuff, Everywhere! Activity Recognition with Salient Proto-Objects as Context
Lukas Rybok, Boris Schauerte, Ziad Al-Halah and Rainer Stiefelhagen
IEEE Winter Conference on Applications of Computer Vision
WACV
),
Steamboat Springs CO, USA,
Mar. 2014.
Oral+Poster, Acceptance rate=40%.
PDF
DOI
@inproceedings{Rybok2014_protoobj, author = {Lukas Rybok and Boris Schauerte and Ziad Al-Halah and Rainer Stiefelhagen},
title = {{Important Stuff, Everywhere! Activity Recognition with Salient Proto-Objects as Context}},
year = {2014},
booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
month = {Mar.},
doi = {10.1109/WACV.2014.6836041},
Object information is an important cue for discriminating between activities that draw part of their meaning from context. Most current work either ignores this information or relies on specific object detectors. However, such detectors require a significant amount of training data and complicate the transfer of the action recognition framework to novel domains with different objects and object-action relationships. Motivated by recent advances in saliency detection, we propose to employ salient proto-objects for the unsupervised discovery of object and object-part candidates and to use them as a contextual cue for activity recognition. Our experimental evaluation on three publicly available data sets shows that integrating proto-objects with simple motion features substantially improves recognition performance, outperforming the state of the art.
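The unsupervised region-discovery step can be sketched with a classic saliency method: spectral-residual saliency (Hou & Zhang, 2007), thresholded into connected proto-object regions. This is only one plausible saliency detector assumed for illustration; the choice of detector and the pairing with motion features in the paper may differ.

# A minimal proto-object discovery sketch on a toy image.
import numpy as np
from scipy.ndimage import gaussian_filter, label, uniform_filter

def spectral_residual_saliency(img):
    """Saliency map from the spectral residual of the log-amplitude spectrum."""
    f = np.fft.fft2(img)
    log_amp = np.log(np.abs(f) + 1e-9)
    phase = np.angle(f)
    residual = log_amp - uniform_filter(log_amp, size=3)
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(sal, sigma=2.5)

rng = np.random.default_rng(0)
img = 0.1 * rng.normal(size=(128, 128))
img[40:60, 40:60] += 1.0            # a salient patch embedded in noise

sal = spectral_residual_saliency(img)
mask = sal > sal.mean() + 2 * sal.std()   # threshold into proto-objects
regions, n = label(mask)                  # connected components = candidates
print("proto-object candidates found:", n)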
Learning Semantic Attributes via a Common Latent Space
Ziad Al-Halah, Tobias Gehrig and Rainer Stiefelhagen
International Conference On Computer Vision Theory and Applications
VISAPP
),
Lisbon, Portugal,
Jan. 2014.
Oral
, Full paper, Acceptance rate=17.2%.
PDF
DOI
@inproceedings{Al-Halah2014_attributes, author = {Ziad Al-Halah and Tobias Gehrig and Rainer Stiefelhagen},
title = {{Learning Semantic Attributes via a Common Latent Space}},
year = {2014},
booktitle = {International Conference On Computer Vision Theory and Applications (VISAPP)},
month = {Jan.},
doi = {10.5220/0004681500480055},
Semantic attributes represent knowledge that can be readily transferred to domains where information and training samples are scarce. However, in the classical object recognition setting, where training data is abundant, attribute-based recognition usually performs poorly compared to methods that use image features directly. We introduce a generic framework that considerably boosts the performance of semantic attributes in traditional classification and in knowledge transfer tasks such as zero-shot learning. It combines the discriminative power of visual features with the semantic meaning of attributes by learning a common latent space that joins the two. We also specifically account for attribute correlations in the source dataset in order to generalize more efficiently across domains. Our evaluation of the proposed approach on standard public datasets shows that it is not only simple and computationally efficient, but also performs remarkably better than the common direct attribute model.
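The "common latent space" idea can be sketched with canonical correlation analysis: project the visual view and the attribute view into a shared space and classify by nearest class prototype there. CCA merely stands in for the paper's latent-space model, and all data below is a hypothetical toy construction.

# A minimal joint latent-space sketch using CCA.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_cls, n_attr, dim = 10, 4, 60

# Distinct binary attribute signatures for the classes (bit codes of 1..10).
class_sigs = np.array([[(c >> b) & 1 for b in range(n_attr)]
                       for c in range(1, n_cls + 1)], dtype=float)
labels = rng.integers(0, n_cls, size=500)
W = rng.normal(size=(n_attr, dim))
X = class_sigs[labels] @ W + 0.3 * rng.normal(size=(500, dim))  # visual view
A = class_sigs[labels]                                          # semantic view

# Learn projections of both views into a shared latent space.
cca = CCA(n_components=4).fit(X, A)
Zx, Za = cca.transform(X, A)

# Class prototypes: mean attribute-side embedding per class.
protos = np.stack([Za[labels == c].mean(axis=0) for c in range(n_cls)])

# Classify a new image by nearest prototype in the latent space.
x_test = class_sigs[3] @ W + 0.3 * rng.normal(size=dim)
z = cca.transform(x_test.reshape(1, -1))[0]
print("predicted class:", int(np.argmin(np.linalg.norm(protos - z, axis=1))),
      "(expected: 3)")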