David Harwath

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Apr 08, 2024
Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Mar 25, 2024
Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath

SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

Feb 10, 2024
Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath

Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

Feb 08, 2024
Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath

BAT: Learning to Reason about Spatial Sounds with Large Language Models

Feb 02, 2024
Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

Audio-Visual Neural Syntax Acquisition

Oct 11, 2023
Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Sep 19, 2023
Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

Jun 27, 2023
Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants

Jun 14, 2023
Anuj Diwan, Eunsol Choi, David Harwath

Unit-based Speech-to-Speech Translation Without Parallel Data

May 24, 2023
Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
