Puyuan Peng

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Jun 13, 2024
Via arXiv

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Mar 25, 2024
Via arXiv

SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

Feb 10, 2024
Via arXiv

Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

Feb 08, 2024
Via arXiv

BAT: Learning to Reason about Spatial Sounds with Large Language Models

Feb 02, 2024
Via arXiv

Audio-Visual Neural Syntax Acquisition

Oct 11, 2023
Via arXiv

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Sep 19, 2023
Via arXiv

Style-Transfer Based Speech and Audio-Visual Scene Understanding for Robot Action Sequence Acquisition from Videos

Jun 27, 2023
Via arXiv

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

May 19, 2023
Via arXiv

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

May 18, 2023
Via arXiv