Puyuan Peng
Title
Cited by
Year
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
A Baade, P Peng, D Harwath
Interspeech 2022, 2022
Cited by 73, 2022
Word discovery in visually grounded, self-supervised speech models
P Peng, D Harwath
Interspeech 2022, 2022
Cited by 31, 2022
Fast-slow transformer for visually grounding speech
P Peng, D Harwath
ICASSP 2022, 2022
Cited by 26, 2022
Self-supervised representation learning for speech using visual grounding and masked language modeling
P Peng, D Harwath
AAAI 2022 SAS Workshop, 2022
Cited by 24, 2022
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
P Peng, B Yan, S Watanabe, D Harwath
Interspeech 2023, 2023
Cited by 18, 2023
A correspondence variational autoencoder for unsupervised acoustic word embeddings
P Peng, H Kamper, K Livescu
NeurIPS 2020 SAS Workshop, 2020
Cited by 15, 2020
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
P Peng, SW Li, O Räsänen, A Mohamed, D Harwath
Interspeech 2023, 2023
Cited by 3, 2023
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Y Tseng, L Berry*, YT Chen*, I Chiu*, HH Lin*, M Liu*, P Peng*, YJ Shih*, ...
preprint, 2023
Cited by 2, 2023
Zero-shot Video Moment Retrieval With Off-the-Shelf Models
A Diwan*, P Peng*, RJ Mooney (* denotes equal contribution)
NeurIPS 2022 TL4NLP, 2022
Cited by 2, 2022
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
P Peng, PY Huang, D Li, A Mohamed, D Harwath
arXiv preprint arXiv:2403.16973, 2024
Cited by 1, 2024
Audio-Visual Neural Syntax Acquisition
CIJ Lai*, F Shi*, P Peng*, Y Kim, K Gimpel, S Chang, YS Chuang, S Bhati, ...
ASRU 2023, 2023
Cited by 1, 2023
Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos
C Hori, P Peng, D Harwath, X Liu, K Ota, S Jain, R Corcodel, D Jha, ...
Interspeech 2023, 2023
Cited by 1, 2023
Textless phrase structure induction from visually-grounded speech
CI Lai, F Shi, P Peng, Y Kim, K Gimpel, S Chang, YS Chuang, S Bhati, ...
Cited by 1, 2022
SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
HF Wang, YJ Shih, HJ Chang, L Berry, P Peng, H Lee, HM Wang, ...
arXiv preprint arXiv:2402.06959, 2024
2024
Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model
HC Fang, NX Ye, YJ Shih, P Peng, HF Wang, L Berry, H Lee, D Harwath
arXiv preprint arXiv:2402.05819, 2024
2024
BAT: Learning to Reason about Spatial Sounds with Large Language Models
Z Zheng, P Peng, Z Ma, X Chen, E Choi, D Harwath
arXiv preprint arXiv:2402.01591, 2024
2024