
AI/STT, TTS (11)

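VibeVoice: a state-of-the-art open-source text-to-speech model. VibeVoice is a novel framework designed to generate expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses key challenges of conventional text-to-speech (TTS) systems, in particular scalability, speaker consistency, and natural turn-taking. Its core innovation is the use of continuous speech tokenizers (acoustic and semantic) that operate at an ultra-low frame rate of 7.5 Hz. These tokenizers greatly improve computational efficiency on long sequences while effectively preserving audio fidelity. VibeVoice uses a large language model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic detail, in a next-token diffusion framework.. 2025. 9. 7.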
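To make the efficiency claim concrete, the arithmetic below counts how many frames a tokenizer emits for a given audio duration. Only the 7.5 Hz rate comes from the excerpt above; the 50 Hz comparison rate, the 60-minute duration, and the helper name `frames_for` are assumptions chosen for illustration, not values from VibeVoice.

```python
def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of frames a fixed-rate tokenizer emits for the given duration."""
    return round(duration_s * frame_rate_hz)


VIBEVOICE_HZ = 7.5    # frame rate stated in the excerpt
COMPARISON_HZ = 50.0  # hypothetical higher-rate tokenizer, for comparison only

duration_s = 60 * 60  # hypothetical 60-minute recording

low_rate_frames = frames_for(duration_s, VIBEVOICE_HZ)
high_rate_frames = frames_for(duration_s, COMPARISON_HZ)

print(f"7.5 Hz: {low_rate_frames} frames")
print(f"50 Hz:  {high_rate_frames} frames")
print(f"ratio:  {high_rate_frames / low_rate_frames:.2f}x")
```

Under these assumptions, the lower frame rate shortens the sequence the LLM has to process by the ratio of the two rates, which is where the long-sequence savings would come from.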

sesame.com: human-level conversation. https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo Crossing the uncanny valley of conversational voice: At Sesame, our goal is to achieve “voice presence”, the magical quality that makes spoken interactions feel real, understood, and valued. (www.sesame.com) This repository provides fast automatic speech recognition (70x real-time with large-v2) with word-level timestamps and speaker diarization. ⚡️ Batched inference for 70x real-time transcription using whisper large-v2 🪶 f.. 2025. 3. 10.

PaliGemma. https://ai.google.dev/gemma/docs/paligemma?hl=ko PaliGemma | Google for Developers (ai.google.dev) PaliGemma is a lightweight open vision-language model (VLM) inspired by PaLI-3 and built on open components such as the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as input and can answer questions about images with detail and context.. 2024. 5. 19.

OpenVoice: performs TTS in the voice you supply. Paper: https://arxiv.org/abs/2312.01479 Source: https://github.com/myshell-ai/OpenVoice Web: https://research.myshell.ai/open-voice OpenVoice: Versatile Instant Voice Cloning. We introduce OpenVoice, a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. .. 2024. 4. 1.

