Our Technology

The following demos show our technologies and their use cases.

Control artificial beings by text script

Anyone who can write a script can control artificial beings and make them speak.

Most existing artificial beings require voice actors, animators, or surrogate actors in the background to make them come alive. We are developing artificial intelligence technology to drive them through a text script.

Expressive content such as speech, facial expressions, and gestures is not simply mapped one-to-one from a given text: a human actor can perform the same script in different ways. Our key technology generates speech and other expressions the way professional actors do, by understanding various kinds of contextual information.

Voice cloning and emotional speech synthesis

MBC Documentary “Meeting You”

Generation of speech, facial expressions and gestures from a text script


Cross lingual voice cloning and dubbing

Moon and Kim's voices in English

Morgan Freeman speaks Korean

Singing voice synthesis

BTS “Fake Love” in Trump’s voice

Research papers

Apr 3, 2021

Diff-TTS - A Denoising Diffusion Model for Text-to-Speech

Abstract: Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvements to its naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps. In order to learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost up the inference speed, we leverage the accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS generates 28 times faster than the real-time with a single NVIDIA 2080Ti GPU.
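The two ideas named in the abstract, noising a mel-spectrogram over diffusion time steps and then visiting fewer steps at inference, can be illustrated with a generic denoising-diffusion sketch. This is not the Diff-TTS implementation: the linear noise schedule, its values, the 400-step length, the stride of 8, and all function names are assumptions chosen for illustration, and the trained denoising network is omitted entirely.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.05):
    """Linearly spaced noise variances (schedule values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] is the running product of (1 - beta) up to step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, noise):
    """Closed-form forward diffusion of a clean frame x0 to step t (0-indexed)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]

def strided_steps(num_steps, stride):
    """Time steps visited, last to first, when only every `stride`-th step is kept."""
    return list(range(num_steps - 1, -1, -stride))

# Hypothetical usage on a made-up 3-bin "mel frame".
betas = linear_betas(400)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
mel_frame = [0.5, -1.0, 0.25]
noise = [rng.gauss(0.0, 1.0) for _ in mel_frame]
noisy = q_sample(mel_frame, 399, alpha_bars, noise)  # almost pure noise at the last step
steps = strided_steps(400, 8)                        # 50 steps instead of 400
```

A real sampler would run a trained network at each of the steps in `steps` to remove noise; visiting fewer steps is what trades a small amount of quality for speed.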

Oct 29, 2018

Large-scale Speaker Retrieval on Random Speaker-variability Subspace

Abstract: This paper describes a fast speaker search system to retrieve segments of the same voice identity in the large-scale data. Locality Sensitive Hashing (LSH) is a fast nearest neighbor search algorithm and the recent study shows that LSH enables quick retrieval of a relevant voice in the large-scale data in conjunction with i-vector while maintaining accuracy. In this paper, we proposed Random Speaker-variability Subspace (RSS) projection to map a data into hash tables. We hypothesized that rather than projecting on random subspace, projecting on speaker variability space would give more chance to put the same speaker representation into the same hash bins, so we can use less number of hash tables. We use Linear Discriminant Analysis (LDA) to generate speaker variability subspace projection matrix. Additionally, a random subset of the speaker in the training data was chosen for speaker label for LDA to produce multiple RSS. From the experimental result, the proposed approach shows 100 times and 7 times faster than the linear search and LSH, respectively.
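For readers unfamiliar with LSH, the baseline that the abstract compares against can be sketched as sign-of-random-projection hashing. This is a minimal illustration, not the paper's system: it uses plain random hyperplanes rather than the LDA-derived RSS projections, and the class name, parameters, and 4-dimensional "voice vectors" are all made up for the example.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, rng):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_key(vec, planes):
    """Bucket key: which side of each hyperplane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

class LSHIndex:
    """Several hash tables, each with its own set of random hyperplanes."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        self.tables = [
            (make_hyperplanes(dim, n_bits, rng), defaultdict(list))
            for _ in range(n_tables)
        ]

    def add(self, label, vec):
        for planes, buckets in self.tables:
            buckets[hash_key(vec, planes)].append(label)

    def query(self, vec):
        """Return labels sharing a bucket with vec in at least one table."""
        candidates = set()
        for planes, buckets in self.tables:
            candidates.update(buckets.get(hash_key(vec, planes), []))
        return candidates

# Hypothetical usage with made-up 4-dimensional voice vectors.
index = LSHIndex(dim=4)
voice_a = [0.9, 0.1, -0.3, 0.5]
voice_b = [-x for x in voice_a]  # points in the opposite direction
index.add("spk_a", voice_a)
index.add("spk_b", voice_b)
candidates = index.query([2.0 * x for x in voice_a])  # same direction as voice_a
```

Only the candidates returned by `query` need an exact comparison, which is where the speed-up over linear search comes from. The abstract's proposal is to choose projection directions that separate speakers well, so that fewer tables are needed.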

Oct 29, 2018

Robust and Fine-grained Prosody Control of End-to-End Speech Synthesis

Abstract: We propose prosody embeddings for emotional and expressive speech synthesis networks. The proposed methods introduce temporal structures in the embedding networks, which enable fine-grained control of the speaking style of the synthesized speech. The temporal structures could be designed either in speech-side or text-side, which lead to different control resolution in time. The prosody embedding networks are plugged into end-to-end speech synthesis networks, and trained without any other supervision except the target speech for synthesizing. The prosody embedding networks learned to extract prosodic features. By adjusting the learned prosody features, we could change the pitch and amplitude of the synthesized speech both in frame level and phoneme level. We also introduce temporal normalization of prosody embeddings, which shows better robustness against speaker perturbation in prosody transfer tasks.

Jun 3, 2018

Voice Imitating Text-to-Speech Neural Networks

Abstract: We propose a neural text-to-speech (TTS) model that can imitate a new speaker's voice using only a small amount of speech sample. We demonstrate voice imitation using only a 6-seconds long speech sample without any other information such as transcripts. Our model also enables voice imitation instantly without additional training of the model. We implemented the voice imitating TTS model by combining a speaker embedder network with a state-of-the-art TTS model, Tacotron. The speaker embedder network takes a new speaker's speech sample and returns a speaker embedding. The speaker embedding with a target sentence are fed to Tacotron, and speech is generated with the new speaker's voice. We show that the speaker embeddings extracted by the speaker embedder network can represent the latent structure in different voices. The generated speech samples from our model have comparable voice quality to the ones from existing multi-speaker TTS models.

Jun 3, 2018

Learning Pronunciation from a Foreign Language in Speech Synthesis Networks

Abstract: Although there are more than 6,500 languages in the world, the pronunciations of many phonemes sound similar across the languages. When people learn a foreign language, their pronunciation often reflects their native language's characteristics. This motivates us to investigate how the speech synthesis network learns the pronunciation from datasets from different languages. In this study, we are interested in analyzing and taking advantage of multilingual speech synthesis network. First, we train the speech synthesis network bilingually in English and Korean, and analyze how the network learns the relations of phoneme pronunciation between the languages. Our experimental result shows that the learned phoneme embedding vectors are located closer if their pronunciations are similar across the languages. Consequently, the trained networks can synthesize English speaker's Korean speech and vice versa. Using this result, we propose a training framework to utilize information of different language. To be specific, we pre-train a speech synthesis network using dataset from both high-resource language and low-resource language, then we fine-tune the network using the low-resource language dataset. Finally, we conducted more simulations on 10 different languages to show it is generally extendable to other languages.

Nov 15, 2017

Emotional End-to-End Neural Speech Synthesizer

Abstract: In this paper, we introduce an emotional speech synthesizer based on the recent end-to-end neural model, named Tacotron. Despite its benefits, we found that the original Tacotron suffers from the exposure bias problem and irregularity of the attention alignment. Later, we address the problem by utilization of context vector and residual connection at recurrent neural networks (RNNs). Our experiments showed that the model could successfully train and generate speech for given emotion labels.

Make a difference, right now!
