
Clone a Voice and Lip-Sync a Video Using Open-Source Tools


Introduction

AI voice-cloning has taken social media by storm. It has opened up a world of creative possibilities. You must have seen memes or AI voice-overs of famous personalities on social media. Have you ever wondered how it's done? Sure, many platforms provide APIs, like Eleven Labs, but can we do it for free, using open-source software? The short answer is YES. The open-source ecosystem has TTS models and lip-syncing tools to achieve voice synthesis. So, in this article, we'll explore open-source tools and models for voice-cloning and lip-syncing.

AI voice cloning and lip syncing using open-source tools

Learning Objectives

  • Explore open-source tools for AI voice-cloning and lip-syncing.
  • Use FFmpeg and Whisper to transcribe videos.
  • Use Coqui-AI's xTTS model to clone a voice.
  • Use Wav2Lip to lip-sync videos.
  • Explore real-world use cases of this technology.

This article was published as a part of the Data Science Blogathon.

Open-Source Stack

As you already know, we'll use OpenAI's Whisper, FFmpeg, Coqui-ai's xTTS model, and Wav2Lip as our tech stack. But before delving into the code, let's briefly discuss these tools. And thanks also to the authors of these projects.

Whisper: Whisper is OpenAI's ASR (Automatic Speech Recognition) model. It is an encoder-decoder transformer model trained on over 650k hours of diverse audio data and the corresponding transcripts, which makes it very capable at multilingual transcription.

The encoder receives the log-mel spectrogram of 30-second chunks of audio. Each encoder block uses self-attention to understand different parts of the audio signal. The decoder then receives the hidden-state information from the encoder along with learned positional encodings. The decoder uses self-attention and cross-attention to predict the next token. At the end of the process, it outputs a sequence of tokens representing the recognized text. For more on Whisper, refer to the official repository.
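To make this concrete, here is a minimal sketch in the style of the repository's quickstart, assuming a placeholder audio file named sample.wav. It pads the audio to the 30-second window, computes the log-mel spectrogram the encoder consumes, detects the language, and decodes the text.

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the encoder expects
audio = whisper.load_audio("sample.wav")   # "sample.wav" is a placeholder file
audio = whisper.pad_or_trim(audio)

# Compute the log-mel spectrogram that the encoder receives
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language from the encoder representation
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode: the decoder predicts tokens using self- and cross-attention
result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)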

Coqui TTS: TTS is an open-source library from Coqui-ai. It hosts several text-to-speech models. It has end-to-end models like Bark, Tortoise, and xTTS, spectrogram models like Glow-TTS, FastSpeech, etc., and vocoders like HiFi-GAN, MelGAN, etc. Moreover, it provides a unified API for inference, fine-tuning, and training of text-to-speech models. In this project, we'll use xTTS, an end-to-end multilingual voice-cloning model. It supports 16 languages, including English, Japanese, Hindi, Mandarin, etc. For more information about TTS, refer to the official TTS repository.
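As a quick illustration of that unified API (separate from this project's pipeline), here is a minimal sketch that loads a single-speaker English model and writes a WAV file. The model name is just one example entry from the catalog; any model listed by the library should work similarly.

from TTS.api import TTS

# Load an example single-speaker English model from the catalog
# (any model name returned by TTS().list_models() follows the same pattern)
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=False)

# Synthesize a sentence straight to a WAV file
tts.tts_to_file(text="Open-source text to speech has come a long way.",
                file_path="sample_tts.wav")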

Wav2Lip: Wav2Lip is a Python repository for the paper "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild." It uses a lip-sync discriminator to recognize face and lip movements. This works great for dubbing voices. For more information, refer to the official repository. We'll use this forked repository of Wav2Lip.

Workflow

Now that we're familiar with the tools and models we'll use, let's understand the workflow. It is a simple one. Here's what we'll do:

  • Upload a video to the Colab runtime and resize it to 720p for better lip-syncing.
  • Use FFmpeg to extract 24-bit audio from the video and use Whisper to transcribe the audio file.
  • Use Google Translate or an LLM to translate the transcribed script into another language.
  • Load the multilingual xTTS model with the TTS library and pass the script and reference audio to the model for voice synthesis.
  • Clone the Wav2Lip repository and download the model checkpoints. Run the inference.py file to sync the original video with the synthesized audio.
"

Now, let's delve into the code.

Step 1: Install Dependencies

This project requires significant RAM and GPU consumption, so it's prudent to use a Colab runtime. The free-tier Colab provides about 12 GB of CPU RAM and a T4 GPU with 15 GB of VRAM. This should be enough for this project. So, head over to Colab and connect to a GPU runtime.
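Optionally, you can confirm that a GPU is actually attached to the runtime before proceeding:

!nvidia-smi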

Now, install TTS and Whisper.

!pip install TTS
!pip install git+https://github.com/openai/whisper.git

Step 2: Upload a Video to Colab

Now, we'll upload a video and resize it to 720p. Wav2Lip tends to perform better when videos are in 720p format. This can be done using FFmpeg.

#@title Upload Video

from google.colab import files
import os
import subprocess

uploaded = None
resize_to_720p = False

def upload_video():
  global uploaded
  global video_path  # Declare video_path as global so we can modify it
  uploaded = files.upload()
  for filename in uploaded.keys():
    print(f'Uploaded {filename}')
    if resize_to_720p:
        filename = resize_video(filename)  # Get the name of the resized video
    video_path = filename  # Update video_path with either the original or resized filename
    return filename


def resize_video(filename):
    output_filename = f"resized_{filename}"
    cmd = f"ffmpeg -i {filename} -vf 'scale=-1:720' {output_filename}"
    subprocess.run(cmd, shell=True)
    print(f'Resized video saved as {output_filename}')
    return output_filename

# Create a button that calls upload_video when clicked and a checkbox for resizing
import ipywidgets as widgets
from IPython.display import display

button = widgets.Button(description="Upload Video")
checkbox = widgets.Checkbox(value=False, description='Resize to 720p (better results)')
output = widgets.Output()

def on_button_clicked(b):
  with output:
    global video_path
    global resize_to_720p
    resize_to_720p = checkbox.value
    video_path = upload_video()

button.on_click(on_button_clicked)
display(checkbox, button, output)

This will display a button for uploading a video from your local system and a checkbox for enabling 720p resizing. You can also upload a video manually to the current Colab session and resize it with a subprocess call.
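For the manual route, a small sketch looks like this (assuming a hypothetical file my_video.mp4 uploaded via the Colab file browser; the FFmpeg filter is the same one used above):

import subprocess

video_path = "my_video.mp4"                 # hypothetical filename of the uploaded video
resized_path = f"resized_{video_path}"

# Downscale to 720p height while preserving the aspect ratio
subprocess.run(f"ffmpeg -i {video_path} -vf 'scale=-1:720' {resized_path}", shell=True)

video_path = resized_path                   # point the rest of the notebook at the resized file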

Step 3: Audio Extraction and Whisper Transcription

Now that we have our video, the next thing we'll do is extract the audio using FFmpeg and use Whisper to transcribe it.

# @title Audio extraction (24-bit) and Whisper transcription
import subprocess

# Ensure the video_path variable exists and is not None
if 'video_path' in globals() and video_path is not None:
    ffmpeg_command = (
        f"ffmpeg -i '{video_path}' -acodec pcm_s24le -ar 48000 -q:a 0 "
        f"-map a -y 'output_audio.wav'"
    )
    subprocess.run(ffmpeg_command, shell=True)
else:
    print("No video uploaded. Please upload a video first.")

import whisper

model = whisper.load_model("base")
result = model.transcribe("output_audio.wav")

whisper_text = result["text"]
whisper_language = result['language']

print("Whisper text:", whisper_text)

This will extract audio from the video in 24-bit format and use the Whisper base model to transcribe it. For better transcription, use the Whisper small or medium models.
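For example, swapping in a larger checkpoint is a one-line change (at the cost of speed and VRAM):

# Optional: a larger checkpoint usually transcribes better, but is slower and uses more VRAM
model = whisper.load_model("medium")   # or "small"
result = model.transcribe("output_audio.wav")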

Step 4: Voice Synthesis

Now, to the voice-cloning part. As mentioned before, we'll use Coqui-ai's xTTS model. It is one of the best open-source models out there for voice synthesis. Coqui-ai also provides many TTS models for different purposes; do check them out. For our use case, which is voice-cloning, we'll use the xTTS v2 model.

Load the xTTS model. It is a big model with a size of 1.87 GB, so this will take a while.

# @title Voice synthesis
from TTS.api import TTS
import torch
from IPython.display import Audio, display  # Import the Audio and display helpers

device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

xTTS currently supports 16 languages. Here are the ISO codes of the languages the xTTS model supports.

print(tts.languages)


['en','es','fr','de','it','pt','pl','tr','ru','nl','cs','ar','zh-cn','hu','ko','ja','hi']

Note: Languages like English and French do not have a character limit, whereas Hindi has a character limit of 250. A few other languages may have a limit as well.

For this project, we'll use Hindi; you can experiment with other languages as well.

So, the first thing we need is to translate the transcribed text into Hindi. This can be done either with the Google Translate package or with an LLM. In my observation, GPT-3.5-Turbo performs significantly better than Google Translate. We can use the OpenAI API to get our translation.

import openai

client = openai.OpenAI(api_key="api_key")
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"translate the texts to Hindi {whisper_text}"}
  ]
)
translated_text = completion.choices[0].message.content
print(translated_text)

As we know, Hindi has a character limit, so we need to do some text pre-processing before passing it to the TTS model. We need to split the text into chunks of fewer than 250 characters.

# Split at the Hindi full stop (।) and pack sentences into <250-character chunks
text_chunks = translated_text.split(sep="।")
final_chunks = [""]
for chunk in text_chunks:
  if not final_chunks[-1] or len(final_chunks[-1]) + len(chunk) < 250:
    chunk += "।"
    final_chunks[-1] += chunk.strip()
  else:
    final_chunks.append((chunk + "।").strip())
final_chunks

This is a very simple splitter. You can create a different one or use LangChain's recursive text splitter, as sketched below. Once the text is split, we'll pass each chunk to the TTS model, and the resulting audio files will be merged using FFmpeg.
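For reference, here is a sketch of the LangChain alternative, assuming langchain is installed; the chunk size and separators below are illustrative choices, not values from the original notebook:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=240,               # stay under the 250-character limit for Hindi
    chunk_overlap=0,
    separators=["।", "\n", " "],  # prefer splitting at the Hindi full stop
)
final_chunks = splitter.split_text(translated_text)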

def audio_synthesis(text, file_name):
  # Clone the voice from output_audio.wav and synthesize the Hindi text
  tts.tts_to_file(
      text,
      speaker_wav='output_audio.wav',
      file_path=file_name,
      language="hi"
  )
  return file_name

file_names = []
for i in range(len(final_chunks)):
    file_name = audio_synthesis(final_chunks[i], f"output_synth_audio_{i}.wav")
    file_names.append(file_name)

As all the files have the same codec, we can easily merge them with FFmpeg. To do this, create a text file (my_files.txt) and add the file paths.

# this is a comment
file 'output_synth_audio_0.wav'
file 'output_synth_audio_1.wav'
file 'output_synth_audio_2.wav'
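Instead of writing the list file by hand, you can also generate it from the file_names list collected in the previous step:

# Write the FFmpeg concat list from the synthesized chunk files
with open("my_files.txt", "w") as f:
    for name in file_names:
        f.write(f"file '{name}'\n")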

Now, run the code below to merge the files.

import subprocess

cmd = "ffmpeg -f concat -safe 0 -i my_files.txt -c copy final_output_synth_audio_hi.wav"
subprocess.run(cmd, shell=True)

This will output the final concatenated audio file. You can also play the audio in Colab.

from IPython.display import Audio, display
display(Audio(filename="final_output_synth_audio_hi.wav", autoplay=False))

Step 5: Lip-Syncing

Now, to the lip-syncing part. To lip-sync our synthetic audio with the original video, we'll use the Wav2Lip repository. To use Wav2Lip, we need to download the model checkpoints. But before that, if you are on a T4 GPU runtime, delete the xTTS and Whisper models from the current Colab session or restart the session.

import torch

# Free GPU memory before loading Wav2Lip
try:
    del tts
except NameError:
    print("Voice model already deleted")

try:
    del model
except NameError:
    print("Whisper model already deleted")

torch.cuda.empty_cache()

Now, clone the Wav2Lip repository and download the model checkpoints.

# @title Dependencies
%cd /content/

!git clone https://github.com/justinjohn0306/Wav2Lip
!cd Wav2Lip && pip install -r requirements_colab.txt

%cd /content/Wav2Lip

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'

!pip install batch-face

Wav2Lip has two models for lip-syncing: wav2lip and wav2lip_gan. According to the authors, the GAN model requires less effort in face detection but produces slightly inferior results. In contrast, the non-GAN model can produce better results with more manual padding and rescaling of the detection box. You can try out both and see which one does better.

Run the inference script with the model checkpoint path, video, and audio files.

%cd /content/Wav2Lip

# This is the detection box padding; adjust it in case of poor results.
# Usually, the bottom one is the biggest issue
pad_top = 0
pad_bottom = 15
pad_left = 0
pad_right = 0
rescaleFactor = 1

video_path_fix = f"'../{video_path}'"

!python inference.py --checkpoint_path 'checkpoints/wav2lip_gan.pth' \
  --face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" \
  --pads $pad_top $pad_bottom $pad_left $pad_right --resize_factor $rescaleFactor --nosmooth \
  --outfile '/content/output_video.mp4'

This will output a lip-synced video. If the result doesn't look good, adjust the parameters and retry.
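Optionally, you can download the result from the Colab session to your local machine:

from google.colab import files
files.download('/content/output_video.mp4')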

So, here is the repository for the notebook and a few samples.

GitHub Repository: sunilkumardash9/voice-clone-and-lip-sync

Real-World Use Cases

Video voice-cloning and lip-syncing technology have many use cases across industries. Here are a few cases where it can be helpful.

Entertainment: The entertainment industry will be the most affected of all. We are already witnessing the change. Voices of celebrities of current and bygone eras can be synthesized and re-used. This also poses ethical challenges. The use of synthesized voices should be done responsibly and within the bounds of the law.

Marketing: Personalized ad campaigns with familiar and relatable voices can greatly enhance brand appeal.

Communication: Language has always been a barrier to all kinds of activities. Cross-language communication is still a challenge. Real-time end-to-end translation while preserving one's accent and voice will revolutionize the way we communicate. This might become a reality in a few years.

Content Creation: Content creators will no longer depend on translators to reach a bigger audience. With efficient voice cloning and lip-syncing, cross-language content creation will be easier. Podcast and audiobook narration experiences can be enhanced with voice synthesis.

Conclusion

Voice synthesis is one of the most sought-after use cases of generative AI. It has the potential to revolutionize the way we communicate. Ever since the advent of civilizations, the language barrier between communities has been a hurdle to forging deeper relationships, culturally and commercially. With AI voice synthesis, this gap can be filled. So, in this article, we explored the open-source approach to voice-cloning and lip-syncing.

Key Takeaways

  • TTS, a Python library by Coqui-ai, serves and maintains popular text-to-speech models.
  • xTTS is a multilingual voice-cloning model capable of cloning a voice into 16 different languages.
  • Whisper is an ASR model from OpenAI for efficient transcription and English translation.
  • Wav2Lip is an open-source tool for lip-syncing videos.
  • Voice cloning is one of the most active frontiers of generative AI, with a significant potential impact on industries from entertainment to marketing.

Frequently Asked Questions

Q1. Is AI voice cloning legal?

A. Cloning a voice can be illegal if it infringes on copyright. However, getting permission from the person before cloning their voice is the right way to go about it.

Q2. Is AI voice cloning free?

A. Most AI voice cloning API services charge fees. However, some open-source models can provide fairly decent voice synthesis capability.

Q3. What is the best voice cloning model?

A. This depends on the particular use case. The xTTS model is a good choice for multilingual voice synthesis. But for more languages, Meta's Fairseq models may be preferable.

Q4. Can AI clone celebrity voices?

A. Yes, it is possible to clone the voice of a celebrity. However, be mindful that any misuse can land you in legal trouble.

Q5. What are the uses of voice cloning?

A. Voice cloning can be helpful for a range of use cases, such as content creation, narration in games and movies, ad campaigns, etc.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.
