Generating Voiceover Narration for a Video

text-to-speech, video-processing, vision

Processing a video with GPT-4-Vision to generate matching text and a voiceover

This notebook shows how to use GPT-4's vision capabilities to understand the content of a video, generate text that matches it, and finally convert that text to speech. GPT-4 does not accept video as input directly, but we can use its vision capability and 128K context length to describe the still frames of an entire video at once. This guide has two steps:

1. Using GPT-4 to get a text description of the video's content

2. Generating a voiceover for the video using GPT-4 and the TTS API

To run the code below, first generate an API key from the Gilas user dashboard. Create a new account, or sign in to the dashboard if you already have one. Then go to the API key page and click the "ساخت کلید API" (Create API key) button to create a new key for accessing the Gilas API.

from IPython.display import display, Image, Audio

import cv2  # OpenCV reads the video; install with: !pip install opencv-python
import base64
import time
from openai import OpenAI
import os
import requests

client = OpenAI(
    # Read the key from the environment; the second argument is only a placeholder default.
    api_key=os.environ.get("GILAS_API_KEY", "<create your API key at https://dashboard.gilas.io/apiKey>"),
    base_url="https://api.gilas.io/v1/",  # Gilas APIs
)
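
If `GILAS_API_KEY` is unset, the client is created with no usable key, and the problem only surfaces later as an authentication error. The following is a minimal sketch of failing early instead. `require_api_key` is a hypothetical helper written for this notebook, not part of the `openai` package or the Gilas API.

```python
import os


def require_api_key(env=os.environ, name="GILAS_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    Hypothetical helper for this notebook only.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; create a key in the Gilas dashboard and export it first."
        )
    return key
```

It could then be used as `OpenAI(api_key=require_api_key(), base_url="https://api.gilas.io/v1/")`.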

Using GPT's vision capabilities to get a description of the video

We use OpenCV to extract the frames of a sample nature video that shows wolves hunting a bison:

video = cv2.VideoCapture("data/bison.mp4")

# Read every frame, encode it as JPEG, and keep it as a base64 string.
base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

Output:

1618 frames read.

We display the frames to make sure we read them correctly:

display_handle = display(None, display_id=True)
for img in base64Frames:
    # Replace the displayed image with the next frame to play the video back.
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
    time.sleep(0.025)

Output: the frames are shown one after another in the notebook.

Now that we have the video frames, we can send them to GPT-4-Vision through the Gilas API to generate matching text. (Note that GPT does not need every frame to understand what is happening, so we send only a sample of them):

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
            # Send every 50th frame, resized to 768 pixels.
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::50]),
        ],
    },
]
params = {
    "model": "gpt-4-turbo",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 200,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)

Output:

"🐺 Survival of the Fittest: An Epic Tale in the Snow ❄️ - Witness the intense drama of nature as a pack of wolves face off against mighty bison in a harsh winter landscape. This raw footage captures the essence of the wild where every creature fights for survival. With each frame, experience the tension, the strategy, and the sheer force exerted in this life-or-death struggle. See nature's true colors in this gripping encounter on the snowy plains. 🦬"

Remember to respect wildlife and nature. This video may contain scenes that some viewers might find intense or distressing, but they depict natural animal behaviors important for ecological studies and understanding the reality of life in the wilderness.

Note: the inputs and the model outputs in this example are in English. To get output in Persian, simply ask the model in the prompt to respond in Persian.
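
The description request sends every 50th frame (`base64Frames[0::50]`), which is 33 of the 1618 frames. Rather than hard-coding the step, it can be derived from the number of frames you are willing to send. The sketch below shows one way to do that; `sampling_step` and `sample_frames` are hypothetical helper names, and the right frame budget depends on your model and token limits.

```python
import math


def sampling_step(total_frames, max_frames):
    """Smallest step such that frames[0::step] has at most max_frames items."""
    if total_frames < 0 or max_frames < 1:
        raise ValueError("total_frames must be >= 0 and max_frames >= 1")
    return max(1, math.ceil(total_frames / max_frames))


def sample_frames(frames, max_frames):
    """Return an evenly spaced subset of `frames` with at most `max_frames` items."""
    step = sampling_step(len(frames), max_frames)
    return frames[0::step]
```

For example, `sample_frames(base64Frames, 33)` on the 1618 frames read earlier selects the same frames as `base64Frames[0::50]`.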

Generating a voiceover for the video with GPT-4 and the TTS API

Next, we want a voiceover for this video in the style of David Attenborough. Using the same video frames, we ask GPT for a short script:

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.",
            # Send every 60th frame, resized to 768 pixels.
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::60]),
        ],
    },
]
params = {
    "model": "gpt-4-turbo",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)

Output:

In the vast, white expanse of the northern wilderness, a drama as old as time unfolds. Here, amidst the silence of the snow, the wolf pack circles, their breaths visible as they cautiously approach their formidable quarry, the bison. These wolves are practiced hunters, moving with strategic precision, yet the bison, a titan of strength, stands resolute, a force to be reckoned with.
As tension crackles in the frozen air, the wolves close in, their eyes locked on their target. The bison, wary of every movement, prepares to defend its life. It's a perilous dance between predator and prey, where each step could be the difference between life and death.
In an instant, the quiet of the icy landscape is shattered. The bison charges, a desperate bid for survival as the pack swarms. The wolves are relentless, each one aware that their success depends on the strength of the collective. The bison, though powerful, is outnumbered, its massive form stirring up clouds of snow as it struggles.
It's an epic battle, a testament to the harsh realities of nature. In these moments, there is no room for error, for either side. The wolves, agile and tenacious, work in unison, their bites a chorus aiming to bring down the great beast. The bison, its every heaving breath a testament to its will to survive, fights fiercely, but the odds are not in its favor. 
With the setting sun casting long shadows over the snow, the outcome is inevitable. Nature, in all its raw beauty and brutality, does not show favor. The wolves, now victors, gather around their prize, their survival in this harsh climate secured for a moment longer. It's a poignant reminder of the circle of life that rules this pristine wilderness, a reminder that every creature plays its part in the enduring saga of the natural world.
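
Both requests in this notebook pass frames as `{"image": ..., "resize": 768}` dictionaries. The OpenAI Chat Completions documentation describes image inputs as content parts of type `image_url`, with base64 data wrapped in a data URL. The sketch below builds a message in that form. It assumes the Gilas endpoint accepts the same schema, which this notebook does not confirm, and `build_vision_message` is a hypothetical helper name.

```python
def build_vision_message(prompt, base64_frames, detail="low"):
    """Build one user message: a text part followed by JPEG image parts.

    Hypothetical helper. Each frame must be a base64-encoded JPEG string,
    as produced by the frame-extraction loop in this notebook.
    """
    content = [{"type": "text", "text": prompt}]
    for frame in base64_frames:
        content.append(
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{frame}",
                    "detail": detail,  # "low", "high", or "auto"
                },
            }
        )
    return {"role": "user", "content": content}
```

The result could be passed as `messages=[build_vision_message(prompt, base64Frames[0::60])]` in place of `PROMPT_MESSAGES`.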

Now we can turn the script into audio with the TTS API. To do so, we send a request to the endpoint https://api.gilas.io/v1/audio/speech:

response = requests.post(
    "https://api.gilas.io/v1/audio/speech",
    headers={
        "Authorization": f"Bearer {os.environ['GILAS_API_KEY']}",
    },
    json={
        "model": "tts-1-1106",
        "input": result.choices[0].message.content,
        "voice": "onyx",
    },
)

# Collect the audio bytes chunk by chunk, then play them in the notebook.
audio = b""
for chunk in response.iter_content(chunk_size=1024 * 1024):
    audio += chunk
Audio(audio)

Output: an audio player with the generated narration appears in the notebook.
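
The TTS code keeps the audio only in memory. To reuse the narration later, for example to mux it with the video, the chunks can be written to a file instead. This is a minimal sketch. `save_audio_chunks` and the file name `narration.mp3` are hypothetical, and the `.mp3` extension assumes the endpoint returns MP3 by default, as OpenAI's speech endpoint does.

```python
from pathlib import Path


def save_audio_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` and return the bytes written."""
    total = 0
    with Path(path).open("wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total
```

With the response from the TTS request, this could be called as `save_audio_chunks(response.iter_content(chunk_size=1024 * 1024), "narration.mp3")`. Note that the response body can be consumed only once, so use either this or the in-memory loop, not both.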