🚀 From Simple Upload To Robust Pipeline: A Developer's Battle With Asynchronous AI
I was losing hours. Conference talks, technical tutorials, team meetings—my video backlog was growing, and I was struggling to extract the key insights efficiently. I thought to myself, "What if I could build a tool that not only summarizes a video but acts as an intelligent research assistant, capable of understanding the content and even searching the web for context?"
That simple question sparked a deep dive into the world of multimodal AI and agentic architectures. What followed was a journey filled with complex challenges, frustrating debugging sessions, and incredible "aha!" moments. This is the story of how I built the Video Summarizer, starting with the foundational pipeline that nearly stumped me.
The Blueprint: Architecting for an Intelligent Future
Before writing a single line of logic, I knew a solid architecture was essential. The goal wasn't just to cobble together a script, but to build a stable, scalable application.
My tech stack of choice was:
- Streamlit: For its unparalleled speed in building beautiful, interactive Python web apps.
- Agno (formerly phidata): As the "brain" or framework to orchestrate the AI agent and its tools.
- Google Gemini: As the state-of-the-art multimodal model for understanding both video and text.
A clean project structure was my first priority to ensure the code remained maintainable.
Video-Summarizer/
├── .env.sample # A template for environment variables
├── app.py # The core Streamlit application logic
├── README.md # The project's main documentation
└── requirements.txt # All Python dependencies
This simple layout separates concerns, defines dependencies, and uses a .env file to handle the GOOGLE_API_KEY securely—a critical first step in any project.
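If you want a concrete picture of what that looks like in code, here is a minimal sketch of loading the key at startup. It assumes the python-dotenv package and the google-generativeai SDK; the exact lines in app.py may differ, but the idea is the same: pull GOOGLE_API_KEY out of the environment and configure the client before anything else runs.

# Sketch: loading the API key (assumes python-dotenv and google-generativeai are installed)
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()  # Copies the values from .env into the process environment.

api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set. Copy .env.sample to .env and add your key.")

genai.configure(api_key=api_key)  # Later upload_file/get_file calls will use this key.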
The First Wall: My Naive Code Meets an Asynchronous Beast
With the blueprint in place, I started with what I thought was the easiest part: uploading a video. My initial logic was simple: get the file from the user, upload it to Google's API, and immediately send it to the Gemini model for analysis.
It failed. Repeatedly.
After hours of debugging, I had my first major breakthrough. I discovered that the Google File API is asynchronous. When you upload a video, the API starts a background job and immediately returns a file object with a state of "PROCESSING". Trying to use that file for analysis right away is like trying to drink a coffee the instant you get the receipt—the barista hasn't even started making it yet.
The solution was to build a polling mechanism: a patient loop that periodically asks the API, "Is it done yet?" until the answer is yes. This required a complete rethinking of the workflow.
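One piece of context before the upload handling: the snippets below assume the video arrives through a standard Streamlit file-uploader widget, roughly like this (the label and accepted extensions here are illustrative, not necessarily what the app uses):

# Sketch: the widget that produces 'video_file' (label and extensions are illustrative)
import streamlit as st

video_file = st.file_uploader("Upload a video file", type=["mp4", "mov", "avi"])

if video_file:
    st.video(video_file)  # Let the user preview the upload.
    # ... the temp-file handling and processing pipeline below go here ...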
First, I needed to handle the uploaded file securely on the server. Python's tempfile module is perfect for this, as it creates a file in a secure location and guarantees a unique name.
# In app.py
import tempfile
from pathlib import Path

# ... (inside the 'if video_file:' block) ...
# The 'with' statement ensures the file object is properly managed.
# 'delete=False' is critical because it gives us control over when to delete the file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as temp_video:
    temp_video.write(video_file.read())
    video_path = temp_video.name  # Store the path for later use.
Next came the core logic to handle the asynchronous upload and polling, wrapped in a try...finally block to guarantee that the temporary file is always deleted, even if something goes wrong.
# In app.py
import time

import streamlit as st
from google.generativeai import upload_file, get_file

try:
    with st.spinner("Processing video and gathering insights..."):
        # 1. Start the Asynchronous Upload
        st.write("Uploading file to Google...")
        processed_video = upload_file(path=video_path)  # Starts the async job.
        st.write(f"File '{processed_video.display_name}' is now processing...")

        # 2. The Patient Polling Loop
        # This loop is the heart of the solution.
        while processed_video.state.name == "PROCESSING":
            time.sleep(5)  # Wait 5 seconds to avoid spamming the API.
            processed_video = get_file(processed_video.name)  # Ask for the latest status.

        # 3. Final Check for Success
        if processed_video.state.name != "ACTIVE":
            raise ValueError(f"Video processing failed. State: {processed_video.state.name}")

        st.write("Video processing complete!")
        # The video is now ready for analysis!
        # ... (Agent execution logic will go here) ...
finally:
    # 4. The Guaranteed Cleanup
    # This block *always* runs, ensuring we clean up the temporary file.
    st.write("Cleaning up temporary files...")
    Path(video_path).unlink(missing_ok=True)
This approach is robust. It provides clear user feedback with st.spinner, handles the asynchronous nature of the API gracefully, and ensures the server stays clean.
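One refinement worth considering, which isn't in the app code above, is an overall timeout on the polling loop so a video stuck in PROCESSING can't hang the app forever. Here's a rough sketch of a helper (the function name and limits are mine) built on the same upload_file/get_file API:

# Sketch: polling with a timeout (wait_for_processing is a hypothetical helper)
import time

from google.generativeai import get_file

def wait_for_processing(file, timeout_s: int = 300, poll_every_s: int = 5):
    """Poll the File API until the file leaves PROCESSING, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while file.state.name == "PROCESSING":
        if time.monotonic() > deadline:
            raise TimeoutError(f"Video was still processing after {timeout_s} seconds.")
        time.sleep(poll_every_s)
        file = get_file(file.name)  # Refresh the file's status.
    if file.state.name != "ACTIVE":
        raise ValueError(f"Video processing failed. State: {file.state.name}")
    return file

With a helper like this, steps 2 and 3 in the block above collapse into a single call: processed_video = wait_for_processing(processed_video).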
From Working to Performant: The Magic of Caching
With the pipeline working, I noticed a new problem: the app was slow. Every time I asked a question, it felt like the entire AI agent was being reloaded from scratch. It was.
Streamlit re-runs the entire script on every user interaction, which meant my Agent object was being re-initialized constantly. The solution was a single, powerful decorator: @st.cache_resource.
# In app.py
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.duckduckgo import DuckDuckGoTools

@st.cache_resource
def initialize_agent() -> Agent:
    """This function now only runs ONCE per user session."""
    return Agent(
        name="Video AI Summarizer",
        model=Gemini(id="gemini-2.0-flash-lite"),
        tools=[DuckDuckGoTools()],  # We'll explore this in Part 2.
        markdown=True,
    )

# The cached agent is retrieved instantly on subsequent reruns.
multimodal_agent = initialize_agent()
By adding this decorator, I told Streamlit to create the Agent object once and store it in cache. On every subsequent interaction, the app retrieves the already-initialized agent instantly. This simple change transformed the user experience from sluggish to seamless.
Onward to True Intelligence
We’ve successfully engineered a solid, performant, and robust pipeline. We can securely handle video uploads, navigate the complexities of asynchronous APIs, and efficiently manage our AI agent's lifecycle.
Want to dive into the code yourself? You can find the entire project, including a detailed README with setup instructions, on my GitHub repository.
But this is just the data pipeline. The real intelligence—the agent's ability to reason, analyze, and use tools—is yet to be built. How do we instruct our agent to perform complex reasoning? And how do we empower it to autonomously use its web search tool to find information that doesn't exist in the video?
In Part 2, we will dive into the heart of the Agno framework. We'll craft the prompts that guide our agent's analysis, enable it to use its search tool intelligently, and bring the full vision of our AI Video Summarizer to life. Stay tuned!