Posted by Terence Zhang – Developer Relations Engineer and Lisie Lillianfeld – Product Manager
TalkBack is Android’s screen reader in the Android Accessibility Suite that describes text and images for Android users who have blindness or low vision. The TalkBack team is always working to make Android more accessible. Today, thanks to Gemini Nano with multimodality, TalkBack automatically provides users with blindness or low vision more vivid and detailed image descriptions to better understand the images on their screen.
Increasing accessibility using Gemini Nano with multimodality
Advancing accessibility is a core part of Google’s mission to build for everyone. That’s why TalkBack has a feature to describe images when developers didn’t include descriptive alt text. This feature was powered by a small ML model called Garcon. However, Garcon produced short, generic responses and couldn’t specify relevant details like landmarks or products.
The development of Gemini Nano with multimodality was the perfect opportunity to use the latest AI technology to increase accessibility with TalkBack. Now, when TalkBack users opt in on eligible devices, the screen reader uses Gemini Nano’s new multimodal capabilities to automatically provide users with clear, detailed image descriptions in apps including Google Photos and Chrome, even if the device is offline or has an unstable network connection.
“Gemini Nano helps fill in missing information,” said Lisie Lillianfeld, product manager at Google. “Whether it’s more details about what’s in a photo a friend sent or the style and cut of clothing when shopping online.”
Going beyond basic image descriptions
Here’s an example that illustrates how Gemini Nano improves image descriptions: When Garcon is presented with a panorama of the Sydney, Australia shoreline at night, it might read: “Full moon over the ocean.” Gemini Nano with multimodality can paint a richer picture, with a description like: “A panoramic view of Sydney Opera House and the Sydney Harbour Bridge from the north shore of Sydney, New South Wales, Australia.”
“It’s amazing how Nano can recognize something specific. For instance, the model will recognize not just a tower, but the Eiffel Tower,” said Lisie. “This kind of context takes advantage of the unique strengths of LLMs to deliver a helpful experience for our users.”
Using an on-device model like Gemini Nano was the only feasible way for TalkBack to provide automatically generated, detailed image descriptions, even while the device is offline.
“The average TalkBack user comes across 90 unlabeled images per day, and those images weren’t as accessible before this new feature,” said Lisie. The feature has gained positive user feedback, with early testers writing that the new image descriptions are a “game changer” and that it’s “wonderful” to have detailed image descriptions built into TalkBack.
Balancing inference verbosity and speed
One important decision the Android accessibility team made when implementing Gemini Nano with multimodality was the tradeoff between inference verbosity and speed, which is partially determined by image resolution. Gemini Nano with multimodality currently accepts images at either 512 pixels or 768 pixels.
“The 512-pixel resolution emitted its first token almost two seconds faster than 768 pixels, but the output wasn’t as detailed,” said Tyler Freeman, a senior software engineer at Google. “For our users, we decided a longer, richer description was worth the increased latency. We were able to hide the perceived latency a bit by streaming the tokens directly to the text-to-speech system, so users don’t have to wait for the full text to be generated before hearing a response.”
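The streaming idea can be illustrated with a minimal sketch. It is not TalkBack's implementation: `speakable_chunks`, `stream_to_speech`, the `speak` callback, and the choice of punctuation boundaries are all hypothetical names and assumptions made for illustration. The sketch groups tokens into short phrases as they arrive and hands each phrase to a speech callback right away, so the first words can be spoken while the rest of the description is still being generated.

```python
from typing import Callable, Iterable, Iterator

# Punctuation that ends a speakable phrase (an assumption for this sketch).
BOUNDARIES = (".", "!", "?", ",", ";", ":")


def speakable_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into phrases that can be spoken immediately."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush as soon as the buffered text ends at a phrase boundary.
        if buffer.rstrip().endswith(BOUNDARIES):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()


def stream_to_speech(tokens: Iterable[str], speak: Callable[[str], None]) -> None:
    """Pass each phrase to the (hypothetical) speech callback as it completes."""
    for chunk in speakable_chunks(tokens):
        speak(chunk)
```

In a real app, `speak` would queue the phrase on the platform's text-to-speech engine. Splitting on punctuation is only one possible heuristic; it trades slightly choppier prosody for a faster first spoken word.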
A hybrid solution using Gemini Nano and Gemini 1.5 Flash
TalkBack developers also implemented a hybrid AI solution using Gemini 1.5 Flash. With this server-based AI model, TalkBack can provide the best of on-device and server-based generative AI features to make the screen reader even more powerful.
When users want more details after hearing an automatically generated image description from Gemini Nano, TalkBack gives them the option to hear more by running the image through Gemini 1.5 Flash. When users focus on an image, they can use a three-finger tap to open the TalkBack menu and select the “Describe image” option to send the image to Gemini 1.5 Flash on the server and get even more details.
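The two paths described above can be sketched as follows. This is an illustrative outline only, not TalkBack's code: `HybridDescriber`, `on_focus`, `on_describe_image`, and both describer callables are hypothetical, and the offline fallback is an assumption added for the sketch rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: each takes image bytes and returns a description.
# Neither is a real Gemini API.
Describer = Callable[[bytes], str]


@dataclass
class HybridDescriber:
    on_device: Describer  # stand-in for the on-device model
    server: Describer     # stand-in for the server-based model

    def on_focus(self, image: bytes) -> str:
        """Automatic path: runs locally, so it needs no network."""
        return self.on_device(image)

    def on_describe_image(self, image: bytes, online: bool) -> str:
        """Explicit path: the user asked for more detail from the menu."""
        if not online:
            # Assumption: reuse the local model when the server is unreachable.
            return self.on_device(image)
        return self.server(image)
```

Keeping the automatic path local means the image is sent to the server only when the user explicitly asks for more detail.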
By combining the unique advantages of Gemini Nano’s on-device processing with the full power of cloud-based Gemini 1.5 Flash, TalkBack provides blind and low-vision Android users a helpful and informative experience with images. The “describe image” feature powered by Gemini 1.5 Flash launched to TalkBack users on more Android devices, so even more users can get detailed image descriptions.
Compact model, big impact
The Android accessibility team recommends that developers looking to use Gemini Nano with multimodality prototype and test on a powerful, server-side model first. There, developers can understand the UX faster, iterate on prompt engineering, and get a better idea of the highest quality possible using the most capable model available.
While Gemini Nano with multimodality can include missing context to improve image descriptions, it’s still best practice for developers to provide detailed alt text for all images on their apps or websites. If the alt text is not provided, TalkBack can help fill in the gaps.
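That priority order, developer-written alt text first and a generated description only as a fallback, can be expressed in a few lines. The function name `description_for` and the `generate` callback are hypothetical and chosen only for illustration; this is a sketch of the principle, not of how TalkBack is implemented.

```python
from typing import Callable, Optional


def description_for(alt_text: Optional[str], image: bytes,
                    generate: Callable[[bytes], str]) -> str:
    """Prefer developer-supplied alt text; generate only when it is missing."""
    if alt_text and alt_text.strip():
        return alt_text.strip()
    # Hypothetical fallback: ask a model to describe the unlabeled image.
    return generate(image)
```

Blank or whitespace-only alt text is treated as missing here, which is a design choice of this sketch.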
The Android accessibility team’s goal is to create inclusive and accessible features, and leveraging Gemini Nano with multimodality to provide vivid and detailed image descriptions automatically is a big step towards that. Furthermore, their hybrid approach towards AI, combining the strengths of both Gemini Nano on device and Gemini 1.5 Flash on the server, showcases the transformative potential of AI in promoting inclusivity and accessibility and highlights Google’s ongoing commitment to building for everyone.
Get started
Learn more about Gemini Nano for app development.
This blog post is part of our series: Spotlight Week on Android 15, where we provide resources (blog posts, videos, sample code, and more) all designed to help you prepare your apps and take advantage of the latest features in Android 15. You can read more in the overview of Spotlight Week: Android 15, which will be updated throughout the week.