WorldGaze smartphone program enhances AI assistants with image processing

Voice-activated assistants like Alexa, Cortana, and Siri may require very specific vocal prompts to provide the desired information. If the person requesting the information uses a camera-equipped smartphone, a new program called WorldGaze can provide visual prompts to AI assistants to supplement voice activation and provide more accurate answers.

WorldGaze, developed by researchers from the Human-Computer Interaction Institute at Carnegie Mellon University (Pittsburgh, PA, USA; www.cmu.edu) and Apple (Cupertino, CA, USA; www.apple.com), taps into a smartphone’s front-facing camera to track the user’s head and estimate the direction of their gaze. The software then projects that gaze direction into the scene captured by the rear-facing camera and uses the resulting line to identify objects or regions of interest (ROI) within the camera’s view.

If the user’s question to the AI assistant seems relevant to an ROI, the assistant can draw upon that information when providing an answer. For example, if the user looks at a restaurant and asks, “When does this close?” the WorldGaze software may identify the specific restaurant by comparing the view against Google (Mountain View, CA, USA; www.about.google) Street View images.

If the AI assistant can positively identify the restaurant, it can then look up that restaurant’s closing time on the Web and give the user the correct answer.

The researchers used an iPhone XR with a rear-facing 12 MPixel camera with a 67.3° FOV and a front-facing 7 MPixel camera with a 56.6° FOV for development and testing. The Apple ARKit 3 (bit.ly/VSD-ARKIT) SDK provided a face API for head tracking using the front-facing camera and support for projecting a forward-facing head vector onto an image captured by the rear-facing camera.
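The projection step can be illustrated with an ideal pinhole camera model. The following is a minimal Python sketch, not the researchers' ARKit code: it assumes the 67.3° figure is the horizontal FOV, uses a hypothetical 1920 x 1080 frame, treats the gaze ray as starting at the rear camera (ignoring the offset between the user's head and the phone), and the function names are invented for illustration.

```python
import math


def focal_length_px(image_width_px, horizontal_fov_deg):
    """Focal length in pixels for an ideal pinhole camera (assumed model)."""
    half_fov = math.radians(horizontal_fov_deg) / 2.0
    return (image_width_px / 2.0) / math.tan(half_fov)


def project_gaze(direction, image_width_px, image_height_px, horizontal_fov_deg):
    """Project a gaze direction onto the rear-camera image.

    direction: (x, y, z) in an assumed rear-camera frame with x pointing
    right, y pointing down, and z pointing forward into the scene.
    Returns (u, v) pixel coordinates, or None if the ray points away from
    the camera or lands outside the frame.
    """
    x, y, z = direction
    if z <= 0:
        return None  # ray points backward, never crosses the image plane
    f = focal_length_px(image_width_px, horizontal_fov_deg)
    u = image_width_px / 2.0 + f * x / z
    v = image_height_px / 2.0 + f * y / z
    if 0 <= u < image_width_px and 0 <= v < image_height_px:
        return (u, v)
    return None


# Hypothetical frame size; 67.3 degrees is the rear-camera FOV quoted above.
print(project_gaze((0.0, 0.0, 1.0), 1920, 1080, 67.3))
print(project_gaze((0.1, 0.0, 1.0), 1920, 1080, 67.3))
```

A gaze ray pointing straight ahead lands at the image center, while a ray angled further than half the FOV falls outside the frame and returns None.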

The direction of someone’s gaze alone may not precisely identify the subject of an inquiry. For instance, someone might look into a restaurant with a menu displayed on the front window. WorldGaze may not be able to tell whether the person seeks information about the entire restaurant, or the menu specifically, as the person’s gaze may rest on either object.

The researchers therefore incorporated the Apple Vision Framework API to provide object recognition and segmentation to help accurately predict the target of the user’s gaze. The software uses the distance between the line of the user’s gaze and the centroid of each object in the frame to rank targets by confidence and weights the prediction confidence based on the size of the objects.
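The article does not give the scoring formula, so the following Python sketch shows only one plausible way to combine the two ingredients it names: distance from the gaze point to each object's centroid, and object size. The scoring function (distance divided by the square root of the object's pixel area), the `DetectedObject` type, and the sample values are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class DetectedObject:
    label: str
    centroid: tuple   # (u, v) in pixels
    area_px: float    # segmented area in pixels


def rank_targets(gaze_px, objects):
    """Rank objects by an assumed size-weighted gaze-distance score.

    Returns a list of (label, confidence) pairs, highest confidence first,
    with confidences normalized to sum to 1.
    """
    scored = []
    for obj in objects:
        distance = math.dist(gaze_px, obj.centroid)
        # Characteristic size; the floor of 1 pixel avoids division by zero.
        size = max(math.sqrt(obj.area_px), 1.0)
        # Large objects tolerate a larger miss distance than small ones.
        raw = 1.0 / (1.0 + distance / size)
        scored.append((obj.label, raw))
    total = sum(score for _, score in scored)
    ranked = [(label, score / total) for label, score in scored]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical scene: a restaurant facade with a small menu in its window.
scene = [
    DetectedObject("restaurant", (960.0, 540.0), 400_000.0),
    DetectedObject("menu", (1100.0, 600.0), 10_000.0),
]
print(rank_targets((1090.0, 600.0), scene))  # gaze resting near the menu
print(rank_targets((960.0, 540.0), scene))   # gaze at the facade center
```

With this weighting, a gaze point resting close to the menu ranks the menu first even though it also falls within the much larger restaurant facade.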

Finally, WorldGaze integrates with voice-activated assistants by replacing ambiguous nouns like “this” with the identity of the object that has the highest gaze probability. The researchers believe their software could eventually run on smart glasses as well as smartphones and be used in streetscape, retail, and smart home and office voice queries.
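The substitution step can be sketched in a few lines of Python. This is an illustrative assumption rather than the researchers' implementation: the list of ambiguous words, the `resolve_query` function, and the example restaurant name are hypothetical, and a real system would need more careful language handling.

```python
import re

# Assumed set of ambiguous words; the article names only "this".
AMBIGUOUS = re.compile(r"\b(this|that|these|those)\b", re.IGNORECASE)


def resolve_query(query, ranked_targets):
    """Replace the first ambiguous word with the top-ranked object label.

    ranked_targets: list of (label, confidence) pairs, best first.
    Returns the query unchanged if there are no candidate targets.
    """
    if not ranked_targets:
        return query
    top_label = ranked_targets[0][0]
    # A function replacement keeps any backslashes in the label literal.
    return AMBIGUOUS.sub(lambda _match: top_label, query, count=1)


# Hypothetical gaze ranking for the restaurant example.
targets = [("Joe's Diner", 0.7), ("menu", 0.3)]
print(resolve_query("When does this close?", targets))
```

The rewritten query can then be passed to the voice assistant as if the user had named the object explicitly.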

About the Author

Dennis Scimeca
