WorldGaze smartphone program enhances AI assistants with image processing
Voice-activated assistants like Alexa, Cortana, and Siri may require very specific vocal prompts to provide the desired information. If the person requesting the information uses a camera-equipped smartphone, a new program called WorldGaze can provide visual prompts to AI assistants to supplement voice activation and provide more accurate answers.
WorldGaze, developed by researchers from the Human-Computer Interaction Institute at Carnegie Mellon University (Pittsburgh, PA, USA; www.cmu.edu) and Apple (Cupertino, CA, USA; www.apple.com), taps into a smartphone’s front-facing camera to track the user’s head and estimate the direction of their gaze. The software then projects that gaze direction into the scene captured by the rear-facing camera and uses the resulting line to identify objects or regions of interest (ROI) within the camera’s view.
If the user’s question to the AI assistant seems relevant to an ROI, the assistant can draw upon that information when providing an answer. For example, if the user looks at a restaurant and asks, “When does this close?” the WorldGaze software may identify the specific restaurant by comparing the camera view with Google (Mountain View, CA, USA; www.about.google) Street View images.
If the AI assistant can positively identify the restaurant, it can then look up that restaurant’s closing time on the Web and give the user the correct answer.
The researchers used an iPhone XR with a rear-facing 12 MPixel camera with a 67.3° FOV and a front-facing 7 MPixel camera with a 56.6° FOV for development and testing. The Apple ARKit 3 (bit.ly/VSD-ARKIT) SDK provided a face API for head tracking using the front-facing camera and support for projecting a forward-facing head vector onto an image captured by the rear-facing camera.
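To illustrate how a head vector can land on a pixel in the rear camera image, here is a minimal sketch in Python using a simple pinhole camera model. It is not the researchers' code and does not use ARKit. It assumes the gaze direction has already been expressed in the rear camera's coordinate frame (x right, y down, z forward), that the 67.3° figure is a horizontal field of view, and that the offset between the user's head and the phone can be ignored. The function name and parameters are hypothetical.

```python
import math


def project_gaze_to_pixel(direction, image_width, image_height, horizontal_fov_deg):
    """Project a 3D gaze direction onto the image of an ideal pinhole camera.

    direction: (x, y, z) in the camera frame, x right, y down, z forward.
    Returns (u, v) pixel coordinates, or None if the ray points away from the scene.
    Assumption: the FOV is horizontal and pixels are square.
    """
    x, y, z = direction
    if z <= 0:
        return None  # the gaze ray never crosses the image plane

    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (image_width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)

    # Perspective division, then shift the origin to the image center.
    u = image_width / 2 + focal_px * x / z
    v = image_height / 2 + focal_px * y / z
    return (u, v)


# A gaze pointing straight ahead maps to the center of a 4032 x 3024 frame.
center = project_gaze_to_pixel((0.0, 0.0, 1.0), 4032, 3024, 67.3)
```

A gaze direction tilted by half the field of view lands on the edge of the frame, and anything beyond that falls outside the image, which is one reason a gaze target may not be visible to the rear camera at all.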
The direction of someone’s gaze alone may not precisely identify the subject of an inquiry. For instance, someone might look into a restaurant with a menu displayed on the front window. WorldGaze may not be able to tell whether the person seeks information about the entire restaurant, or the menu specifically, as the person’s gaze may rest on either object.
The researchers therefore incorporated the Apple Vision Framework API to provide object recognition and segmentation to help accurately predict the target of the user’s gaze. The software uses the distance from the line of the user’s gaze to the centroid of each object in the frame to rank targets by confidence, and it weights prediction confidence based on the size of the objects.
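The ranking idea can be sketched in a few lines of Python. The article does not give the actual scoring formula, so the one below is an assumption chosen only for illustration: each object's distance from the gaze point to its centroid is divided by the square root of the object's area, so that a large object is not penalized simply because its centroid is far from where the user is looking. The `DetectedObject` type and `rank_targets` function are hypothetical and do not come from the Vision framework.

```python
import math
from dataclasses import dataclass


@dataclass
class DetectedObject:
    label: str
    centroid: tuple  # (u, v) in pixels
    area: float      # object area in square pixels


def rank_targets(gaze_point, objects):
    """Return (label, confidence) pairs, most likely gaze target first.

    Assumed scoring, not the published one: distance to the centroid,
    normalized by object size, mapped into (0, 1] and then normalized
    so the confidences sum to 1.
    """
    scores = []
    for obj in objects:
        distance = math.dist(gaze_point, obj.centroid)
        size = max(math.sqrt(obj.area), 1.0)  # guard against zero-area objects
        scores.append((obj.label, 1.0 / (1.0 + distance / size)))

    total = sum(score for _, score in scores)
    ranked = [(label, score / total) for label, score in scores]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


scene = [
    DetectedObject("restaurant", (500.0, 400.0), 160000.0),
    DetectedObject("menu", (520.0, 420.0), 10000.0),
]
ranking = rank_targets((520.0, 420.0), scene)
```

With the gaze point resting exactly on the menu's centroid, the menu ranks first under this scoring, but the restaurant stays close behind because it is large and its centroid is nearby. That closeness reflects the ambiguity described above, and a real system would likely need the spoken query to break the tie.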
Finally, WorldGaze integrates with voice-activated assistants by replacing ambiguous nouns like “this” with the identity of the object with the highest gaze probability. The researchers believe their software could eventually run on smart glasses as well as smartphones and be used in streetscape, retail, and smart home and office voice queries.
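A minimal sketch of that substitution step, in Python, might look like the following. It assumes the gaze ranking arrives as a list of (label, confidence) pairs sorted from most to least likely. The list of ambiguous words and the `resolve_query` function are illustrative assumptions, and the article does not describe how WorldGaze actually parses the spoken query.

```python
import re

# Assumed list of ambiguous words; the article only names "this" as an example.
AMBIGUOUS = re.compile(r"\b(this|that|these|those)\b", re.IGNORECASE)


def resolve_query(query, ranked_targets):
    """Replace the first ambiguous word in a query with the top-ranked label.

    ranked_targets: list of (label, confidence) pairs, best candidate first.
    Returns the query unchanged if there is no candidate or no ambiguous word.
    """
    if not ranked_targets:
        return query
    best_label = ranked_targets[0][0]
    # A function is used as the replacement so that backslashes in a label
    # are inserted literally instead of being read as regex escapes.
    return AMBIGUOUS.sub(lambda _match: best_label, query, count=1)


resolved = resolve_query(
    "When does this close?",
    [("the restaurant", 0.6), ("the menu", 0.4)],
)
```

The rewritten query can then be passed to the assistant as ordinary text, which suggests why this approach needs no changes to the assistant itself.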
About the Author

Dennis Scimeca
Dennis Scimeca is a veteran technology journalist with expertise in interactive entertainment and virtual reality. At Vision Systems Design, Dennis covered machine vision and image processing with an eye toward leading-edge technologies and practical applications for making a better world. Currently, he is the senior editor for technology at IndustryWeek, a partner publication to Vision Systems Design.
