Google Assistant (Local Discovery, Driving, Transit)
How can we ensure the Google Assistant works everywhere and for everyone?
The Geo Assistant aims to be an effective personal companion for the real world – whether that’s providing useful information for the morning commute, suggesting local places to explore, or answering questions people have about their surroundings.
Given this, how would we make sure we’re actually providing value and solving genuine problems – not just for people in Silicon Valley, but also for people and communities worldwide?
And how might we measure our progress in doing so over time?
To do this, we conducted a UX baseline lab study with external participants in 4 cities worldwide: Tokyo 🇯🇵, New Delhi 🇮🇳, New York City 🇺🇸, and Mountain View 🇺🇸. We focused on how production versions of the Geo Assistant perform for our critical user journeys in Commute, Transit, and Local Answers & Actions. The goal of this study was to understand Geo Assistant performance and establish a baseline to track UX progress (e.g., HEART metrics) every six months.
The study combined participant self-reported metrics (captured through a survey administered after each task) and researcher-reported metrics (e.g., task success, discoverability of relevant features).
Happy to explain further details in person; confidential information is not included here.
Baseline Study Approach
Study Goal: Understand how production versions of the Geo Assistant perform worldwide, and establish a baseline to track UX progress every six months.
Challenges we addressed along the way:
Localization: A one-size-fits-all script wouldn’t be relevant in all places around the world; international communities have their own tastes and needs. So we start with a skeleton script as a base, then tailor the specific destinations and tasks from there. We also work with translators and international teammates to make sure we meet users where they are, which adds time to the preparation work.
Logistics: Coordinating with research partners and third-party vendors who have the localization expertise, but may not be as familiar with this particular digital product or experience, or its context and history. Bringing them up to speed takes a dedicated onboarding process.
Managing (enthusiastic) stakeholder expectations: The great news is that nearly all of our stakeholders were excited about this research and wanted to get involved. But as we juggled asks from different roles (e.g., engineering asks and product manager asks) and different teams/product areas (each with a multitude of user journeys they wanted to test), the list of requests and potential study questions ballooned in scope. We only had one hour scheduled with each participant, which wasn’t enough time to cover every task on the list. By involving the diverse stakeholders early in the script-writing process, we narrowed the asks down to ten scenarios/tasks, plus some bonus tasks if there was extra time.
Deliverables & Impact
This was the first international baseline study for the Geo Assistant team.
Informed 2020 product strategy on themes related to personalization and a more helpful, proactive Assistant. Teams from four different product areas (Transit, Driving, Local Answers + Actions, Next Billion Users) were present for the presentation and follow-up discussion. Stakeholders included product managers, interaction designers, engineers, and data scientists & analysts.
Captured 10 bugs in production versions of the Geo Assistant for the engineering team.
Team research impact: Serves as a template for future studies, especially since the team plans to repeat this study every six months to see how far the Geo Assistant has improved with each product release cycle.
Organizational research impact: Influenced other Google Assistant teams and organizations that were looking for ways to establish their own baselines for measuring product performance over time.
Research communication: UX baseline scores were integrated into a dashboard shared with the engineering team, where the UX success metrics now live alongside other key performance metrics.
Reflections & Learnings
Evaluating the conversational experience over time can be tricky. It’s hard to know what to make of, say, the difference between a 4.2 and a 4.5 on a self-reported metric like Ease of Use. Traditional usability metrics don’t translate well to a conversational experience. The real value in these kinds of research projects is the qualitative feedback; the quantitative scores serve as preliminary measures to track over time.
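As a rough illustration (not the team’s actual analysis), here is a minimal sketch, assuming hypothetical 1–5 Likert ratings and a typical lab-study sample size, of how a quick Welch’s t-test can show that a 4.2-vs-4.5 gap may sit within the noise:

```python
# Illustrative only: hypothetical Likert ratings, not data from this study.
from scipy import stats

# Hypothetical per-participant "Ease of Use" ratings (1-5 scale) from two
# study rounds, at a typical lab-study sample size (n=12 per round).
round_1 = [4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 4, 4]   # mean ~= 4.2
round_2 = [5, 4, 5, 4, 5, 5, 4, 4, 5, 4, 5, 4]   # mean ~= 4.5

# Welch's t-test (does not assume equal variances between rounds).
t_stat, p_value = stats.ttest_ind(round_1, round_2, equal_var=False)
print(f"mean round 1: {sum(round_1) / len(round_1):.2f}")
print(f"mean round 2: {sum(round_2) / len(round_2):.2f}")
print(f"p-value: {p_value:.3f}")  # typically > 0.05 at this sample size
```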
Scaling the traditional lab study to work internationally and not just locally is hard, but so worth it. It also takes a while: weaving all the separate studies into a single narrative can get incredibly tricky, or at least time-consuming.
Evaluations across cultures are not directly comparable. Participants in some locales are, on average, more forgiving in their ratings than others.
How often should we re-evaluate and run the baseline again? The original timeline we proposed was once every six months. But are all teams shipping on a six-month cycle, or pushing enough changes to warrant another study? Maintaining these kinds of measurements takes a lot of extra effort.