No, really. Not all of us in tech, or even those in venture-backed businesses. By "we" I mean growth professionals and our teams.
Here, let me explain.
For those of us connected to the internet, this year has been full of opportunities to revile tech companies. From aiding authoritarian censorship, to facilitating ethnic violence, to shepherding a new generation into nicotine addiction, tech's transgressions read like clichéd dystopian airport fic.
As is the norm whenever large-scale indignation arises, the media has made its rounds in search of the responsible parties. Many of the usual suspects have received their fair share of scorn: venture capitalists, whose success requires that their portfolio companies grow by more than 10% every month; audacious founders, prone to an ends-justify-the-means worldview; capitalism, since free markets don't care about our privacy; and advertising-based business models, wherein there is an unavoidable gap between the goals of users and those of advertisers.
However, none of these factors is new. VCs have always bet on big winners. Founders have always had big heads, a requisite quality given the absurd odds we face. Capitalism has always needed to be checked by regulation, and advertising has been around since 4000 BC.
What is new and powerful, however, is the discipline of growth: leveraging the scientific method to optimize our products. The field's emergence has been powered by two shifts: a proliferation of tools for collecting user data, and the broad acceptance of a central tenet of behavioural economics, namely that seemingly unrelated changes to one's environment can radically shape one's choices.
For the uninitiated, growth comes in two flavours. The first, marketing-based growth, largely centers on building automated systems to acquire users at little to no cost. The second, product-based growth, is focused on testing different versions of an application in order to increase some desired user behaviour.
Pages with only one large call-to-action button? Growth. Countdown timers on checkout pages? Growth. The little flickering image in your browser tab that indicates a new LinkedIn message? Growth. The fact that you sometimes get a new Twitter notification, only to discover it's one you've already viewed? I can't be sure, but I would bet that's growth too.
A classic example of our environment shaping our decisions: countries with "opt-out" organ donation (blue) have far higher organ donor rates (source).
Why We Love Growth
Now, just from the examples I've listed above, you might not have a terribly favourable view of the discipline. However, it's important to note that most growth work actually helps users achieve their goals.
For example, while working at CareGuide, my team changed onboarding so that families would post a job before we drove them to message nannies. Given the choice, most people prefer to start sending messages immediately. However, our data told us that families are more likely to find a caregiver they're happy with if they first go through the work of posting a job. By shaping our product to encourage this choice, we helped them bypass a preference that would have hurt their odds of success.
Growth frequently unearths situations where a person’s higher level goals and their micro-level behaviours are in conflict—at which point we can help make them happier by altering our product to nudge them towards a more fulfilling behaviour.
Further, consider that most of the companies we love that have been born in the past decade wouldn’t have succeeded without employing growth tactics.
- Airbnb scraped phone numbers from Craigslist to lure users over to its platform
- Spotify & Pinterest both employed an invite-only growth hack to acquire initial users
- Internet.org, a non-profit aimed at helping the world obtain internet access, was conceived of as an elaborate growth tactic designed to help Facebook acquire more users
Growth plays a crucial role in enabling innovation in a world dominated by near-monopolies (Facegoogapplazon), which can always outspend competitors on traditional marketing channels.
For those still unconvinced that growth can be used for good, I'll instead argue that, practically speaking, the toolkit is far too powerful to ever go away. History tells us that prohibition is a terribly ineffective policy. As such, we stand to benefit from interrogating the strengths and shortcomings of this relatively nascent field.
Why Growth Is a Part of the Problem
As you likely expected, there are plenty of shortcomings for us to discuss. The click-baity hook at the start of this article claimed that growth as a discipline has played a major role in fomenting society's distrust of tech. This isn't due to the questionable values of practitioners so much as to three flaws inherent in our tools.
1. We measure—and optimize for—activities, not internal states
When you "like" a photo, we don't know if you did so because it made you laugh or inspired a sense of guilt. Despite these two realities having drastically different effects on the user, the algorithmic input is the same: "show me more of this."
In most cases, greater engagement with a thing implies affection for it (think books, sports equipment, etc.). However, what we've come to learn about dopamine-rich activities like eating chocolate or posting photos is that we'll often participate in an activity even when it makes us feel worse about ourselves.
This potential disconnect is even greater when we look back over a long period of time. A runner might not have felt bad about skipping her stretches on any one run, until months later when she injures her knee in a race. In general, humans make bad decisions when there are long feedback loops between outcomes and our actions.
Given that we have very few tools to access motivational data, and even fewer that can account for a substantial lag between activity and regret, growth teams with the best of intentions can end up optimizing for things that aren't good for their users.
2. We need to work quickly in order to have an impact
A key component of an effective growth team is speed. Typically the improvements growth brings to a product are incremental, generating between 1 and 10 percent gains in some metric. Combine this with the fact that, statistically speaking, roughly half of all tests will end up hurting the metric they are trying to improve, and you can see why a growth team needs to move quickly in order to have an impact.
As such, most experiments are evaluated on the scale of weeks (or days, if you have Facebook-esque scale). However, this gives little opportunity to surface how a user's response might shift over time.

Imagine we launch an experiment that initially increases messages sent per user by 20%, only to cause user fatigue such that a lower baseline message-send rate is established (Figure 1). Depending on how many new users start every day, and the variance among them, analysis done at the one-week mark has a high likelihood of incorrectly suggesting that the variant is a significant improvement.
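This scenario can be sketched with a toy simulation (all rates below are assumptions for illustration): each cohort of new users sends messages at an elevated rate during its first week in the variant, then fatigues to below the control baseline.

```python
CONTROL_RATE = 10.0   # assumed: avg messages/user/week under control
NOVELTY_RATE = 12.0   # assumed: +20% lift during a cohort's first week
FATIGUED_RATE = 9.0   # assumed: post-fatigue baseline, below control

def variant_rate(weeks_exposed: int) -> float:
    """Messages/user/week for a cohort, by how long it has seen the variant."""
    return NOVELTY_RATE if weeks_exposed < 1 else FATIGUED_RATE

def observed_lift(weeks_running: int) -> float:
    """Average lift vs. control across all cohorts active at analysis time,
    assuming one equally sized cohort of new users starts each week."""
    rates = [variant_rate(w) for w in range(weeks_running)]
    return sum(rates) / len(rates) / CONTROL_RATE - 1

for week in (1, 2, 4, 12):
    print(f"week {week:>2}: observed lift = {observed_lift(week):+.1%}")
```

Evaluated at week one, the variant looks like a clear win; by week four the fatigued cohorts dominate and the true, negative effect emerges.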
Figure 1. A common "overstimulation" pattern.
Aside: we see similar overstimulation patterns in our neurobiology, when an excitatory agonist (i.e. stimulant) is mitigated by downregulation of the receptors it acts on, leading to a lower baseline excitability for said neuron.
The end result here is that the timelines necessary for growth to be effective put the discipline at risk of sacrificing long term success in the name of short term gain.
3. We make decisions based on singular KPIs
Almost universally, growth squads are focused on improving a single metric. This is not only practical in ensuring focus, it's actually best statistical practice (so as to avoid the multiple comparisons problem).

This approach makes us prone to ignoring second-order effects and accepting trade-offs with ancillary metrics that matter to the business. To provide a too-simple-for-real-life example, one could change Tinder's algorithm so that every second profile was a famous celebrity. While this might increase right swipes per user, it would almost certainly decrease matches.
At the next level of abstraction, both the tests and the solutions are chosen by algorithms, often at a scale that prevents meaningful human supervision of trade-offs. As a result of this, and the fact that nearly all systems demonstrate diminishing returns, it isn't uncommon for a growth team to over-optimize for some metric at the expense of the macro-scale success of their users.
Too much of a good thing
Fixing The Problem Pt. 1 - Tactics
Thankfully, the thoughtful growth practitioner can mitigate these risks. The next section will focus on minimizing how often growth teams make decisions with unintended, or misunderstood, consequences. If this isn’t your jam, feel free to skip ahead to a brief final commentary on the ways we can engineer social norms for good.
1. Make cohort-based data visualization a key part of your test acceptance process
When a user's response changes based on how long they've been exposed to some variant, it's easy to draw the wrong conclusion before user adaptation has run its course. To minimize this risk, make cohort-based data visualization a key part of your test acceptance criteria. Look to answer:
- “Has the response of this first (or second, or third) group of users changed over time in a way that could skew my results?” Hopefully not.
- “Do I feel confident that the response from this group has reached a steady-state that will persist into the future?” Hopefully so.
If your answers differ, keep the test running! As someone who's been kept up at night wondering "Did I call that test too early?", I speak from experience: you'll be glad you waited.
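One way to answer both questions at a glance is a cohort-by-age response table: rows for when users entered the test, columns for weeks of exposure. A minimal pandas sketch, with column names and data assumed for illustration:

```python
import pandas as pd

# Assumed schema: one row per user-week, recording the week the user first
# saw the variant ("cohort"), weeks of exposure so far ("age"), and outcome.
events = pd.DataFrame({
    "cohort":    [1, 1, 1, 2, 2, 3],
    "age":       [0, 1, 2, 0, 1, 0],
    "converted": [1, 1, 0, 1, 0, 1],
})

# Rows: when users joined the test. Columns: weeks of exposure.
# A healthy test shows each row flattening toward a similar steady state.
cohort_table = events.pivot_table(
    index="cohort", columns="age", values="converted", aggfunc="mean"
)
print(cohort_table)
```

If the later columns of each row keep drifting, the response hasn't reached steady state and the test should keep running.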
2. Make test analysis a mini peer review process
To start, if at all possible, don't have the person who designed the test analyze whether it was a win. While larger companies typically have a data team that can own this activity, smaller startups often don't.
In this case, consider having team members assess one-another’s projects. Often, this intra-team evaluation will not only increase best practice sharing, but will push team members to be more rigorous in their experimental design (social pressure is a funny thing).
In either case, test review should look similar to peer review in academic journals. The reviewers should look for holes in experimental design and statistical methodology, while also considering potential measures of negative second-order effects. For the most critical changes, this analysis can then be shared with other experts (for example, other members of your data team) who may have some additional context or understanding that could add colour to the analysis.
3. Codify basic statistical tests & norms
Again, more a problem with small-to-mid-sized teams than large companies, but codifying which tests your company should use in all the most common cases (and making the code/scripts openly available) will help to mitigate risks associated with fishing-for-significance across multiple tests, or using sub-optimal analysis.
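As a concrete example of what codifying might look like, here is a shared two-proportion z-test (equivalent to the 2×2 chi-squared) that a team could adopt as its default A/B analysis script; the significance level and the example counts are assumptions, not prescriptions:

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05  # assumed company-wide significance level

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: the two conversion rates are equal."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical test: 5.0% vs 6.5% conversion on 4,000 users per arm.
p = two_proportion_z(conv_a=200, n_a=4000, conv_b=260, n_b=4000)
print(f"p = {p:.4f}, significant at {ALPHA}: {p < ALPHA}")
```

Sharing one such script (rather than each person reaching for a different calculator) is what closes the door on fishing-for-significance.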
Some examples that come to mind:
- Does your company use a chi-squared test or Bayesian methods when A/B testing?
- What statistical power and significance level do you use as your baselines?
- Do you have a well-understood minimum time an experiment should be live?
- For teams with fewer resources, when are you willing to accept results derived from Google Analytics or Optimize rather than your own database?

4. Build in periodic long-term hold-out tests
A hold-out test is the process by which a cohort of users is chosen not to receive some set of changes. Often companies will employ hold-out tests when they see a surprisingly large win and want to understand the degree to which it will fade.
Beyond the fact that things which seem too good to be true generally are, the understanding here is that a larger change in user behaviour has both potential for fatigue and significant opportunity for undesirable second-order outcomes.
However, equally pernicious, and harder to detect, is a slow slide into these negative second-order outcomes. To help account for this, teams can make it a habit to create a hold-out cohort at the start of every year. If 365 days later this group is testing better across some metrics than your “control” group, you’ve got an early warning sign that your team may be over optimizing for certain metrics at the expense of broader application health.
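Creating that annual cohort can be as simple as a deterministic hash of user IDs, so membership stays stable without storing any extra state. A sketch, where the 5% holdout size and the salt are assumptions:

```python
import hashlib

HOLDOUT_PCT = 5        # assumed: hold 5% of users out of this year's changes
SALT = "holdout-2019"  # rotate yearly so a fresh cohort is drawn

def in_holdout(user_id: str) -> bool:
    """Deterministically assign ~HOLDOUT_PCT% of users to the hold-out."""
    digest = hashlib.sha256(f"{SALT}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < HOLDOUT_PCT

users = [f"user-{i}" for i in range(10_000)]
held_out = sum(in_holdout(u) for u in users)
print(f"{held_out / len(users):.1%} of users held out")
```

A year later, compare this cohort's key metrics against users who received every shipped change; if the hold-out looks healthier, you have your early warning.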
5. Make “user types” a core part of your experimental design
Perhaps the most common way we draw the wrong conclusions from our experiments has to do with variation across our users. It's quite common for a given product to have power users, average users, and intermittent or infrequent users. Often, a given change will impact these groups differently.
If we ignore these user types, we become likely to accept changes that benefit the majority whilst harming the minority. On the other hand, if we only consider these groups after we've run a test, we're likely to draw false conclusions due to repeated measures on related data.
Instead, we should build this analysis into our experiment from the beginning. If we plan to do user-type-based analysis, we should set a threshold sample size for each user type (here's a great tool for this). Alternatively, one can feed these user types into a more sophisticated regression model.
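The per-user-type sample-size threshold can also be computed directly with the standard two-proportion formula; baseline rates and minimum detectable effects below are assumptions for illustration:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm to detect an absolute lift of `mde` over `p_base`."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_var = p_base + mde
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil(variance * (z_a + z_b) ** 2 / mde ** 2)

# Assumed baselines: power users convert often, infrequent users rarely,
# so each segment needs its own threshold before sub-group analysis.
for segment, p_base, mde in [("power", 0.40, 0.04),
                             ("average", 0.10, 0.02),
                             ("infrequent", 0.02, 0.01)]:
    print(segment, sample_size_per_group(p_base, mde))
```

Note how the rarest behaviours demand the largest samples, which is exactly why infrequent users are so often the group whose harm goes undetected.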
6. Establish trip wires for ancillary metrics
While laser focus on a most-important metric is inevitable, teams should also pre-determine the ancillary metrics they care to monitor across all their tests. Once determined, absolute and relative thresholds for an "acceptable" decrease should be established for each of these outcomes. Otherwise, growth teams (read: humans) are likely to rationalize undesirable trade-offs when faced with a test that also improved their KPI.
For example, if an email alteration increased purchase rates by 1%, what increase in unsubscribe rate would you tolerate? At what absolute increase in unsubscribe rate would you rule any test result unacceptable?
An added benefit to explicitly discussing and codifying rules around ancillary metric changes is that they can then be easily built in to any automated systems which oversee high-scale testing.
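A codified trip wire can be as small as a dictionary of limits plus one gating function; the metric name and thresholds below are assumptions:

```python
# Assumed policy: each guarded metric gets a max tolerable relative increase
# and an absolute ceiling beyond which the test fails outright.
TRIP_WIRES = {
    "unsubscribe_rate": {"max_rel_increase": 0.10, "max_abs_increase": 0.002},
}

def passes_trip_wires(control: dict, variant: dict) -> bool:
    """Reject an otherwise-winning test if any guarded metric degrades too far."""
    for metric, limits in TRIP_WIRES.items():
        delta = variant[metric] - control[metric]
        rel = delta / control[metric] if control[metric] else float("inf")
        if delta > limits["max_abs_increase"] or rel > limits["max_rel_increase"]:
            return False
    return True

control = {"unsubscribe_rate": 0.010}
variant = {"unsubscribe_rate": 0.013}  # +30% relative, +0.003 absolute
print(passes_trip_wires(control, variant))  # fails both thresholds
```

Because the rule is code rather than judgment, the same check can run inside whatever automated system accepts high-scale tests.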
7. Consider adding tapering to your tests & features
You can think of tapering as “fuck yes or no,” but for your product. The general idea is to stop showing features to people who aren’t responsive to them. In practice, this means exponentially increasing the time between feature exposure whenever a target action isn’t taken.
Tapering is widely used in email, where a string of unopened messages will result in Gmail relegating future emails to spam or the "promotions" folder. By slowly fading features out of a user's field of vision when they are non-responsive, we mitigate unintended downsides.
In the future I believe tapering will be a standard component of experiment systems. However, as it stands, tapering both complicates analysis and demands additional engineering resources. As a starting point, consider applying tapering to a “v2” of tests whenever you see that the variant treatment improves outcomes for some users while reducing outcomes for others.
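The exponential backoff described above fits in a few lines; the base interval and cap are assumptions:

```python
BASE_INTERVAL_DAYS = 1   # assumed: default gap between feature exposures
MAX_INTERVAL_DAYS = 64   # assumed cap: at this point, effectively faded out

def next_interval(current_interval: int, responded: bool) -> int:
    """Double the wait after a non-response; reset to base on engagement."""
    if responded:
        return BASE_INTERVAL_DAYS
    return min(current_interval * 2, MAX_INTERVAL_DAYS)

# A user ignoring a feature five times in a row:
interval = BASE_INTERVAL_DAYS
schedule = []
for _ in range(5):
    interval = next_interval(interval, responded=False)
    schedule.append(interval)
print(schedule)  # each gap doubles: [2, 4, 8, 16, 32]
```

A single engagement resets the cadence, so responsive users keep seeing the feature while non-responders drift toward never seeing it at all.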
8. Measure & understand user disaffection
Quantifying user disaffection is the holy grail when it comes to ensuring growth wins are aligned with long-term business success. It's also far from straightforward. Here are three approaches that can help provide a proxy:
8.a. Look at how a test alters how users flow between power users, average users, and infrequent users. The overall distribution may stay the same, but if substantially more users are moving between these buckets than is true in your control, you have reason to believe that the test is rather polarizing.
8.b. Interrogate users who drop from higher to lower engagement buckets the same way you would a cancelled subscription or lost revenue. Automate surveys / follow-up, and offer incentives for thoughtful, long-form feedback.
8.c. Monitor user sentiment, in app and across public social channels. As history has told us that different generations have radically different usage habits (shoutout to all the moms still on Facebook), try your best to monitor trends across ages & geographies.
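The bucket-flow idea in 8.a amounts to comparing transition matrices between control and variant; a sketch with pandas, where the data and bucket labels are assumed:

```python
import pandas as pd

# Assumed: each user's engagement bucket before and after the test window.
users = pd.DataFrame({
    "before": ["power", "power", "average", "average", "infrequent", "average"],
    "after":  ["power", "average", "power", "infrequent", "infrequent", "average"],
})

# Row-normalized transition matrix: P(after-bucket | before-bucket).
transitions = pd.crosstab(users["before"], users["after"], normalize="index")
print(transitions)

# Share of users who changed buckets. Compare this against control:
# a much higher churn rate in the variant suggests a polarizing change,
# even if the overall bucket distribution looks unchanged.
moved = (users["before"] != users["after"]).mean()
print(f"{moved:.0%} of users changed buckets")
```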
Fixing The Problem Pt. 2 - Social Norms
If growing up in late-stage capitalism has taught me anything, it's to avoid having morality square off against personal incentives. So, while I'm dubious of the impact any sort of "ethical code" might have, I'm bullish on changing incentives through the creation of social norms.
Beyond having leaders invite discussion on the risks associated with short-term thinking, I feel that the most scalable way to create these norms is through pithy, memorable statements that might spread memetically.
Given Facebook's recent history, the first thing that comes to mind here is to New York Times Test your ideas: what would the cost to the business be if this growth experiment or feature were printed on the front page of the New York Times? If large, it's likely an indicator you've ventured too far into the grey zone.
A second prompt that comes to mind is "would our power users love this?" Understandably, not every user will love every feature. However, if even your most ardent fans wouldn't be glad for a product change, you're likely wandering into an ethically dubious zone.
Finally, I’d love to hear what other ideas you have to #unfucktech. Drop a comment or tweet this link with your thoughts, and I’ll retroactively add ideas to this (already long) essay!