Thursday, March 3, 2016

How Google Works: A Google Ranking Engineer’s Story #SMX


Google Software Engineer Paul Haahr has been at Google for more than a decade. For two of those years, he shared an office with Matt Cutts. He’s taking the SMX West 2016 stage to share how Google works from a Google engineer’s perspective – or at least as much as he can share in 30 minutes. Afterward, Webmaster Trends Analyst Gary Illyes will join him onstage and the two will field questions from the SMX audience, with Search Engine Land Editor Danny Sullivan moderating (jump to the Q&A portion!).

From left: Google Webmaster Trends Analyst Gary Illyes, Google Software Engineer Paul Haahr and Search Engine Land Editor Danny Sullivan on the SMX West 2016 stage in San Jose.

How Google Works

Haahr opens by telling us what Google engineers do. Their job includes:

  • Writing code for searches
  • Optimizing metrics
  • Looking for new signals and combining old signals in new ways
  • Moving results with good ratings up
  • Moving results with bad ratings down
  • Fixing rating guidelines or developing new metrics when necessary

Two parts of a search engine:

  • Ahead of time (before the query)
  • Query processing

Before the Query

  • Crawl the web
  • Analyze the crawled pages
    • Extract links
    • Render contents
    • Annotate semantics
  • Build an index
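
To make those before-the-query steps concrete, here’s a toy sketch of the crawl-and-analyze stage in Python. It fetches one page and extracts its links; the class and function names are invented for illustration, and this is nothing like Google’s actual crawler, just the shape of the steps (rendering and semantic annotation are far more involved in a real system).

```python
# Toy "before the query" pipeline: fetch a page and extract its links.
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(url):
    """Fetch one page; return its HTML and the links it contains."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    extractor = LinkExtractor()
    extractor.feed(html)
    return html, extractor.links


if __name__ == "__main__":
    html, links = crawl("https://example.com/")
    print(f"fetched {len(html)} bytes, found {len(links)} links")
```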

The Index

  • Like the index of a book
  • For each word, a list of pages it appears on
  • Broken up into shards – groups of millions of pages
  • Plus per-document metadata
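
A toy version of that structure, assuming a made-up shard count and metadata fields (Google’s real shards hold millions of pages each):

```python
# Toy inverted index: for each word, the list of documents it appears
# in, with documents split across shards and per-document metadata
# stored alongside. Shard count and metadata fields are invented.
from collections import defaultdict

NUM_SHARDS = 3


class Shard:
    def __init__(self):
        self.postings = defaultdict(list)  # word -> [doc_id, ...]
        self.metadata = {}                 # doc_id -> {"url": ...}

    def add_document(self, doc_id, url, text):
        self.metadata[doc_id] = {"url": url}
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)


shards = [Shard() for _ in range(NUM_SHARDS)]


def index_document(doc_id, url, text):
    # Pick a shard by hashing the document id (stable within one run).
    shards[hash(doc_id) % NUM_SHARDS].add_document(doc_id, url, text)


index_document("d1", "https://example.com/a", "cheap fertilizer for texas farms")
index_document("d2", "https://example.com/b", "texas travel guide")
print(shards[hash("d1") % NUM_SHARDS].postings["fertilizer"])  # ['d1']
```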

Query Processing

  • Query understanding and expansion
    • Does the query name any known entities?
  • Retrieval and scoring (a toy scatter-gather sketch follows this list)
    • Send the query to all the shards
    • Each shard:
      • Finds the matching pages
      • Computes a score for query+page
      • Sends back the top N pages by score
    • Combine all the top pages
    • Sort by score
  • Post-retrieval adjustments
    • Host clustering
    • Is there duplication?
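
Here’s a rough Python sketch of that scatter-gather flow, with a stand-in score() that just counts matching terms; the real scoring function is where the signals described next come in. Everything here is invented for illustration.

```python
# Toy retrieval and scoring: send the query to every shard, let each
# shard return its top-N matches, then merge and sort the candidates.
import heapq


def score(query_terms, doc_text):
    """Stand-in scorer: count query terms present in the document."""
    words = set(doc_text.lower().split())
    return sum(1 for t in query_terms if t in words)


def shard_search(shard_docs, query_terms, n=10):
    """One shard: find matching pages, score them, return the top N."""
    scored = [(score(query_terms, text), doc_id)
              for doc_id, text in shard_docs.items()]
    matching = [(s, d) for s, d in scored if s > 0]
    return heapq.nlargest(n, matching)


def search(all_shards, query, n=10):
    query_terms = query.lower().split()
    candidates = []
    for shard_docs in all_shards:              # scatter to each shard
        candidates.extend(shard_search(shard_docs, query_terms, n))
    return heapq.nlargest(n, candidates)       # gather: combine and sort


shards = [
    {"d1": "cheap fertilizer for texas farms"},
    {"d2": "texas travel guide", "d3": "farm fertilizer prices in texas"},
]
print(search(shards, "texas farm fertilizer"))
```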

Scoring Signals

A signal is:

  • A piece of information used in scoring
  • Query-independent – a feature of the page itself
  • Query-dependent – a function of both the query and the page
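
As a rough illustration of the two flavors, here’s a toy scorer that mixes one query-independent signal (a made-up static page_score) with one query-dependent signal (term overlap). The weights and the linear combination are assumptions for the sketch, not Google’s actual formula.

```python
# Combining a query-independent page feature with a query-dependent
# match signal. Weights and the combination rule are invented.
def query_independent_signal(page):
    return page["page_score"]  # e.g., something link-based, computed offline


def query_dependent_signal(query, page):
    page_words = set(page["text"].lower().split())
    terms = query.lower().split()
    return sum(1 for t in terms if t in page_words) / len(terms)


def combined_score(query, page, w_static=0.3, w_query=0.7):
    return (w_static * query_independent_signal(page)
            + w_query * query_dependent_signal(query, page))


page = {"text": "texas farm fertilizer prices", "page_score": 0.8}
print(combined_score("texas farm fertilizer", page))  # 0.3*0.8 + 0.7*1.0 = 0.94
```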

Metrics

“If you cannot measure it, you cannot improve it” – Lord Kelvin

  • Relevance
    • Does a page usefully answer the user’s query?
    • Ranking’s top-line metric
  • Quality
    • How good are the results we show?
  • Time to result (faster is better)

Google measures itself with live experiments:

  • A/B experiments on real traffic
  • Looking for changes in click patterns
  • A lot of traffic is in one experiment or another

At one point, Google famously tested 41 different shades of blue to see which performed best.
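
A minimal sketch of how such a live experiment might bucket users, assuming deterministic assignment by hashing a stable user id; the hashing scheme and the click metric are illustrative, not Google’s.

```python
# Toy A/B experiment: bucket users by hashing, then compare a click
# metric between the two arms.
import hashlib


def assign_arm(user_id, experiment="blue-link-shade"):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


clicks = {"control": [], "treatment": []}

# Simulated click log: (user_id, clicked_a_result?)
log = [("u1", True), ("u2", False), ("u3", True), ("u4", True), ("u5", False)]
for user_id, clicked in log:
    clicks[assign_arm(user_id)].append(clicked)

for arm, observed in clicks.items():
    rate = sum(observed) / len(observed) if observed else float("nan")
    print(f"{arm}: {len(observed)} users, click rate {rate:.2f}")
```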

Google also does human rater experiments:

  • Show real people experimental search results
  • Ask them how good the results are
  • Ratings are aggregated across raters
  • Publish guidelines explaining criteria for raters
  • Tools support doing this in an automated way, similar to Mechanical Turk

Google judges pages on two main factors:

  • Needs Met (where mobile is front and center)
  • Page Quality

Needs Met Grades

  • Fully Meets
  • Very Highly Meets
  • Highly Meets
  • Moderately Meets
  • Slightly Meets
  • Fails to Meet
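
One plausible way ratings like these get aggregated across raters (per the bullet above) is to map each grade onto a numeric scale and average. The numeric values below are invented for illustration; Google doesn’t publish its mapping.

```python
# Hypothetical aggregation of Needs Met grades across raters.
GRADE_VALUES = {
    "Fails to Meet": 0.0,
    "Slightly Meets": 0.2,
    "Moderately Meets": 0.5,
    "Highly Meets": 0.8,
    "Very Highly Meets": 0.9,
    "Fully Meets": 1.0,
}


def aggregate(ratings):
    """Average one result's grades from several raters."""
    return sum(GRADE_VALUES[r] for r in ratings) / len(ratings)


print(aggregate(["Highly Meets", "Fully Meets", "Moderately Meets"]))
```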

Page Quality Concepts:

  • Expertise
  • Authoritativeness
  • Trustworthiness

Google Engineer Development Process

  • Idea
  • Repeat until Ready
    • Write code
    • Generate data
    • Run experiments
    • Analyze
  • Launch report written by a quantitative analyst
  • Launch review
  • Launch

What goes wrong?

There are two kinds of problems:

  • Systematically bad ratings
  • Metrics don’t capture the things we care about

Here’s an example of a bad rating: someone searches for [texas farm fertilizer] and the search results include a map to the manufacturer’s headquarters. It’s very unlikely that’s what the searcher wants, and Google can see as much in live experiments. If raters see the map and rate it Highly Meets anyway, that’s a rating failure.

Or, what if the metrics are missing something? In 2009–2011 there were lots of complaints about low-quality content, largely from content farms, yet Google’s relevance metrics kept going up. Conclusion: Google wasn’t measuring what it needed to measure. Thus, the quality metric was developed, separate from relevance.

Gary Illyes and Paul Haahr Answer Questions from the SMX Audience

SMX: How does RankBrain fit into all of this?

Haahr: RankBrain gets to see a subset of the signals. I can’t go into too much detail about how RankBrain works. We understand how it works but not as much what it’s doing. It uses a lot of the stuff that we’ve published about deep learning.

SMX: How would RankBrain know the authority of a page?

Haahr: It’s all a function of the training that it gets. It sees queries and other signals. I can’t say that much more that would be useful.

SMX: When you are logged into a Google app, do you differentiate by the information you gather? If you’re in Google Now vs. Chrome, can that impact what you’re seeing?

Haahr: It’s really a question of whether you’re logged in or not. We provide a consistent experience; your browsing history follows you either way.

SMX: Does Google deliver different results for the same queries at different times of day?

Illyes: I’m not sure. In Maps, for example, if we display something Maps-related we will show the hours (though it doesn’t change what shows up, to Gary’s knowledge).

SMX: What’s going on with Panda and Penguin?

Illyes: I gave up on giving a date or timeline on Penguin. We are working on it, thinking about how to launch it, but I honestly don’t know a date and I don’t want to say a date because I was already wrong three or four times, and it’s bad for business.

SMX: Post-Google Authorship, how are you tracking author authority?

Haahr: There I’m not going to go into any detail. What I will say is that the raters are expected to assess that manually for a page they are seeing. What we measure is whether we’re able to do a good job of surfacing the things that raters think are good authorities.

SMX: Does that mean authority is used as a direct or indirect factor?

Haahr: I wouldn’t say yes or no. It’s much more complicated than that and I can’t give a direct answer.

SMX: When explicit authorship ended, Google did say to keep having bylines. Should you bother with rel=author at all?

Illyes: There is at least one team that is still looking into using the rel=author tag for the sake of future developments. If I were an SEO, I would still leave the tag in place; it doesn’t hurt to have it. On new pages, however, it’s probably not worth adding, though we might use it for something in the future.

SMX: What are you reading right now?

Haahr: I read a lot of journalism and very few books. However, I just finished “City on Fire” – it’s about New York in the ’70s. There are 900 pages and I was disappointed when it ended. I’ve just started “It Can’t Happen Here.”

