Crawling is the very first stage of a search engine discovering your website's content. Without crawling, your pages can't be indexed or ranked, and, most importantly, they won't receive any traffic.
The process itself is remarkably complex and completely automated. Back in 2016, Google announced that it knew of 130 trillion pages on the web.
However, research from Ahrefs also shows that 96.55% of content gets zero traffic from Google. This demonstrates that search engines like Google only crawl and index a fraction of the content out there.
In this article, we’ll break down the stages of search engine crawling so you have a better understanding of how search engines crawl websites.
What is Search Engine Crawling?
Google defines web crawling as the process of discovering and downloading text, images, and videos from web pages found online, using bots called search engine crawlers. Crawlers are often referred to as spiders.
These crawlers find new pages by travelling through URLs (links). They crawl sitemaps, internal links, and backlinks to find additional pages that haven't been crawled. Once they find a new page, they extract its information to index it in their database.
Different Search Engine Bots & User Agent Strings
It’s good to know the different search engine bots and user agent strings to better understand search engine crawling.
Bots are crawlers. These are automated programs whose sole job is to discover new web pages online. User-agent strings are unique identifiers that these bots use to announce themselves when they request access to a website’s server.
This is important because if you block a bot's access, your pages won't be crawled, and therefore won't rank, on that particular search engine.
| Search Engine | Bot Name | User Agent String Example | Purpose |
| --- | --- | --- | --- |
| Google | Googlebot | Googlebot/2.1 (+http://www.google.com/bot.html) | Crawls and indexes web pages for Google Search and other Google services. |
| Bing | Bingbot | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) | Crawls and indexes web pages for Bing Search. |
| DuckDuckGo | DuckDuckBot | DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html) | Crawls and indexes web pages for DuckDuckGo Search. |
| Yandex | YandexBot | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) | Crawls and indexes web pages for Yandex Search. |
| Baidu | Baiduspider | Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) | Crawls and indexes web pages for Baidu Search. |
| Apple | Applebot | Applebot/1.0 (+http://www.apple.com/go/applebot) | Crawls and indexes web pages for Apple services like Spotlight Search. |
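If you're unsure whether your robots.txt file accidentally blocks one of these bots, here's a minimal Python sketch using the standard library's urllib.robotparser to check which of the user agents above are allowed to fetch a given URL. The domain and page path are placeholders.

```python
# Minimal sketch: check whether each crawler's user agent token is allowed
# to fetch a URL under your robots.txt rules. Domain and path are placeholders.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder domain
PAGE_URL = "https://www.example.com/blog/my-post/"  # placeholder page

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetches and parses the live robots.txt file

# User agent tokens from the table above
bots = ["Googlebot", "Bingbot", "DuckDuckBot", "YandexBot", "Baiduspider", "Applebot"]

for bot in bots:
    allowed = parser.can_fetch(bot, PAGE_URL)
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```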
The Stages of Search Engine Crawling
Before serving search results, a search engine must undergo a complex crawling process. The process can be broken down into five stages. These include:
Stage 1: Discovery
The crawling process starts with URL discovery. Search engines, like Google, use bots (in this case, Googlebot) to find new pages they haven’t crawled.
Typically, search engines have a huge queue of URLs waiting to be crawled. This is why optimising crawlability is important: it helps ensure your pages are crawled promptly rather than left sitting in the queue.
Search engines discover new content in several ways, including:
- Crawling previously crawled URLs for updates or changes.
- Crawling a website’s XML sitemap to find new pages.
- Crawling internal and external links to find new pages.
Remember, however, “Crawling is not a guarantee you’re indexed.” – Rand Fishkin
Just because your web page is discovered doesn't mean it will be indexed. Search engines apply a whole set of requirements before they index your web pages.
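Before moving on to fetching, here's a deliberately simplified sketch of the discovery queue described above (often called a crawl frontier). The class and URLs are purely illustrative; real search engines use far more sophisticated scheduling.

```python
# A simplified crawl frontier: newly discovered URLs wait in a queue until
# the crawler gets to them. Duplicate URLs are skipped.
from collections import deque

class CrawlFrontier:
    def __init__(self, seed_urls):
        self.queue = deque(seed_urls)   # URLs waiting to be crawled
        self.seen = set(seed_urls)      # avoid queueing the same URL twice

    def add(self, url):
        """Add a newly discovered URL (from a sitemap, internal link, or backlink)."""
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        """Return the next URL to crawl, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None

# Usage: seed with a homepage, then add links as they are discovered.
frontier = CrawlFrontier(["https://www.example.com/"])
frontier.add("https://www.example.com/blog/")
print(frontier.next_url())  # https://www.example.com/
```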
Stage 2: Fetching
Once a URL is selected from the queue, the search engine crawler sends a request to the web server hosting the page.
The server then responds by sending the page's content, in most cases in HTML format. This HTML code contains the structure and text of the page. It may also include links to other resources, such as images, CSS stylesheets, and JavaScript files.
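As a rough illustration of the fetching step, the sketch below requests a page and reads back its HTML using Python's standard library. The URL and user agent string are placeholders, not a real crawler's identity.

```python
# Minimal sketch of fetching: request a URL and read back the HTML.
import urllib.request

url = "https://www.example.com/"  # placeholder
request = urllib.request.Request(
    url,
    headers={"User-Agent": "ExampleCrawler/1.0 (+https://www.example.com/bot.html)"},
)

with urllib.request.urlopen(request, timeout=10) as response:
    status = response.status  # 200 on success (urlopen raises HTTPError for 4xx/5xx)
    html = response.read().decode("utf-8", errors="replace")

print(f"Status {status}: fetched {len(html)} characters of HTML")
```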
Stage 3: Parsing
After fetching the page, a crawler parses (aka analyses) the HTML content to extract information.
The type of information it extracts includes the following (a small parsing sketch follows this list):
- Links: The crawler identifies all the links within the HTML code, such as internal links and backlinks. These are then added to the discovery queue for future crawling.
- Resources: The crawler also extracts references to other resources embedded in the HTML, such as images, CSS stylesheets, and JavaScript files. These are then fetched and analysed separately.
- Metadata: The crawler also extracts metadata, such as a page's title, description, and keywords, which is used to understand the page's content and relevance.
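Here's the parsing sketch mentioned above, using Python's built-in HTMLParser to pull out links, resource references, and basic metadata. It's a simplified illustration, not how a production crawler parses pages.

```python
# Simplified parsing: extract links, resources, and metadata from fetched HTML.
from html.parser import HTMLParser

class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.resources, self.metadata = [], [], {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])                 # internal links and outbound links
        elif tag in ("img", "script") and "src" in attrs:
            self.resources.append(attrs["src"])              # images and JavaScript files
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.resources.append(attrs["href"])             # CSS stylesheets
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]  # description, keywords, etc.
        elif tag == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = data.strip()
            self._in_title = False

parser = PageParser()
parser.feed("<html><head><title>Example</title><meta name='description' content='A demo'>"
            "</head><body><a href='/about/'>About</a><img src='/logo.png'></body></html>")
print(parser.links, parser.resources, parser.metadata)
# ['/about/'] ['/logo.png'] {'title': 'Example', 'description': 'A demo'}
```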
Stage 4: Rendering
Nowadays, many pages rely on JavaScript to build their content, which means crawlers can't always understand a page just by reading the raw HTML.
To fully understand the content of a webpage, search engines need to go through the rendering process. During rendering, search engines execute the JavaScript code on a webpage, simulating how a browser would display it to a user.
This gives the crawler a better understanding of the web page content, including any dynamically generated elements.
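To illustrate what rendering involves, the sketch below uses a headless browser via Playwright. This is an assumption for demonstration purposes only (Google uses its own Chromium-based rendering service), but the idea is the same: execute the page's JavaScript, then read the rendered HTML.

```python
# Illustrative rendering sketch with a headless browser (Playwright).
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.example.com/")  # placeholder URL
    rendered_html = page.content()         # HTML *after* JavaScript has run
    browser.close()

print(f"{len(rendered_html)} characters of rendered HTML")
```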
If, however, the crawler encounters an HTTP error, such as a 500 server error, it may crawl the page less frequently. When this occurs, crawlers typically reduce their crawl rate to avoid overloading the server.
Stage 5: Indexing
The final stage of the process is search engine indexing. Indexing involves storing the parsed and rendered information in a search engine's index.
The index stores various pieces of information about each page, such as its content, links, metadata, and several other relevant signals.
This information is then used by the search engine’s ranking algorithms to determine which pages are most relevant to a particular query.
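To make the idea of an index more concrete, here's a toy version of an inverted index, the core data structure most search engines build on. The pages and query are invented examples; real indexes store far richer signals.

```python
# Toy inverted index: each word maps to the set of pages it appears on,
# so candidate pages for a query can be found quickly.
from collections import defaultdict

index = defaultdict(set)

pages = {
    "https://www.example.com/crawling/": "how search engine crawling works",
    "https://www.example.com/indexing/": "how search engine indexing works",
}

for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Query time: intersect the sets of pages containing each query term.
query = "crawling works"
results = set.intersection(*(index[word] for word in query.lower().split()))
print(results)  # {'https://www.example.com/crawling/'}
```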
Crawling vs. Indexing
People often confuse crawling and indexing, but they are two distinct processes.
Here’s a side-by-side comparison to help:
| Feature | Crawling | Indexing |
| --- | --- | --- |
| Definition | Finds new or updated web pages | Stores and organises page information |
| How it works | Bots follow links to discover pages | Bots analyse content and extract key information |
| Sources | Links, sitemaps, URL submissions | Crawled pages |
| Goal | Gather page data for the index | Create a searchable database |
| Control | Robots.txt controls which pages bots can crawl | Noindex directives (robots meta tag) control which pages are indexed |
| Outcome | A list of discovered URLs | A searchable database used for results |
Crawling comes before indexing; the information gathered during the crawl feeds the indexing process.
How Search Engines Discover and Index Web Pages
Now that you know more about crawling, let’s look into how search engines discover and index web pages.
Discovery
It all starts in the discovery phase. We touched on this earlier, so we won’t go into too much detail, but during this stage, the crawlers need to seek out pages by:
- Crawling: Search engines use bots to crawl the internet via links and try to find unknown web pages.
- Sitemaps: Crawlers read a site's XML sitemap to find pages they haven't crawled yet (see the sketch after this list).
- Page Submissions: To speed up crawling, you can also manually submit individual URLs using search engine console tools like Google Search Console.
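Here's the sitemap sketch referenced above: a minimal Python example of reading an XML sitemap to list the URLs a site wants crawled. The sitemap location is a placeholder; most sites expose one at /sitemap.xml.

```python
# Minimal sketch: fetch an XML sitemap and list the URLs it declares.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}  # standard sitemap namespace

with urllib.request.urlopen(SITEMAP_URL, timeout=10) as response:
    root = ET.fromstring(response.read())

# Each <url><loc>...</loc></url> entry is a page the site wants crawled.
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(f"Discovered {len(urls)} URLs from the sitemap")
```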
Indexing
Once a page has gone through the discovery process and has been found by crawlers, it can then be indexed. This is when the web page content is processed and stored so that search engines can retrieve it when somebody searches for relevant content.
During this process, search engines are looking at various index factors, such as:
- Content quality and relevance
- Keyword usage
- Title tags and meta descriptions
- Header tags (H1, H2, etc.)
- Internal and external links
- Image alt text
- Page load speed
- Mobile-friendliness
- Structured data
- Social signals
- Freshness of content
- Domain authority and trustworthiness
- User engagement metrics
- Robots.txt file directives
Conclusion
With so much jargon out there, it's easy to get lost in the more technical terms, but understanding how crawlers move through your website gives you a head start in understanding what Google does and doesn't like about your site, and what it can and can't reach. Improving your XML sitemap so that all of your most important pages are listed, strengthening your internal linking and information architecture so crawlers can find every page, and generally keeping in line with SEO best practices will help your site be crawled more effectively and increase your chances of being indexed.
Need some help getting your technical SEO up to scratch? Get in touch with an expert member of our team today to see how we can help.
FAQs
How often do search engine crawlers revisit web pages?
There's no fixed schedule, as crawl frequency is determined automatically and varies from site to site. However, you can encourage faster recrawling by manually submitting a URL or sitemap, or by increasing internal and external linking.
Can search engine crawlers access non-text files like images and videos?
Yes. They can access non-text files like images and videos. Although they can’t visually see the image or video, they extract information like file names, alt text, captions, and surrounding text to help them understand the file.
How do search engines discover new pages?
Search engines discover new pages using search engine crawlers. These crawlers find new pages by following internal and external links, sitemaps, or through a manual page submission.