Content scraping, or what we like to call “content theft”, has been a problem since the birth of the Internet. For anyone who publishes regularly or works on search engine optimization (SEO), it can be truly exasperating.
What is content scraping?
Content scraping basically means that someone takes your content and uses it on their own website (either manually or automatically, using plugins or bots) without giving you attribution or credit. The motive is usually to gain traffic, SEO benefit, or new users. This actually violates copyright law in the United States and some other countries. Google does not condone it either and recommends that you create your own unique content.
The following are several examples of content scraping mentioned by Google:
- Sites that copy and republish content from other sites without adding any original content or value
- Sites that copy content from other sites, modify it slightly (for example, by substituting synonyms or using automated techniques), and republish it
- Sites that reproduce content feeds from other sites without providing some type of unique organization or benefit to users
- Sites dedicated to embedding content (such as videos, images, or other media) from other sites without substantial added value for users
Do not confuse this with content syndication, which usually means republishing your own content to extend its reach. Syndication can also be done by a third party, but it differs from content scraping in one important way: someone who syndicates content should always use a special tag, such as rel=canonical or noindex.
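As a rough illustration of that difference, the sketch below inspects a page's HTML for a rel=canonical link or a robots noindex directive. It uses only Python's standard library; the class and function names and the sample markup are hypothetical, and a real check would also need to fetch the page and look at HTTP headers.

```python
from html.parser import HTMLParser


class IndexingTagParser(HTMLParser):
    """Records the canonical URL and any robots noindex directive in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None  # href of <link rel="canonical">, if present
        self.noindex = False   # True if <meta name="robots"> lists noindex

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (e.g. <link rel>), so default to "".
        attr = {name.lower(): (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href") or None
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            directives = [d.strip() for d in attr.get("content", "").lower().split(",")]
            if "noindex" in directives:
                self.noindex = True


def indexing_tags(html):
    """Return the canonical URL and noindex flag found in an HTML string."""
    parser = IndexingTagParser()
    parser.feed(html)
    return {"canonical": parser.canonical, "noindex": parser.noindex}


# Hypothetical markup for a properly syndicated copy of a post.
sample = (
    '<html><head>'
    '<link rel="canonical" href="https://example.com/original-post/">'
    '<meta name="robots" content="noindex, follow">'
    '</head><body>Republished article</body></html>'
)
print(indexing_tags(sample))
```

A page that returns neither a canonical URL pointing back to the original nor a noindex directive is not necessarily scraped, but it is not following the usual syndication convention.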
There are now many third-party WordPress plugins that let you automatically pull in third-party RSS feeds. Although the developers have good intentions, these plugins are unfortunately sometimes abused for content scraping. One of the reasons WordPress is so popular is that it is easy to use, but sometimes that works against it.
An example of a content scraping farm
When the same owner scrapes content across dozens of websites, we call it a “farm”. These are usually easy to spot because the owner tends to use the same theme on every site, and the domain names differ only slightly.
We are using a live example in today’s post! We are not ashamed to call out these types of sites, because they provide no value and only undermine the hard work done by content publishers. Here is an example of a content scraping farm. We have archived each link in case the sites go down in the future. You can click through each of them and see that they all use the same theme and the same scraped content. Scrapers usually pull content from many different sources, and our blog is one of them.
- thetechworld.xyz (archive link)
- mytechnewstoday.org (archive link)
- mytechcrunch.com (archive link)
- technewssites.xyz (archive link)
- technewssites.info (archive link)
- www.thetechworld.info (archive link)
- www.mytechnewstoday.xyz (archive link)
- www.futuretechnologynews.info (archive link)
- futuretechnologynews.xyz (archive link)
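The observation that farm domains differ only slightly can be checked mechanically. The following is a minimal sketch that groups the domains listed above by their name with the "www." prefix and the ending stripped. The function name is hypothetical, and the split on the first dot is a naive assumption that would not handle multi-part endings such as .co.uk.

```python
from collections import defaultdict

# Domains from the farm listed above.
FARM_DOMAINS = [
    "thetechworld.xyz",
    "mytechnewstoday.org",
    "mytechcrunch.com",
    "technewssites.xyz",
    "technewssites.info",
    "www.thetechworld.info",
    "www.mytechnewstoday.xyz",
    "www.futuretechnologynews.info",
    "futuretechnologynews.xyz",
]


def group_by_name(domains):
    """Group domains by the label before the first dot, ignoring 'www.'."""
    groups = defaultdict(list)
    for domain in domains:
        host = domain.strip().lower()
        if host.startswith("www."):
            host = host[4:]
        name = host.split(".")[0]  # naive: assumes a single-label ending
        groups[name].append(domain)
    return dict(groups)


groups = group_by_name(FARM_DOMAINS)
# Names registered under more than one ending are a hint, not proof.
repeated = {name: hosts for name, hosts in groups.items() if len(hosts) > 1}
for name, hosts in repeated.items():
    print(name, "->", ", ".join(hosts))
```

Four of the five names in this list appear under two different endings, which matches the pattern described above.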
As you can see, they simply copy blog posts word for word.
How can scraped content be found?
One of the easiest ways to find it is to use a tool such as Copyscape (which does not support Chinese) or Ahrefs (if the scrapers are also copying your internal links). Copyscape even lets you submit a sitemap file and notifies you automatically when it scans the web and finds your content.
You can also search Google manually using the “allintitle” operator. Just enter the operator followed by the title of your article.
(Image: Using the allintitle operator to search Google)
The allintitle keyword tells Google to search for those words only in page titles. The second, more effective way is to search for a passage of text from your post, enclosed in double quotes. Double quotation marks tell Google to search for exactly that text. A title search may return false positives because someone else may have used the same title, but the quoted search is more reliable because it is far less likely that someone else has written exactly the same sentence or paragraph.
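Both searches are easy to prepare in bulk if you have many posts to check. The sketch below only builds the query strings and the corresponding search URLs; it does not send any requests. The helper names and the sample title and sentence are hypothetical.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com/search"


def allintitle_query(title):
    """Build a query that matches the words only in page titles."""
    return "allintitle:" + " ".join(title.split())


def exact_phrase_query(text):
    """Build a query that matches the passage exactly, using double quotes."""
    cleaned = " ".join(text.replace('"', "").split())
    return '"' + cleaned + '"'


def search_url(query):
    """Return a Google search URL for the given query string."""
    return GOOGLE_SEARCH + "?" + urlencode({"q": query})


# Hypothetical post title and sentence, for illustration only.
title = "What Is Content Scraping"
sentence = "We have archived each link in case the sites go down in the future."

print(search_url(allintitle_query(title)))
print(search_url(exact_phrase_query(sentence)))
```

Opening the second URL in a browser and looking for domains other than your own is usually the quicker of the two checks, for the reason given above.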
Will content scraping affect search engine optimization?
The next question you may have is: how does this affect SEO? In the example above, the content scraping farm does not use rel=canonical tags, give credit, or use noindex tags. This means that when Googlebot crawls the page, it may treat the content as the farm’s original work. You may think that’s not fair, and you’re right, it’s not. We published the content, and they simply scraped it. However, before you start to panic, it’s important to understand what’s really going on behind the scenes.
First, even though Google’s crawler may see it as their content, Google’s algorithm probably won’t. Google is not stupid and has many rules and checks in place to make sure the original content owner still gets credit. How do we know? Well, let’s look at each of these posts from an SEO perspective.
In the example, the site scraped other people’s blog posts as far back as November 2017, so it has had plenty of time to rank if it were going to. So we opened Ahrefs and checked which keywords their copy of the article currently ranks for. It does not rank for any keywords, which means that in terms of organic traffic they have gained nothing at all from the article.
(Image: SEO of the scraped content)
If we pull up our original blog post in Ahrefs, we can see that it ranks for 96 keywords.
(Image: SEO of the original content)
When Google sees content that appears to be duplicated, it uses many different signals and data points to work out who wrote the content first and which version should rank. Here are a few examples:
- Publication date (although in this case the content was scraped on the same day)
- Domain authority and PageRank. Yes, PageRank may still be used inside Google
- Social signals
- Traffic
- Backlinks
Again, these are only reasonable assumptions, because no one outside Google really knows what it uses. The point here is that you probably do not need to lose sleep over someone scraping your content. However, you may still want to do something about it. It is not impossible for someone to outrank you with your own content. We discuss this further below.
How to deal with content scraping
Creating useful, unique, and shareable content is not easy. It takes a lot of your valuable time (and usually costs a lot of money), so you should definitely protect it. There are also some additional reasons why you might not want to ignore scrapers:
- If a website with a lot of traffic is scraping your content and using it to supplement its own, it is likely to benefit from it. That is simply wrong, because you are the original owner of the content.
- Scraped copies can seriously skew the data in your reporting tools and make your life harder. For example, they show up in the backlink reports of tools such as Ahrefs or Majestic. The older your site is, the messier it gets.
- Do you want to rely entirely on Google to work out whether their content or yours is the original? Smart as Google is, we certainly don’t. Besides, even if their post does not rank for any keywords, it has in fact been indexed by Google.
(Image: The scraped content has been indexed by Google)
Contact the website owner and file a DMCA complaint
To make sure we get credit where credit is due, we usually start by contacting the website owner and requesting removal. We recommend creating an email template that you can reuse to speed up the process. If we do not hear back after several attempts, we go a step further and file a DMCA complaint.
Filing a DMCA complaint can be a bit tricky because you need to find the website’s IP address, its hosting provider, and so on. But don’t worry, we have documented all the steps for filing a DMCA complaint and tracking down the owner. You can also submit a legal removal request directly to Google.
In the live case study above, it looks like it is time to take that next step, because we have not been able to reach the site owner.
Update the disavow file
To make sure these sites do not affect our website in any way (whatever happens with the DMCA complaint), we also add the entire domains to our disavow file. This tells Google that we want nothing to do with them and that we are not trying to manipulate the SERPs in any way.
If you are doing this for a higher-quality website, you can disavow just the URL instead of the entire domain, although we do not usually see high-quality websites scraping content.
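If you maintain the file by hand rather than exporting it from a tool, it is plain text with one entry per line: a "domain:" prefix for a whole domain, a full URL for a single page, and "#" for comments. The sketch below writes such a file for a few of the farm domains listed earlier; the function name, the comment text, and the example URL are hypothetical.

```python
def build_disavow_file(domains, urls=(), comment=None):
    """Return disavow file text: an optional comment, then domains, then URLs."""
    lines = []
    if comment:
        lines.append("# " + comment)
    seen = set()
    for domain in domains:
        entry = "domain:" + domain.strip().lower()
        if entry not in seen:  # skip duplicates, keep first-seen order
            seen.add(entry)
            lines.append(entry)
    for url in urls:
        entry = url.strip()
        if entry and entry not in seen:
            seen.add(entry)
            lines.append(entry)
    return "\n".join(lines) + "\n"


text = build_disavow_file(
    ["thetechworld.xyz", "mytechnewstoday.org", "mytechcrunch.com"],
    urls=["http://example.com/copied-post/"],  # hypothetical single-page entry
    comment="Content scraping farm",
)
print(text, end="")
```

Whole-domain entries suit a farm like this one, while the URL form is the option mentioned above for a higher-quality site where only one page is a problem.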
Step 1
In Ahrefs, we select the domains in question and click “Disavow Domains”. This ensures that nothing on the content scraping sites affects us.
(Image: Disavowing domains in Ahrefs)
When dealing with this kind of problem, the great thing about Ahrefs is its “hide disavowed links” option. With it enabled, the domains and URLs are hidden automatically and no longer appear in your main reports. This is great for staying organized and sane, especially if you rely on Ahrefs to manage your backlinks.
(Image: Hiding disavowed links)
Step 2
As you can see below, we have added all the domains from the content scraping farm to the disavowed links section of Ahrefs. The next step is to click Export and download the disavow file (TXT) that we need to submit in Google Search Console.
(Image: Exporting the disavow file)
Step 3
Then go to Google’s Disavow Tool. Select your Google Search Console profile and click “Disavow LINKS”.
(Image: Disavow links)
Step 4
Select the disavow file that you exported from Ahrefs and submit it. This will overwrite your previous disavow file. If you have not used Ahrefs before and a disavow file already exists, we recommend downloading the current file, merging it with the new one, and then uploading the result. From then on, if you only use Ahrefs, you can simply upload and overwrite.
(Image: Submitting the disavow file)
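Because an upload replaces whatever was there before, the merge step is worth doing carefully. A minimal sketch of that merge follows, assuming both files are plain text with one entry or comment per line; the function name and the sample entries are hypothetical.

```python
def merge_disavow(existing_text, new_text):
    """Combine two disavow files, keeping order and dropping repeated lines."""
    merged = []
    seen = set()
    for line in existing_text.splitlines() + new_text.splitlines():
        entry = line.strip()
        if not entry or entry in seen:
            continue  # skip blank lines and exact duplicates
        seen.add(entry)
        merged.append(entry)
    return "\n".join(merged) + "\n"


# Hypothetical contents of the file already on Google and of the new export.
existing = "# old entries\ndomain:spam-example.com\nhttp://example.org/bad-page/\n"
exported = "domain:thetechworld.xyz\ndomain:spam-example.com\n"

print(merge_disavow(existing, exported), end="")
```

Entries from the existing file come first, so nothing already disavowed is lost when the merged file is uploaded.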
Block the scraper’s IP
You can go a step further and block the scraper’s IP address. Once you have identified the abnormal traffic (which is sometimes hard to do), you can block it on your server with an .htaccess file or Nginx rules. Alternatively, if you use a third-party WAF such as Sucuri or Cloudflare, they also have options for blocking IP addresses.
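As a sketch of what those server rules can look like, the code below turns a list of IP addresses into Nginx deny directives and an Apache block. It assumes Apache 2.4 syntax (Require not ip) and that your host allows these directives in .htaccess. The addresses come from a range reserved for documentation and are placeholders, and the function names are hypothetical.

```python
import ipaddress


def validated(ips):
    """Return the addresses in canonical form, raising ValueError on bad input."""
    return [str(ipaddress.ip_address(ip.strip())) for ip in ips]


def nginx_rules(ips):
    """One 'deny' directive per address, for a server or location block."""
    return "\n".join("deny " + ip + ";" for ip in validated(ips))


def apache_rules(ips):
    """An Apache 2.4 block that allows everyone except the listed addresses."""
    lines = ["<RequireAll>", "    Require all granted"]
    lines += ["    Require not ip " + ip for ip in validated(ips)]
    lines.append("</RequireAll>")
    return "\n".join(lines)


# Placeholder addresses, not real scraper IPs.
scraper_ips = ["203.0.113.7", "203.0.113.9"]

print(nginx_rules(scraper_ips))
print(apache_rules(scraper_ips))
```

Validating the addresses first avoids pasting a malformed line into a live server configuration, where a syntax error can take the whole site down.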
Summary
Content scraping farms may not always affect your SEO, but they certainly add no value for users. We strongly recommend taking the time to get them taken down. We have an entire Trello card dedicated to handling removal requests. This helps make the web a better place for everyone and ensures that your unique content is seen and ranked only on your site.
In addition, a word to webmasters: blindly scraping content wholesale makes it hard for a site to rank well. If you want to run a content aggregation site, we suggest:
(1) Keep publishing a certain proportion of original content. We cannot give an exact figure, but for new sites original content should make up the larger share;
(2) Even when aggregating content, consider processing it in some depth, either with tools or by re-editing it manually;
(3) Use a search-push plugin to submit content to search engines promptly.