CodeIgniter Forums
sitemap.xml for auth sites - Printable Version

+- CodeIgniter Forums (https://forum.codeigniter.com)
+-- Forum: Using CodeIgniter (https://forum.codeigniter.com/forum-5.html)
+--- Forum: Best Practices (https://forum.codeigniter.com/forum-12.html)
+--- Thread: sitemap.xml for auth sites (/thread-74860.html)

Pages: 1 2


sitemap.xml for auth sites - muuucho - 11-16-2019

I have created a sitemap.xml for a project with an auth system. So far I have only added the pages outside that wall. Is there a way to let crawlers log in and crawl the pages that require authentication? If so, I would like to add those pages to my sitemap.
I am trying to make the app SEO friendly.
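For context, the sitemap with just the public pages is a plain XML file. A minimal sketch (the URLs are made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- only pages reachable without logging in -->
  <url>
    <loc>https://example.com/</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```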

Eh, not much Codeigniter in there, but what could possibly go wrong by posting to this friendly community?  Angel


RE: sitemap.xml for auth sites - jreklund - 11-16-2019

No, it can only crawl public pages.


RE: sitemap.xml for auth sites - muuucho - 11-16-2019

(11-16-2019, 02:14 PM)jreklund Wrote: No, it can only crawl public pages.
The reason I ask is that I found a page out in the wild that talked about creating a user account for crawlers with read-only permissions. The idea was to instruct a crawler, in a public file, to log in with that account's username/password. Maybe not good practice from a security point of view, but maybe SEO-friendly.
Also, if you use something like Google Analytics, at least Google gets the URIs of the logged-in pages. I wonder if that could confuse Google. I mean, my sitemap contains only the non-logged-in URIs, but Google sees that the site has many more URIs than are mentioned in the sitemap.xml.


RE: sitemap.xml for auth sites - jreklund - 11-17-2019

You can't make a user/password; that would be a security hole. And they don't even pick up on that.
What you need to do is look for a Googlebot user agent and verify its origin, then let that bot in without a username/password.

https://support.google.com/webmasters/answer/80553?hl=en

So, it will be semi-public.

If Google starts crawling those pages (you have Google Search Console, right?) you should see those URLs failing. You should return a correct status code to Google and other bots when they find a page that requires login, like 403. If you return 200, they will think it's the correct content showing up.
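Google's documentation (linked above) describes the verification as a reverse DNS lookup followed by a forward lookup. A minimal standalone PHP sketch of that check (the function name is my own, and in production you would cache the result rather than do two DNS lookups per request):

```php
<?php
// Verify that $remote_ip really belongs to Googlebot:
// 1. reverse DNS of the IP must resolve to a hostname ending
//    in googlebot.com or google.com
// 2. forward DNS of that hostname must return the same IP
function is_verified_googlebot(string $remote_ip): bool
{
    $hostname = gethostbyaddr($remote_ip);
    if ($hostname === false || $hostname === $remote_ip) {
        return false; // no reverse record at all
    }
    if (!preg_match('/\.(googlebot|google)\.com$/', $hostname)) {
        return false; // wrong domain, or a spoofed suffix like google.com.evil.tld
    }
    return gethostbyname($hostname) === $remote_ip;
}
```

The forward lookup is the important half: anyone can set a fake reverse DNS record on their own IP range, but they can't make Google's forward DNS point back at it.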


RE: sitemap.xml for auth sites - dave friend - 11-17-2019

(11-16-2019, 05:06 PM)muuucho Wrote: The reason that I ask is that I found one page out there in the wild that were talking about creating a user account for crawlers with read-only permissions.

This feels a little like a social engineering exploit. Actually, it feels a lot like one.


RE: sitemap.xml for auth sites - muuucho - 11-17-2019

(11-17-2019, 06:53 AM)dave friend Wrote:
(11-16-2019, 05:06 PM)muuucho Wrote: The reason that I ask is that I found one page out there in the wild that were talking about creating a user account for crawlers with read-only permissions.

This feels a little like a social engineering exploit. Actually, it feels a lot like one.
Well, it was in a language foreign to me and I only read it briefly, so I might have misunderstood the concept.

(11-17-2019, 04:14 AM)jreklund Wrote: You can't make a user/password; that would be a security hole. And they don't even pick up on that.
What you need to do is look for a Googlebot user agent and verify its origin, then let that bot in without a username/password.

https://support.google.com/webmasters/answer/80553?hl=en

So, it will be semi-public.

If Google starts crawling those pages (you have Google Search Console, right?) you should see those URLs failing. You should return a correct status code to Google and other bots when they find a page that requires login, like 403. If you return 200, they will think it's the correct content showing up.
OK, this is way ahead of what I have been doing before. I will try to learn more about the concept before I give it a try. Thanks for your effort!


RE: sitemap.xml for auth sites - muuucho - 11-21-2019

(11-17-2019, 04:14 AM)jreklund Wrote: You can't make a user/password; that would be a security hole. And they don't even pick up on that.
What you need to do is look for a Googlebot user agent and verify its origin, then let that bot in without a username/password.

https://support.google.com/webmasters/answer/80553?hl=en

So, it will be semi-public.

If Google starts crawling those pages (you have Google Search Console, right?) you should see those URLs failing. You should return a correct status code to Google and other bots when they find a page that requires login, like 403. If you return 200, they will think it's the correct content showing up.
So, I create an account for Googlebot in my auth system (I make up an email and a password).
Should I then do something like this in MY_Controller? (code updated)
PHP Code:
// Only check visitors that are not logged in
if (!$this->ion_auth->logged_in()) {

    // Verify that the caller is a Google bot (reverse DNS, then forward DNS)
    $remote_ip = $_SERVER['REMOTE_ADDR'];
    $domain    = gethostbyaddr($remote_ip);
    if (preg_match('/\.(googlebot|google)\.com$/', $domain)
        && gethostbyname($domain) == $remote_ip) {
        // Caller is a Google bot: log the caller in and redirect to some
        // logged-in page where the "inside" crawl can start...
        $this->ion_auth->login('made_up_email', 'made_up_password');
        redirect('auth/logged_in_start');
    }
}

// Run script...


RE: sitemap.xml for auth sites - jreklund - 11-21-2019

Nope, all your URLs will contain whatever "auth/logged_in_start" is. You should let them in to the original content. Don't display a 403 to the bot; that response should be for everyone else who accesses a valid URL and isn't logged in.
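My reading of that advice, sketched against the same hypothetical ion_auth setup: log the bot in and then simply fall through, so each crawled URL serves its own content rather than bouncing through a redirect (is_verified_googlebot() is an assumed helper doing the reverse/forward DNS check):

```php
// In MY_Controller: a verified Google bot gets a session for the
// URL it actually requested -- no redirect, so the URL that ends
// up in Google's index is the real one
if (!$this->ion_auth->logged_in() && $this->is_verified_googlebot()) {
    $this->ion_auth->login('made_up_email', 'made_up_password');
    // ...then continue; the requested controller/method renders
    // the real content with a 200 status
}
```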


RE: sitemap.xml for auth sites - muuucho - 11-21-2019

(11-21-2019, 11:13 AM)jreklund Wrote: Nope, all your URLs will contain whatever "auth/logged_in_start" is. You should let them in to the original content. Don't display a 403 to the bot; that response should be for everyone else who accesses a valid URL and isn't logged in.
Sorry, I thought you were suggesting I send a 403. The visitors who aren't logged in I redirect to a page with "register" and "login" links.

So, should I log the bot in as a user that I have prepared and let it crawl inside? I can't find much about this on the net, which makes me doubt whether this is the way to go.

Another approach, of course, is to whip up some public static pages that demonstrate the content available to registered users.


RE: sitemap.xml for auth sites - jreklund - 11-21-2019

If you want the hidden content (behind a registration wall) to be crawled by Google, that's the only way. It needs to be public somehow.

A 403 is for when you redirect a user to a login page because they hit a registration wall. Those pages should return a 403, so that Google knows you need to be logged in to view them. But if you want Google to crawl the actual page, you need to automatically let the bot in and return 200.
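A sketch of that status handling in a CodeIgniter 3 controller; the bot check is an assumed helper, and set_status_header() is CI's built-in function for sending a status code:

```php
// Visitor is not logged in and is not a verified bot:
// show the login view, but with a 403 status so crawlers
// learn that the real content requires authentication
if (!$this->ion_auth->logged_in() && !$this->is_verified_googlebot()) {
    set_status_header(403);          // not the default 200
    $this->load->view('auth/login'); // human-friendly login form
    return;
}

// Logged-in users (and bots that were logged in automatically)
// fall through and get the real content with a 200
```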

The reason you can't find any information on the web about this is that nobody does it... Pages are either public or hidden.