CodeIgniter Forums

Full Version: sitemap.xml for auth sites
I have created a sitemap.xml for a project with an auth system. So far I have only added the pages outside that wall. Is there a way to let crawlers log in and crawl the pages that require authentication? If so, I would like to add those pages to my sitemap.
I am trying to make the app SEO friendly.

Eh, not much CodeIgniter in there, but what could possibly go wrong by posting to this friendly community?
No, it can only crawl public pages.
(11-16-2019, 02:14 PM)jreklund Wrote: [ -> ]No, it can only crawl public pages.
The reason I ask is that I found a page out in the wild that talked about creating a user account for crawlers with read-only permissions. The idea was to instruct a crawler, in a public file, to log in with that account's username/password. Maybe not good practice from a security point of view, but perhaps SEO-friendly.
Also, if you use something like Google Analytics, at least Google gets the URLs of the logged-in pages. I wonder if that could confuse Google. I mean, my sitemap contains only the non-logged-in URLs, but Google sees that the site has many more URLs than are mentioned in the sitemap.xml.
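For reference, a minimal sitemap.xml that lists only the public pages might look like this (the domain and paths are made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Public pages only; URLs behind the auth wall are left out -->
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```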
You can't publish a user/password, that would be a security hole. And crawlers don't even pick up on that anyway.
What you need to do is look for a Googlebot user agent, and verify its origin. And let that bot inside without a username/password.

https://support.google.com/webmasters/an...0553?hl=en

So, it will be semi-public.

If Google starts crawling those pages (you have Google Search Console, right?) you should see those URLs failing. You should return a correct status code to Google and other bots if they find a page that requires login, like 403. If you return 200, they will think it's the correct content showing up.
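In CodeIgniter 3, returning a 403 for a page behind the wall could be sketched roughly like this (a sketch, assuming ion_auth and the standard Output class; the `auth/login_required` view name is made up):

```php
// In a controller method for a page that requires login:
if (!$this->ion_auth->logged_in()) {
    // Tell crawlers this URL requires login instead of serving a 200
    $this->output->set_status_header(403);
    $this->load->view('auth/login_required'); // hypothetical login-wall view
    return;
}
// ...otherwise render the real page content
```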
(11-16-2019, 05:06 PM)muuucho Wrote: [ -> ]The reason that I ask is that I found one page out there in the wild that were talking about creating a user account for crawlers with read-only permissions.

This feels a little like a social engineering exploit. Actually, it feels a lot like one.
(11-17-2019, 06:53 AM)dave friend Wrote: [ -> ]This feels a little like a social engineering exploit. Actually, it feels a lot like one.
Well, it was in a language foreign to me and I only read it briefly, so I might have misunderstood the concept.

(11-17-2019, 04:14 AM)jreklund Wrote: [ -> ]What you need to do is look for a Googlebot user agent, and verify its origin. And let that bot inside without a username/password.
OK, this is way ahead of what I have been doing before. I will try to learn more about the concept before I give it a try. Thanks for your effort!
(11-17-2019, 04:14 AM)jreklund Wrote: [ -> ]What you need to do is look for a Googlebot user agent, and verify its origin. And let that bot inside without a username/password.
So, I create an account for Googlebot in my auth system (I make up an email and a password).
Should I then do something like this in MY_Controller? (code updated)
PHP Code:
// Only run the bot check for visitors who are not logged in
if (!$this->ion_auth->logged_in()) {

    // Verify that the caller really is a Google bot:
    // reverse DNS lookup, check the domain, then forward-confirm the IP
    $remote_ip = $_SERVER['REMOTE_ADDR'];
    $hostname  = gethostbyaddr($remote_ip);

    if (preg_match('/\.(googlebot|google)\.com$/', $hostname)
        && gethostbyname($hostname) == $remote_ip) {
        // Caller is a Google bot: log it in and redirect to a logged-in
        // page where the "inside" crawl can start...
        $this->ion_auth->login('made_up_email', 'made_up_password');
        redirect('auth/logged_in_start');
    }
}

// Run script...
Nope, all your URLs will contain whatever "auth/logged_in_start" is. You should let them in to the original content instead. Don't display a 403 to the bot; that response should be for everyone who accesses a valid URL and isn't logged in.
(11-21-2019, 11:13 AM)jreklund Wrote: [ -> ]Nope, all your URLs will contain whatever "auth/logged_in_start" is. You should let them in to the original content instead. Don't display a 403 to the bot; that response should be for everyone who accesses a valid URL and isn't logged in.
Sorry, I thought you were suggesting sending a 403. The ones who aren't logged in I redirect to a page with "register" and "login" links.

So, should I log the bot in as a user that I have prepared and let the bot crawl inside? I can't find much about this on the net, which makes me doubt whether this is the way to go.

Another approach is of course to whip up some public static pages that demonstrate the content that is available to registered users.
If you want the hidden content (behind a registration wall) to be crawled by Google, it's the only way. It needs to be public somehow.

403 is for when you are redirecting a user to a login page, i.e. when they hit a registration wall. Those pages should return a 403, so that Google knows you need to be logged in to view them. But if you want Google to crawl the actual page, you need to automatically let the bot in and return 200.

The reason you can't find any information on the web about this is that nobody does it... Content is either public or hidden.
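Putting the thread's conclusion together, a rough MY_Controller sketch might look like the following (a sketch only, assuming ion_auth; the `auth/login_required` view name is made up, and the DNS checks follow Google's documented verify-Googlebot procedure of a reverse lookup plus a forward confirmation):

```php
// In MY_Controller::__construct(), after parent::__construct():
if (!$this->ion_auth->logged_in()) {
    $ip   = $_SERVER['REMOTE_ADDR'];
    $host = gethostbyaddr($ip);

    // Reverse DNS must end in googlebot.com or google.com, and the
    // forward DNS of that hostname must point back to the same IP
    $verified_bot = preg_match('/\.(googlebot|google)\.com$/', $host)
        && gethostbyname($host) == $ip;

    if (!$verified_bot) {
        // Regular anonymous visitor: 403 plus the login wall
        $this->output->set_status_header(403);
        echo $this->load->view('auth/login_required', NULL, TRUE); // hypothetical view
        exit;
    }
    // Verified bot: fall through and serve the original content with 200,
    // no redirect, so the crawled URL matches the real one
}
```

The key difference from the earlier attempt is that the verified bot is not redirected anywhere, so the URLs Google indexes are the same URLs real users see.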