Last post Jun 14, 2007 03:43 PM by Rhysmil
Apr 11, 2007 04:14 PM|trint99|LINK
When we rolled out the new version of our web site, using ASP.NET 2.0 and a new content management system, we got hit hard with the "Cannot use a leading .. to exit above the top directory" error. Tracing the offending IP addresses showed that the error
was almost always produced by a search engine crawler. I spent about a week googling the error message and finally found this forum post that recommended a web.config fix:
The fix involves adding the UseCookies value to your forms authentication entry. This got rid of the error message and for the last few months the issue has been out of sight, out of mind. Then a co-worker sent me this post with the same symptoms but a
different, more complicated fix:
This fix involves writing browser handlers for every possible offending User Agent. (Not something I'm looking forward to.) This latter article seems to have a much better handle on what the problem is and also mentions the effect this bug can have on
your SEO. Now I'm left with a conundrum. I'm no longer getting the 500 error in my event log thanks to the UseCookies setting. But could the error, or even this simple fix, actually be breaking search engine crawls of the site without posting the error
in my event log? It appears that google is crawling our site, but I'd hate to think that I have a phantom error that's hurting my SEO performance.
I would like to get a more educated explanation of why the UseCookies fix works. I really have no idea how this stops the error. I'd also like some input from folks wiser than me on if/why the browser file approach is better and worth the effort and if
the UseCookies approach might actually be silently failing for crawlers.
Any wisdom on the subject would be much appreciated.
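For reference, this is roughly what the web.config change looks like in our setup (everything except the cookieless attribute is a placeholder, and your existing forms element will have different attributes):

```xml
<system.web>
  <authentication mode="Forms">
    <!-- cookieless="UseCookies" is the fix: it forces cookie-based
         authentication tickets instead of the default auto-detection -->
    <forms loginUrl="Login.aspx" cookieless="UseCookies" />
  </authentication>
  <!-- the same attribute also exists on the sessionState element -->
</system.web>
```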
Apr 17, 2007 04:45 PM|trint99|LINK
Surely someone knows something about this!!
Apr 18, 2007 04:06 PM|Svante|LINK
Surely someone knows something about this!!
Well, yes, but not enough to give you any definitive answers.
My understanding is that one root of the problem lies in the way the RewritePath method is described in various documentation sources, including Microsoft's - in combination with what appears to be a bug, or at least strange behavior, in Html32TextWriter.
I don't have the time to write this up as a pedagogical article, so read carefully... ;-) Here goes:
When you use RewritePath in an HttpModule, you rewrite the incoming URL of the request before it's actually processed by its associated HttpHandler. This is what makes it possible to retarget the request to a different resource than the
original URL actually indicates. The problem with this is that we now have a situation where the browser and ASP.NET have different views on where they are. This in turn affects how relative URLs are created and interpreted, specifically it affects how URLs
should be rendered in the response.
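In code, a minimal module doing this looks something like the following (a sketch with made-up URLs; the last parameter of this RewritePath overload - named setClientFilePath in the shipped framework - is the rebaseClientPath flag I discuss below):

```csharp
// Sketch of a URL-rewriting HttpModule (illustrative paths only).
using System;
using System.Web;

public class RewriteModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += delegate(object sender, EventArgs e)
        {
            HttpContext ctx = ((HttpApplication)sender).Context;
            // e.g. incoming /myfolder/mypage.aspx -> /page.aspx?id=mypage
            if (ctx.Request.Path.Equals("/myfolder/mypage.aspx",
                                        StringComparison.OrdinalIgnoreCase))
            {
                // false here corresponds to rebaseClientPath == false
                // in the discussion below
                ctx.RewritePath("/page.aspx", string.Empty, "id=mypage", false);
            }
        };
    }

    public void Dispose() { }
}
```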
Consider the example from the excellent article at
http://todotnet.com/archive/0001/01/01/7472.aspx which you provided the link for. I've not checked the details of it, but it seems well researched and, as far as I can tell without actually trying it out, it should be correct.
The incoming request for http://www.mysite.com/myfolder/mypage.aspx is rewritten to http://mysite.com/page.aspx?id=mypage.
Now, the browser thinks we're viewing mypage.aspx, located at /myfolder/ at the host
www.mysite.com using the HTTP scheme. But what does ASP.NET think? This is where the rebaseClientPath parameter comes in. When this is false, ASP.NET will assume the browser is still pointed at /myfolder/, and when serving
the request at page.aspx, will have to render the action tag URL as "../page.aspx", since it still thinks the browser is located at "/myfolder/" - which of course also is correct in one sense. In another sense it is certainly very incorrect. With rebaseClientPath
set to true, you're telling ASP.NET that it should assume the browser is pointed at the same URL as the now rewritten request, i.e. at the root "/" at the host
www.mysite.com etc (the rewritten request is to
http://mysite.com/page.aspx?id=mypage, so now ASP.NET will work as if this is what was written in the address bar of the browser). Now, the action tag is written as "page.aspx". Also correct from the internal
point of view - but incorrect if sent to the browser.
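You can check the browser's side of this with plain URI math; this little console snippet (runnable on its own) just resolves the two relative forms against the URL the browser believes it is at:

```csharp
// How the browser resolves the action URL in the two cases.
using System;

class RelativeUrlDemo
{
    static void Main()
    {
        // The browser still believes it is at the original, pre-rewrite URL:
        Uri browserBase = new Uri("http://www.mysite.com/myfolder/mypage.aspx");

        // rebaseClientPath == false: ASP.NET emits "../page.aspx?id=mypage"
        Console.WriteLine(new Uri(browserBase, "../page.aspx?id=mypage"));
        // -> http://www.mysite.com/page.aspx?id=mypage
        //    (the postback goes to the rewritten URL, not the original)

        // rebaseClientPath == true: ASP.NET emits "page.aspx?id=mypage"
        Console.WriteLine(new Uri(browserBase, "page.aspx?id=mypage"));
        // -> http://www.mysite.com/myfolder/page.aspx?id=mypage
        //    (a resource that doesn't exist at all)
    }
}
```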
So most will set rebaseClientPath to false, and all will sort of seem well, more or less.
It is in the handling of "../page.aspx" and similar paths that the problem with Html32TextWriter apparently shows itself. This is indeed probably a bug in Html32TextWriter: it probably does not use the rebased client path, but simply assumes the rewritten request
is what it should look at, so when it tries to fix up the ../page.aspx?id=mypage path, it thinks you're trying to exit above the site root. Boom. However, even this framing makes the problem look too simple.
The real problem here is that RewritePath in general terms, does not solve the problem of URL rewriting. This is because it cannot fix the outgoing HTML correctly, except in a few special cases like the case shown. Just looking at the action tag, the correct
URL for it is of course "mypage.aspx" (or, equivalent but clearer and longer, /myfolder/mypage.aspx),
not "../page.aspx?id=mypage" (rebaseClientPath == false) or "page.aspx?id=mypage" (rebaseClientPath == true) .
So, to actually implement correct URL rewriting, you can't just settle for rewriting the incoming URL and letting ASP.NET generate relative paths to match, since this will cause postbacks to post to the rewritten URL, not the original one. In the end, real URL
rewriting requires you to implement a full reverse proxy functionality, which includes rewriting the outgoing HTML as well. You can do this by cheating and using regular expressions etc to find what needs to be rewritten, but this will never *really* work,
so you need to parse the outgoing HTML properly. All in all, not a trivial task - but certainly doable. I've done it, using the SgmlReader found at
http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=b90fddce-e60d-43f8-a5c4-c3bd760564bc as the basis. This is far from a perfect parser, but with a few tweaks and fixes it does the job. There may be others around that are better suited.
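To give an idea of what the "cheating" regex variant looks like, here is a stripped-down sketch of a response filter (illustrative only; a real implementation needs proper HTML parsing, a sensible buffering strategy, and correct encoding handling):

```csharp
// Sketch: rewriting outgoing HTML by chaining a Stream onto Response.Filter.
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public class LinkFixupFilter : Stream
{
    private readonly Stream inner;
    private readonly MemoryStream buffer = new MemoryStream();

    public LinkFixupFilter(Stream inner) { this.inner = inner; }

    public override void Write(byte[] b, int offset, int count)
    {
        buffer.Write(b, offset, count);   // buffer the whole response
    }

    public override void Flush()
    {
        string html = Encoding.UTF8.GetString(buffer.ToArray());
        // The "cheat": regex internal URLs back to their public form.
        html = Regex.Replace(html, @"page\.aspx\?id=mypage", "mypage.aspx");
        byte[] outBytes = Encoding.UTF8.GetBytes(html);
        inner.Write(outBytes, 0, outBytes.Length);
        inner.Flush();
    }

    // Boilerplate required by the abstract Stream class:
    public override bool CanRead  { get { return false; } }
    public override bool CanSeek  { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length   { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] b, int o, int c) { throw new NotSupportedException(); }
    public override long Seek(long o, SeekOrigin s) { throw new NotSupportedException(); }
    public override void SetLength(long v) { throw new NotSupportedException(); }
}

// Installed from an HttpModule, e.g.:
//   context.Response.Filter = new LinkFixupFilter(context.Response.Filter);
```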
Back to the problem at hand.
You need to either avoid using Html32TextWriter, as is detailed in the reference article, by editing the browser definitions - or set 'rebaseClientPath' to true, and then fixup the outgoing HTML yourself. You can hook yourself into the outgoing stream by
setting the Filter property of the HttpResponse object. Setting 'rebaseClientPath' to true should ensure that any relative URLs produced by ASP.NET will be correct in relation to the rewritten URL, thus avoiding the problem.
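For completeness, the browser-definition approach from the reference article boils down to dropping a .browser file into App_Browsers that forces the uplevel writer (a sketch; take the exact refID and capability name from the article or the machine-level browser files rather than from me):

```xml
<!-- App_Browsers/CrawlerFix.browser (sketch) -->
<browsers>
  <browser refID="Mozilla">
    <capabilities>
      <!-- Force the uplevel writer so Html32TextWriter's ".." fixup
           never runs for these user agents -->
      <capability name="tagwriter" value="System.Web.UI.HtmlTextWriter" />
    </capabilities>
  </browser>
</browsers>
```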
The reason why the cookie setting affects this issue is probably that ASP.NET itself uses RewritePath to strip the session/authentication ticket out of the URL when running cookieless - which it will do for clients, like most crawlers, that don't accept cookies - so forcing UseCookies keeps that internal rewrite from ever happening. RewritePath is mentioned in a blog as being used this way, but I have not investigated this any further, so it's more of a theory.
Apr 18, 2007 06:37 PM|trint99|LINK
Great explanation (if a bit hard to wrap my brain around). Thanks. And as for your theory on the cookieless sessions, a theory is better than what I had before, so thanks again!
Can anyone attest to the fact that adding the UseCookies value will NOT have adverse effects on SEO? That's my primary concern with the method. I can attest to the fact that we have not seen the ".." error even once since implementing the UseCookies fix.
Jun 14, 2007 03:43 PM|Rhysmil|LINK
Just a guess: if the cookieless session returns a good path to the crawler, everything is fine. This guess is based on the really good earlier post.