April 5, 2017 | January 4, 2020 by Captain Quirk

Considering another Sphider improvement

The original version of Sphider had very erratic support for indexing HTTPS pages, and wouldn’t even look at the robots.txt file on an HTTPS site. That has never been addressed, and even the latest version, 1.5.2, has the same failings when it comes to HTTPS. This has never really been an issue for me before, and even now it is more an annoyance than a real problem, since I can work around it.

Still, the “problem” does seem intriguing. After a bit of experimenting, it looks like a fix may not be all that difficult. (Famous last words, right?)

I am now debating whether to keep investigating and make the code changes needed to improve HTTPS support in Sphider, not only to ensure more reliable connectivity but also to get robots.txt honored. I don’t know that there is that big of a need. We’ve never received any complaints or comments on the issue…

Anyway, at this point there is a POSSIBILITY, but no definite plans one way or the other.

*******************************

UPDATE (Apr 6): I was able to get the robots.txt file read from an HTTPS site. First problem: regardless of http or https, the parsing of allowed or disallowed user agents and disallowed files/directories was iffy. If the robots.txt file had lines starting with lowercase “user-agent” or “disallow”, they were parsed, but “User-agent” or “Disallow” were not. It was a case issue. That is now fixed (on my side, not published yet). Second problem: now that I know the file IS being read and parsed, Sphider will STILL index some files in disallowed directories!

If you have any files or directories listed as “url_not_inc” in your settings, those exclusions work, but the robots.txt disallows do not, even though they SHOULD. Well, this situation certainly has gotten my interest!
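To make the case issue concrete, here is a minimal sketch of robots.txt parsing that treats field names and user-agent tokens case-insensitively. It is written in Python purely for illustration and is NOT Sphider's actual code (Sphider is PHP); the function names `parse_robots` and `is_allowed` and the agent name "sphider" are assumptions made up for this example, and it only handles simple prefix-style Disallow rules.

```python
def parse_robots(text, agent="sphider"):
    """Return the Disallow path prefixes that apply to `agent` (hypothetical helper)."""
    rules = {}            # lowercased user-agent token -> list of disallowed prefixes
    current_agents = []   # agents named by the group currently being read
    in_rules = False      # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and surrounding whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()           # "Disallow" and "disallow" now match
        value = value.strip()
        if field == "user-agent":
            if in_rules:                        # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:                           # an empty Disallow means "allow everything"
                for name in current_agents:
                    rules[name].append(value)
    agent = agent.lower()
    for token, paths in rules.items():
        if token and token != "*" and token in agent:
            return paths                        # a group naming this agent wins over "*"
    return rules.get("*", [])


def is_allowed(path, disallowed):
    """True if `path` does not start with any disallowed prefix."""
    return not any(path.startswith(prefix) for prefix in disallowed)


robots = "User-agent: *\nDisallow: /private/\n\nuser-agent: Sphider\nDISALLOW: /admin/\n"
print(is_allowed("/admin/login.php", parse_robots(robots, "Sphider")))
```

Lowercasing the field name before comparing is the whole fix for the first problem; the second problem (rules parsed but not acted upon) would sit in whatever code calls something like `is_allowed` before fetching a URL.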

*******************************

UPDATE (Apr 7): I have begun troubleshooting the code to see what is going awry and where. Working alone and having other things to do in life, this can be both time-consuming and frustrating. So far, I do know the robots.txt is read and parsed properly. Just where and why the instructions are not acted upon is another matter. At least the question of whether or not I will be attempting another modification has been answered!

*******************************

UPDATE (Apr 8): GOT IT! Preliminary tests show robots.txt is now being followed in both http and https. More testing to follow (found a couple other misc issues and fixed them). Once everything is validated, there will be a 1.5.3. Stay tuned.

6 Replies to “Considering another Sphider improvement”

Bleery says:

April 6, 2017 at 5:19 PM

That would be awesome! Ever since Google added the extra ranking signal for having https, more and more sites are moving towards it, especially with Cloudflare offering the free flex SSL certs. Would love to see it implemented, but even if not, thanks for the awesome work!

You should consider letting the people at the old Sphider forums know you have an updated and working version.
1. Captain Quirk says:
  
  April 6, 2017 at 5:50 PM
  
  Replying to your post in reverse order, I have TRIED to let people know about my updates through the Sphider forum. The catch is that all posts there are moderated by “Tec”. Tec, it seems, has developed his own updated Sphider called Sphider-Plus, but it isn’t free. For him, it’s a for-profit business. No matter how diplomatic, polite, non-aggressive, and agreeable my posts are, they rarely get approved. Whenever anyone has a problem, his reply is along the lines of “Well, Sphider is outdated and not being maintained anymore, but if you get Sphider-Plus everything will be fine.” And practically nothing gets posted on the Sphider Mods side of the forum. Oh, well.
  
  As to my version supporting HTTPS, I made a small change to a SINGLE LINE OF CODE and it looks like robots.txt is recognized and followed for HTTPS! Worldspaceflight.com is all HTTPS (in fact, you can’t even access it by HTTP; you get redirected to HTTPS). The last time I had done a full scan was when HTTPS was still optional. I modified the site in Sphider, attempted a re-index, and got an initial “NO HOST”, then a list of “http://www.worldspaceflight.com/…” entries, each with “Page contains fewer than 10 words”. I backed up the database (just in case), deleted my website from Sphider (which essentially gave me an empty database), recreated the entry as an https site, made the one-line code change, and successfully indexed the entire site (1800+ pages).
  
  It seems too simple, so I am trying a few more tests and if I can fully understand what is going on and why I initially got the “NO HOST” message, I will probably take it live. The only thing that really bothers me is the initial “NO HOST” deal. Was it a fluke? It happened TWICE in a row, so I doubt that. Was it because I changed the criteria in Sphider and tried to RE-index? Possibly. I want to play with this a bit more.
  
  The fact that people DO seem to be interested is at least a motivation to see this through.
  1. Casey Gadd says:
    
    October 4, 2018 at 2:26 PM
    
    What is the status of this issue? I am running into it as well. How can I fix it?
    1. Captain Quirk says:
      
      October 4, 2018 at 5:32 PM
      
      The issue of following the robots.txt file for both http and https was resolved in version 1.5.3. Current version is 2.0.0.
      For normal indexing of web pages, the robots.txt IS followed. There are no options to disable this. When indexing images, following the robots.txt is enabled by default. There is an option on the Index tab to disable this.
      
      Specifically, what is the issue you are encountering? Is the site not being indexed at all? Or is the indexing failing to follow a robots.txt and indexing everything?
      Either issue MAY be caused by the robots.txt file itself. The first case (no indexing) may be a robots.txt directive blocking Sphider.
      The second case MAY be caused by an improperly coded robots.txt file.
      
      Soooo…
      Let me know the specifics of the problem. A link to the site you are trying to index would be helpful so I can see what is happening and try to duplicate the issue. If you prefer NOT to publicly post a link to the site, you may still send it to me via PM on the forum. If you own the site being indexed, a copy of the robots.txt may also be useful just to save time.
J D Warthen says:

April 27, 2017 at 8:44 PM

I have been following your Sphider updates, and I must say that all of this is way above my pay grade. 😉
I am working with a webmaster who has created a site for our non-profit club, and I often post files to sync our monthly newsletter data with archives in the site. We are interested in a Search Tool for the website which will search 50+ years of newsletters for various keywords. We discovered Sphider on the web along with your updates.
Our biggest concern is whether or not the use of your newest version would decrease or maintain the security of our data. Our fear is that a search tool would open the door to hackers.
Can you address this concern for us?
Thank you
1. Captain Quirk says:
  
  April 27, 2017 at 9:40 PM
  
  The original Sphider, 1.3.6, which is what I started with, was FULL of security holes. In my early updates, just bringing Sphider back to a functional status was the priority. After that was achieved, security issues began to be addressed. The 1.5.x releases now use what is called structured SQL statements. This makes SQL injection virtually impossible. In addition, escape mechanisms are used to sanitize any malicious user input. Code has been changed to prevent the accidental display of PHP code to anyone using a browser to view source code. The pages are all written in PHP (with the exception of header.html), so all of the html a user could see is generated and not the ACTUAL source.
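For readers wondering what those two protections look like in practice, here is a minimal sketch. It assumes “structured SQL statements” refers to what are usually called prepared or parameterized statements; it is written in Python with an in-memory SQLite database purely for illustration and is NOT Sphider's actual code (Sphider is PHP). The table `pages` and the functions `search_titles` and `render_query` are hypothetical names invented for this example.

```python
import html
import sqlite3


def search_titles(conn, term):
    """Look up page titles containing `term` using a parameterized query."""
    # The "?" placeholder sends the user's text as data, never as part of the SQL itself.
    cur = conn.execute(
        "SELECT title FROM pages WHERE title LIKE ?",
        ("%" + term + "%",),
    )
    return [row[0] for row in cur.fetchall()]


def render_query(term):
    """Escape user input before echoing it back into an HTML page."""
    return "Results for: " + html.escape(term)


# Hypothetical sample data, only so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT)")
conn.executemany(
    "INSERT INTO pages (title) VALUES (?)",
    [("Apollo 11",), ("Gemini 4",)],
)

print(search_titles(conn, "Apollo"))
print(search_titles(conn, "' OR '1'='1"))   # treated as plain text, not as SQL
print(render_query("<script>alert(1)</script>"))
```

The point of the design is that the query text is fixed and the user's search term travels separately, so a classic injection string is just an odd search that matches nothing.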
  
  Is it 100% secure and foolproof? No, but there are no web sites which are. There are a few additional steps to make it even more secure from your end. For one thing, you can password protect the admin directory (contained within sphider itself). A second step would be to make your website accessible ONLY by https. A third is to move a couple of specific pieces to a location ABOVE your web root. This is a step which only the paranoid need to do, and then only if they are NOT using shared hosting and have a dedicated server. Anyone interested in this may contact me for instructions.
  
  I hope this addresses your concerns. In my opinion, your data would remain secure and no normal hacker is going to get in. I closed and locked every door I could find. The NSA? Well, that may be another story! LOL!
