Sphider 4.1.0, SphiderLite 2.1.0 are coming soon!

The next releases for Sphider are just around the corner.  There are two major changes in these releases.

The first change is the REMOVAL of the re-index restart ability. WHAT? WHY? Re-index restart was introduced because a re-index run sometimes gets interrupted. It was an attempt to allow another re-index to pick up where the last one stopped. The process worked, kind of, but not in all circumstances. The first issue was that the restart HAD to be the very next thing done, and certain steps HAD to be followed. So it wasn’t user friendly. Secondly, the restart HAD to be during the SAME session, a condition which was often not met and was totally out of the user’s control. The reason a re-index run stops is often because the session ended! In other words, IF the restart worked, it worked nicely. But when it didn’t work, which was often, it left a bigger mess than the original incomplete index.

For those who feel the process was something they can’t live without, Sphider 4.0.2 and SphiderLite 2.0.2 will remain available for download upon special request until such time as I can add instructions for duplicating the restart functionality to the Sphider MODS board at https://www.forum.worldspaceflight.com/.

The second change has to do with sitemaps. Sphider has had the ability to index using a sitemap, but with a caveat: the sitemap had to be a simple sitemap.xml with a list of links to pages. Many larger sites have a sitemap.xml which consists of links to other sitemaps. Sphider 4.1.0 and SphiderLite 2.1.0 can handle this. One thing to be aware of is that with a larger site, it might take a significant amount of time for Sphider to digest these maps! Sphider may appear frozen for a while as it works in the background. Just watch the browser tab for signs of activity.
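To make the two kinds of sitemap concrete, here is a minimal sketch that tells them apart by the name of the XML root element ("urlset" for a plain list of pages, "sitemapindex" for a sitemap of sitemaps, as defined by the sitemaps.org protocol). The function name sitemapKind() and the example.com URLs are made up for illustration; this is not how Sphider itself decides.

```php
<?php
// Hypothetical helper, not part of Sphider: reports which kind of sitemap
// an XML string contains by looking at the name of its root element.
function sitemapKind(string $xml): string
{
    // Collect libxml errors quietly so malformed input just yields 'invalid'.
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc === false) {
        return 'invalid';
    }
    switch ($doc->getName()) {
        case 'urlset':
            return 'pages';   // simple sitemap: a list of page URLs
        case 'sitemapindex':
            return 'index';   // sitemap whose entries are other sitemaps
        default:
            return 'unknown';
    }
}

// Made-up sample documents.
$simple = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<url><loc>https://example.com/page1.htm</loc></url></urlset>';
$index  = '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>'
        . '</sitemapindex>';

echo sitemapKind($simple), "\n";
echo sitemapKind($index), "\n";
```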

Sphider: Indexing from sitemaps

Sphider can index using a sitemap, PROVIDED it is a traditional sitemap of URLs and not a sitemap directory listing additional sitemaps (which contain the URLs). The directory style is popular on larger websites.

Well, we have been playing with a mod that can change that! Initial tests show that it just might actually work! We have found one instance that can mess up the process and have disarmed it. The question is, are there other instances that can derail us? Only extensive testing will tell.

We will post the mod in the Sphider Help Forum, but will also provide it here.

In spiderfuncs.php, find the function getSiteMap(). Modify the function by adding the code between the START MOD and END MOD comments, as follows:

function getSiteMap($input_file)
{
    $links = '';
    $sitemap = simplexml_load_file($input_file);
    if ($sitemap != '') {
        $links = array();
        foreach ($sitemap as $url) {
            // START MOD PART 1
            // For some reason, wlwmanifest.xml interferes with the recursion
            // Therefore, let's ignore it
            if (preg_match("/wlwmanifest\.xml$/i", $url->loc)) {
                continue;
            }
            if (preg_match("/\.xml$/i", $url->loc)) {
                // This entry points to another sitemap: load it and collect its links
                $submap = $url->loc;
                foreach ($submap as $input2) {
                    $sitemap2 = simplexml_load_file($input2);
                    if ($sitemap2 != '') {
                        foreach ($sitemap2 as $url2) {
                            $links[] = ($url2->loc);
                        }
                    }
                }
            } else {
                // END MOD PART 1
                $links[] = ($url->loc);
                // START MOD PART 2
            }
            // END MOD PART 2
        }
        $links = explode(",", (implode(",", $links)));
    }
    return $links;
}

Let us know if you try this, and ESPECIALLY if there are issues!
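The mod above follows one level of nested sitemaps. For anyone experimenting further, here is a rough sketch of a fully recursive variant. It is not the Sphider code: the function collectSitemapLinks(), the $load callable (which returns the XML text for a URL, or false), the depth limit of 5, and the example.com URLs are all assumptions made so the sketch can run without touching the network. It also decides by root element name rather than by the ".xml" suffix used in the mod.

```php
<?php
// Hypothetical recursive variant, not part of Sphider.
// $load is an assumed callable: given a URL it returns XML text or false.
function collectSitemapLinks(string $url, callable $load, int $depth = 0, array &$seen = []): array
{
    // Guard against sitemap loops and runaway nesting.
    if ($depth > 5 || isset($seen[$url])) {
        return [];
    }
    $seen[$url] = true;

    $xml = $load($url);
    if ($xml === false) {
        return [];
    }
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($doc === false) {
        return [];
    }

    $isIndex = ($doc->getName() === 'sitemapindex');
    $links = [];
    foreach ($doc as $entry) {
        $loc = trim((string) $entry->loc);
        if ($loc === '') {
            continue;
        }
        if ($isIndex) {
            // Entry is another sitemap: descend into it.
            $links = array_merge($links, collectSitemapLinks($loc, $load, $depth + 1, $seen));
        } else {
            // Entry is an ordinary page URL.
            $links[] = $loc;
        }
    }
    return $links;
}

// Usage with made-up, in-memory sitemaps instead of real downloads.
$fake = [
    'https://example.com/sitemap.xml' =>
        '<sitemapindex><sitemap><loc>https://example.com/a.xml</loc></sitemap>'
        . '<sitemap><loc>https://example.com/sitemap.xml</loc></sitemap></sitemapindex>',
    'https://example.com/a.xml' =>
        '<urlset><url><loc>https://example.com/one.htm</loc></url>'
        . '<url><loc>https://example.com/two.htm</loc></url></urlset>',
];
$loader = function (string $u) use ($fake) {
    return $fake[$u] ?? false;
};
$found = collectSitemapLinks('https://example.com/sitemap.xml', $loader);
print_r($found);
```

In real use, the loader would fetch the file from the site being indexed. The $seen list is there because a sitemap index that lists itself would otherwise never finish.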

Is Sphider obsolete?

IS Sphider obsolete!?

NO! At least not yet. Sphider is on the road to obsolescence, but it’s not there just yet. Before going into further detail, I wish to point out a few things.

The intended use for Sphider is for a website to have an internal search feature for that particular site. Sphider is not, and never was, intended to be a personal Google, Bing, Yahoo, Yandex, or any other search engine. Yes, it is capable of indexing more than a single website, but even there it is intended for indexing perhaps a family of related sites.

Next, keep in mind that Sphider first debuted when the web was a much simpler place. Websites consisted of a series of files and sub-directories (some might refer to them as folders). A website had a home page, often named “index.html”, or “index.php”, or “index.aspx”. There might be a directory named “products” and files in that directory like “product1.htm” and “product2.htm”. You would access these pages from a browser with something like “http://bigfactory/products/product1.htm”. For many websites today, this is still a valid scenario. Maybe “https” has largely replaced “http”, but it is still the same concept.

The reality, though, is that the web is changing. Take this blog, for example. It uses WordPress, and in a pretty basic, almost primitive, way. There are quite a number of pages. There is even a contact page, which, judging from what appears at the top of the browser, is located in a directory named “contact”. But you know what? There is no such directory! There is an “index.php”, but it doesn’t contain anything like what you see on the home page of this blog. Since this blog is not very complex in the way it is laid out, Sphider can index it, although the results are rather messy! That is okay, since WordPress has its own search functionality if the user wishes to implement it.

You will notice that the “downloads” page of this blog has a url of “https://www.blog.worldspaceflight.com/downloads/”. There is no name in the traditional sense, no page extension (htm, php, etc.). It looks like it is a directory, so the default would be “index.php” or something? NOPE!

This isn’t just a WordPress thing. This is the future of the internet. As time goes by, more and more websites are going to become like this. cPanel settings, htaccess settings, iframes, APIs, server configurations… These are all evolving.

So what does all this have to do with Sphider? Sphider uses old technology, technology which is still in wide use today. But that use is diminishing. Sphider is going to try to index some websites and immediately end with a “Relocation: 301” message and never get a step further. So why can’t Sphider simply follow the 301 and start indexing that page? Because it is a 301 that only Sphider can see. It isn’t a REAL 301. There is no redirect header, no redirect in htaccess (Apache servers). This is all in configuration. Sphider needs a file name, and increasingly there simply is no file name. It’s a modern website using features Sphider is not equipped to handle.
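If you want to see for yourself what a server sends back for one of these URLs, a small diagnostic like the following may help. It is only a sketch, not Sphider code: describeFirstResponse() is a made-up name, and it expects a list of raw header lines in the form PHP's get_headers() returns. The sample lines and the example.com address are invented.

```php
<?php
// Hypothetical diagnostic helper, not part of Sphider.
// $headers is a list of raw header lines, e.g. from get_headers($url).
function describeFirstResponse(array $headers): array
{
    $status = null;
    $location = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            if ($status !== null) {
                break;  // a second status line belongs to the next response
            }
            $status = (int) $m[1];
        } elseif ($status !== null && stripos($line, 'Location:') === 0) {
            $location = trim(substr($line, strlen('Location:')));
        }
    }
    return ['status' => $status, 'location' => $location];
}

// Invented sample: a redirect followed by the final page.
$sample = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: https://example.com/downloads/',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=UTF-8',
];
$info = describeFirstResponse($sample);
echo $info['status'], ' ', $info['location'] ?? '(no Location header)', "\n";
```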

So is Sphider dead? No. Is Sphider dying? No, Sphider is not dying, but the universe in which it works is definitely shrinking. As websites evolve, the number of websites able to utilize Sphider is going to decrease.

So what about Sphider now? What is its future?

I don’t see any feature changes or additions in the future. Sphider will continue to be supported and updated to keep up with the technology it does use. Sphider works with PHP 8.1 and MySQL 8. As PHP evolves, Sphider will keep pace. The same goes for MySQL. Sphider will keep up. If any hidden flaws are found in the code, they will be corrected. If security issues are detected, we will attempt to address them.

As the web evolves, there may come a time, in five or ten years, when Sphider becomes a quality buggy whip in a Tesla world. Even then you will still be able to find it residing in some antique software repository. But it isn’t quite ready to hang it all up just yet.

Sphider 4.0.0-MB and SphiderLite 2.0.0 released

The backup and restore utilities have been reworked to use MySQL directly. This is more dependable than relying on PHP. Also, a limited ability to resume a re-index process which has been interrupted has been introduced. The process to determine page character set has been enhanced. Language file conversion to Unicode has been completed. Obsolete versions of code have been removed and general code cleanup done. Further safeguards against indexing of illegal characters have been implemented. SphiderLite has had more remnants of the full version removed.

Critical update to SphiderLite, SphiderLite 1.3.1 released

A critical flaw was discovered in SphiderLite which affected the indexing of URLs. This has been corrected in 1.3.1.

All users of previous versions of SphiderLite are urged to STOP USING THEM IMMEDIATELY and upgrade to version 1.3.1. Although initial indexing with no special circumstances does seem to have been successful in prior versions, re-indexing was adversely affected. If Sphider was allowed to leave the domain, there could also have been adverse effects.

The cause of the problem? SphiderLite is a scaled down version of the full featured Sphider. When the full version was scaled down, some function parameters unique to the full version were inadvertently retained, throwing off the parameter sequence in the Lite version.

SphiderLite 1.3.1, in addition to fixing this critical flaw, has also improved the method of determining a page’s character set, improved filtering of emojis, and improved filtering of unwanted characters in the indexing of keywords.

Sphider 3.6.0-MB is UNAFFECTED by the flaw discovered in SphiderLite. It will, however, soon be updated with the same additional improvements in this Lite version.

Sphider 3.6.0-MB, SphiderLite 1.3.0 released

A potential runaway regular expression resulting in missing titles has been corrected. Crawl performance has been improved by fixing a bug that caused Sphider to try to crawl pages returning codes like 301, 401, 403, and 404. The absence of a robots.txt file on sites being crawled was generating warning errors, and this has been corrected. More potential PHP 8 errors have been averted. More obsolete code has been removed. The MB version now reports when a feed becomes invalid.

Sphider 3.5.2-MB and SphiderLite 1.2.2 are released

A change to how Sphider does searches was very recently implemented. It was found, however, that everything worked fine, PROVIDED all searches yielded results! If a search should not have any valid results, a message saying “The search for [search] yielded no documents” is supposed to be displayed. Instead, such a search actually presented the results from the last successful search! NOT GOOD!

It was found that session variables were not being cleared in the event a search yielded no results. That has been corrected.
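The general shape of that kind of fix can be sketched as follows. This is an illustration only, not the actual Sphider change: the function storeSearchResults() and the key name 'search_results' are made up, and a plain array stands in for $_SESSION so the example runs on its own.

```php
<?php
// Hypothetical sketch; the key 'search_results' and this function are
// assumptions, and a plain array stands in for $_SESSION.
function storeSearchResults(array &$session, array $results): void
{
    if (count($results) === 0) {
        // No hits: clear anything left over from the last successful search
        // so stale results cannot be displayed again.
        unset($session['search_results']);
        return;
    }
    $session['search_results'] = $results;
}

$session = [];
storeSearchResults($session, ['page1.htm', 'page2.htm']); // successful search
storeSearchResults($session, []);                         // search with no hits
var_dump(isset($session['search_results']));
```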

Sphider 3.5.1-MB, SphiderLite 1.2.1

Sphider is multibyte capable. It has been since Sphider 3.0.0-MB. Sphider 3.0.0-MB only worked on installations which had the PHP mbstring module installed. Sphider 3.1.0-MB and later work on all installations: if the mbstring module is missing, its functions are emulated. Obviously, if the module is present, emulation is not needed. Well, one emulated function neglected to check for mbstring and ALWAYS used the emulation. This version corrects that. The result is that if an installation has mbstring installed, searches will run faster than before.

Versions prior to Sphider 3.5.1 and SphiderLite 1.2.1 will work, but just not nearly as efficiently.
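For the curious, the "check first, emulate only if needed" pattern looks roughly like this. It is a sketch under assumptions, not Sphider's emulation code: safeStrlen() is a made-up wrapper, and string length is used only as a convenient example of an mbstring function.

```php
<?php
// Hypothetical wrapper, not Sphider's actual emulation code.
function safeStrlen(string $text): int
{
    if (function_exists('mb_strlen')) {
        // mbstring is installed: use the native function (fast path).
        return mb_strlen($text, 'UTF-8');
    }
    // mbstring is missing: emulate by counting UTF-8 characters
    // with a Unicode-aware regular expression.
    return (int) preg_match_all('/./su', $text);
}

$word = "na\u{00EF}ve";   // five characters, six bytes in UTF-8
echo safeStrlen($word), "\n";
```

The bug described above amounts to skipping the function_exists() test and always taking the slower emulated branch.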

Sphider 3.5.0-MB, SphiderLite 1.2.0 released

The text search feature has been updated to provide more efficiency and quicker response. Previously, a search was repeated for every page of results. This has been changed so that a particular search is performed only once. Then the appropriate subset of results is displayed for each page. This does not improve searches with only 1 page of results, but each page thereafter will see an improvement.
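The idea of searching once and then showing one slice per page can be sketched like this. The function pageOfResults() and the default page size of 10 are assumptions for illustration, not Sphider's actual code or settings.

```php
<?php
// Hypothetical sketch: the full result list is produced once elsewhere,
// and each results page just takes its own slice of that list.
function pageOfResults(array $allResults, int $page, int $perPage = 10): array
{
    // Pages are numbered from 1; anything lower is treated as page 1.
    $offset = max(0, ($page - 1) * $perPage);
    return array_slice($allResults, $offset, $perPage);
}

$all = range(1, 25);                  // stand-in for 25 search hits
$third = pageOfResults($all, 3);      // last, partial page
echo implode(',', $third), "\n";
```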

Sphider 3.4.5-MB, SphiderLite 1.1.5 released

This release fixes problems with robots.txt files, removes obsolete database functions, and removes code deprecated in PHP 7.4 and removed in PHP 8. Barring surprises, both versions of Sphider should be PHP 8 compatible.

Maintenance releases for the PDO version of Sphider have ended and the PDO version has been removed from the general download library.