Tech – World Space Flight Blog

October 2, 2024

New installation of Sphider 5.5.0 shows wrong version number

It has been noted that NEW installations of Sphider 5.5.0 will show as 5.4.1 in Settings.
This does NOT affect the functionality of Version 5.5.0. It WILL present a problem in the future if you try to upgrade! Therefore, this should be corrected.

Full details and instructions can be found in the Sphider Help Forum.

July 4, 2024 (updated July 26, 2024)

Question about Starliner thruster problems

On 28 June 2024, the NYT wrote:

Beginning next week, engineers will conduct ground tests at NASA’s White Sands Test Facility in New Mexico using a thruster identical to the ones on Starliner. The firings will reproduce the ones that Starliner performed in space.

That will probably take a couple of weeks, Mr. Stich said. “Then we’ll give engineers a chance to go look at that thruster,” he said. “This will be the real opportunity to examine a thruster, just like we’ve had in space.”

(https://www.nytimes.com/2024/06/28/science/boeing-starliner-nasa-astronauts.html)

My question is this: Aren’t these the kind of tests that should have been done BEFORE even the first orbital flight test?

December 20, 2023 (updated July 26, 2024)

Sphider 5.5.0 and SphiderLite 2.6.0 have been released

Sphider 5.5.0 and SphiderLite 2.6.0 have been released.

An error in which checking “Index decimals” in settings caused pages to not index was corrected. Along with this fix comes the option of choosing which decimal separator is to be used, decimal period or decimal comma. Prior to this, Sphider just assumed the separator was a decimal period, which excluded the half of the world that uses a decimal comma. Indexing of numbers in general was improved by stripping the thousands separator in large numbers. In addition, the ability to actually search for indexed decimals has been added.
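To illustrate the idea (this is a minimal sketch, not the code that ships in Sphider), stripping the thousands separator and honoring the chosen decimal separator could look something like the following. The function name normalizeNumber() and its parameters are made up for this example, and the thousands separator is assumed to be whichever of the two characters is not the decimal separator.

```php
<?php
// Hypothetical helper, not part of Sphider: normalize a numeric token so that
// "1.234,56" (decimal comma) and "1,234.56" (decimal period) index alike.
function normalizeNumber(string $token, string $decimalSep = '.'): string
{
    // Assumption: the thousands separator is the "other" character.
    $thousandsSep = ($decimalSep === ',') ? '.' : ',';

    // Strip the thousands separators first ...
    $plain = str_replace($thousandsSep, '', $token);

    // ... then store the decimal mark as a period.
    return str_replace($decimalSep, '.', $plain);
}
```

With the decimal comma selected, normalizeNumber('1.234,56', ',') and normalizeNumber('1,234.56') both come out as the same string, so either notation can be found by the same search.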

Also new is the ability to read compressed sitemaps (*.xml.gz). The limitation is that the uncompressed file size must not exceed 100,000 bytes. This should still be enough to link 500 URLs per xml.gz file.

One column name in the settings table was changed to avoid a conflict with a MySQL reserved word.

There were also several code deprecation fixes applied. Testing is being done using PHP 8.3.

The Sphiderlite version got a fix for “Update settings” actually corrupting the settings file.

One thing to note about the ability to read xml.gz files: the UNCOMPRESSED size of the xml file must NOT exceed 100,000 bytes! If the uncompressed size exceeds 100,000 bytes, the file will be treated as invalid and totally ignored. Even with this limitation, that is still large enough to accommodate over 500 URLs.
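For the curious, here is a rough sketch of how a size limit like this can be enforced while reading a gzipped file. It is NOT the actual Sphider code. The function name readGzSitemap() and the chunk size are assumptions for the example, and it needs PHP's zlib extension.

```php
<?php
// Hypothetical sketch, not Sphider's code: read a gzipped sitemap, but give
// up as soon as the uncompressed data grows past the limit.
function readGzSitemap(string $path, int $maxBytes = 100000): ?string
{
    $handle = gzopen($path, 'rb');
    if ($handle === false) {
        return null; // file could not be opened
    }

    $data = '';
    while (!gzeof($handle)) {
        $chunk = gzread($handle, 8192); // 8192 is an arbitrary chunk size
        if ($chunk === false) {
            gzclose($handle);
            return null; // read error
        }
        $data .= $chunk;
        if (strlen($data) > $maxBytes) {
            gzclose($handle);
            return null; // too large when uncompressed: ignore the file
        }
    }

    gzclose($handle);
    return $data;
}
```

Reading in chunks means an oversized file is abandoned early instead of being fully unpacked into memory first.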

October 14, 2023 (updated July 26, 2024)

Sphider 5.4.0, SphiderLite 2.5.0 have been released

Processing of robots.txt files has been improved. Robots.txt is now case sensitive and consideration is given to “allow” directives. All common text files have been integrated into Sphider. The user may assign a default language to a web site, but Sphider will also try to detect the language used on each page and use the appropriate common text set. A new feature is the option of setting built-in pauses during indexing. For those running from a command prompt, user help has been updated for better instruction in the use of “must-include” and “must-not-include” directives. The possibility of the “index to” level being blank has been fixed.
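As an illustration of what case-sensitive matching with “allow” directives can mean in practice, here is a small sketch. It is not Sphider's code. It assumes the common convention from RFC 9309 that the longest matching rule wins and that “allow” wins a tie; whether Sphider resolves conflicts exactly this way is an assumption. The function name isPathAllowed() and the rule layout are invented for the example, and wildcards are left out.

```php
<?php
// Hypothetical sketch, not Sphider's code. $rules is a list of
// ['allow' => bool, 'path' => string] entries for one user agent.
function isPathAllowed(string $path, array $rules): bool
{
    $allowed = true;   // no matching rule means the path may be crawled
    $bestLength = -1;

    foreach ($rules as $rule) {
        $prefix = $rule['path'];
        // strncmp() is case sensitive, so "/Private/" does not match "/private/".
        if ($prefix === '' || strncmp($path, $prefix, strlen($prefix)) !== 0) {
            continue;
        }
        $length = strlen($prefix);
        // The longest prefix wins; on a tie, "allow" beats "disallow".
        if ($length > $bestLength || ($length === $bestLength && $rule['allow'])) {
            $bestLength = $length;
            $allowed = $rule['allow'];
        }
    }

    return $allowed;
}
```

With a “disallow” on /private/ and an “allow” on /private/public/, a page under /private/public/ is crawled while the rest of /private/ is skipped.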

In the full version, Sphider not obeying the “must-not-include” directives during image indexing has been corrected. Also fixed was Sphider not picking up the width, height, and alt attributes in the img tag. Additionally, ‘jpeg’, ‘webp’, and ‘svg’ files are now recognized. Support for ‘tif’ image files has been dropped. (Does anyone even use tif/tiff any more?)

The User Guide has also been updated.

September 4, 2023 (updated February 6, 2024)

Sphider 5.3.0, SphiderLite 2.4.0 are released

The newest versions of Sphider are (fingers crossed) PHP 8.3 ready. More importantly, the PHP mbstring extension is now required. In earlier editions, the mbstring functions were emulated. There has also been some code cleanup and standardization.

Sphider currently can index PDF files, DOC, PPT, and XLS files, although a third party utility is needed in a Windows environment. There is consideration that the next edition of Sphider be able to also index DOCX, RTF, and ODT files. If anyone out there thinks this would be useful, let us know.

July 11, 2022 (updated December 29, 2023)

PHP and MySqlnd (to make Sphider function)

Time to revisit the issue of enabling mysqlnd in PHP in order to make Sphider function. I have given instructions in the past on how to enable mysqlnd if it isn’t already. I believe those instructions were either unclear or incomplete.

A bit of history… At one time, mysqlnd was a separate module from mysqli. That is no longer the case. Mysqlnd (mysql native driver) is built into PHP. However, some hosting companies, particularly where shared hosting is concerned, have the installation configured so as to NOT enable mysqlnd. Why this is I do not know. Why not just have it enabled from the get-go since it is a NATIVE driver and is already part of the package!

Fortunately, most users do have access to CPanel, which allows the user to change the configuration. But this can get interesting as the method is a bit counterintuitive.

To enable mysqlnd, you need to disable mysqli (which you really aren’t doing!), but then you also need to enable nd_mysqli! Are you confused yet? Trust me, this works. These are the CORRECT CPanel settings to get mysqlnd working on your system:

UNTICK mysqli
TICK mysqlnd
TICK nd_mysqli

Save configuration.
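If you want to confirm the change took effect, a quick check along these lines can be dropped into a scratch PHP file on the server. This is just a diagnostic sketch, not part of Sphider, and the function name mysqlDriverReport() is made up. It relies on mysqli_get_client_stats(), which the PHP manual lists as available only with mysqlnd.

```php
<?php
// Hypothetical diagnostic, not part of Sphider: report which MySQL pieces
// this PHP installation has loaded.
function mysqlDriverReport(): array
{
    return [
        'mysqli'              => extension_loaded('mysqli'),
        'mysqlnd'             => extension_loaded('mysqlnd'),
        // This function only exists when mysqli is running on mysqlnd.
        'mysqli_uses_mysqlnd' => function_exists('mysqli_get_client_stats'),
    ];
}

foreach (mysqlDriverReport() as $name => $ok) {
    echo $name, ': ', $ok ? 'yes' : 'no', PHP_EOL;
}
```

All three lines should read “yes” once the settings above have been saved.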

July 8, 2022 (updated July 26, 2024)

Trouble indexing a website with Sphider

Some websites just don’t index very well. Here are some examples, and a solution if there is one.

You are trying to index “http://somesite.com/” and you get an initial 301 error and get no further. You try the hack for the fake 301 errors (my last post), but that doesn’t work. Cause? There just might be a REAL 301 error. A browser will take the “http://somesite.com/” and follow the redirect to “https://somesite.com/”! Sphider isn’t that smart. The fix is to tell Sphider to look for https instead of http.
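One way to tell a real redirect from a bogus one is to look at the response headers yourself. The sketch below is not part of Sphider, and the function name redirectTarget() is invented for the example. It takes the raw header lines (for instance from PHP's get_headers() with redirects turned off) and returns the new address if there is one.

```php
<?php
// Hypothetical helper, not part of Sphider: given raw response header lines,
// return the redirect target, or null if there is no usable redirect.
function redirectTarget(array $headerLines): ?string
{
    // The first line is the status line, e.g. "HTTP/1.1 301 Moved Permanently".
    if (!isset($headerLines[0])
        || !preg_match('#^HTTP/\S+\s+30[1278]\b#', $headerLines[0])) {
        return null; // not a redirect status
    }
    foreach ($headerLines as $line) {
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null; // redirect status without a Location header
}

// Possible usage against a live site (not run here):
// $context = stream_context_create(['http' => ['follow_location' => 0]]);
// $target  = redirectTarget(get_headers('http://somesite.com/', false, $context));
```

If the function hands back an https address, that is the one to give to Sphider. If it hands back null on a 301, you are probably looking at one of the fake ones.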

You try to index “https://somesite.com/”, the initial page indexes, but no further pages are found, or some pages are found but not others. The likely cause is that the website uses https and http interchangeably. That might work for a browser, but not for Sphider. As a Sphider user, there isn’t much you can do except hope the website owner does some editing and makes his/her references consistent.

After the first page, some sites will not index very well. A possible cause is a heavy use of JavaScript in forming the pages, particularly the references (links). Sphider does not index JavaScript. As a Sphider user, there is nothing you can do.

You are indexing a site and you get a lot of garbage results. Sphider is built for full four-byte UTF-8. Not all websites have UTF-8 pages, and that is fine because Sphider knows that and performs conversions if needed. Not every web page tells Sphider what encoding it does use, and that is okay, too. Sphider is pretty good at figuring these things out. But sometimes, thankfully not often, a web page will be written with one encoding but explicitly state that it is a different encoding. For example, a page written in Windows-1252 but declaring it is UTF-8 isn’t going to be converted to UTF-8 because Sphider has been led to believe it already is! The result is going to be some strange index results. Even worse, a page is UTF-8 but says it is something else… Believe me, converting UTF-8 to UTF-8 is going to be a mess! As a Sphider user, there is nothing you can do about a poorly written web page.
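To show the kind of sanity check involved (a simplified sketch only, not how Sphider actually does it), the declared encoding can be tested against the bytes before any conversion. The function name toUtf8() is made up, and falling back to Windows-1252 for a page that wrongly claims UTF-8 is purely an assumption for the example. It needs the mbstring extension.

```php
<?php
// Hypothetical sketch, not Sphider's code. Requires the mbstring extension.
function toUtf8(string $html, string $declared): string
{
    // Bytes that are already valid UTF-8 are passed through untouched,
    // whatever the page declares. This avoids converting UTF-8 "to UTF-8".
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    // Not valid UTF-8. If the page claimed UTF-8 anyway, it is mislabeled;
    // Windows-1252 is only a guess at what it really is (an assumption).
    $source = (strcasecmp($declared, 'UTF-8') === 0) ? 'Windows-1252' : $declared;

    return mb_convert_encoding($html, 'UTF-8', $source);
}
```

This is a heuristic and can still guess wrong, and an encoding name that mbstring does not know will raise an error in PHP 8, so a real implementation needs more care than this.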

Another scenario is that you are indexing away and Sphider suddenly quits. You investigate and finally find it exited with a PHP exhausted memory error. After looking further, you see the error occurred on a file that Sphider shouldn’t even be processing. In one instance, I had PHP crash while trying to index a .swf (flash) file. Sphider SHOULD have reported a .swf file as “Not text or html” and gone on to the next page. I tore my hair out trying to see what the issue was, and it turns out the website was sending an erroneous header reporting the WRONG “Content-Type:”. I had other strange halts with PHP errors on that same website, and the cause each time was an incorrect header stating the content type. Short of writing a huge function to determine content type from the file extension instead of reading the headers sent, the only thing a user can do is identify all of the site’s problem URLs and put them in the “Must not” section of the site settings. As an aside, I am clueless as to how a website can send the wrong file headers. Anyone out there have insight on this?
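Just to show what a (very small) extension check might look like, here is a sketch. It is not in Sphider, the function name looksBinary() is made up, and the list of extensions is only a sample, nowhere near complete.

```php
<?php
// Hypothetical sketch, not Sphider's code: decide from the URL alone whether
// a file is probably not text or html, without trusting the Content-Type.
function looksBinary(string $url, array $skip = ['swf', 'zip', 'exe', 'mp4']): bool
{
    // Use only the path, so a query string like "?x=1" does not get in the way.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path (or an unparsable URL): nothing to judge by
    }
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return in_array($extension, $skip, true);
}
```

A check like this would have caught the .swf file no matter what the server claimed in its headers, but it can do nothing for URLs that have no extension at all.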

July 6, 2022 (updated July 26, 2024)

Sphider and Sphiderlite — and 301’s!

Sphider 4.2.0 and Sphiderlite 2.2.0 have recently been released. These editions corrected a few issues which have slowly crept in. Stray white space was interfering with phrase searches. Some MySql installations (or was it PHP?) were causing some mysqli errors which resulted in dropped connections. We discovered some new code deprecation in PHP 8.1. Filters started to cause some corruption of certain Unicode characters.

Well, these recent releases corrected those issues. And even though these releases are stable, we have more improvements on the way! Sphider 4.2.1 and Sphiderlite 2.2.1 pre-identified some code deprecation from the not-yet-released PHP 8.2. We also improved identification of web page encoding. On rare occasions, a web page would throw an error during indexing due to a wrong interpretation of the page encoding. The odds of that happening have been greatly reduced. (NEVER say it can’t happen!) Also, the size of a spidering log is now displayed in the spidering log list. Look for these releases very soon!

One “issue” that remains is that SOME websites, typically WordPress sites, just refuse to be indexed! MOST WordPress sites do fine … some don’t. The very first page comes back with a “301” (relocated) error, no other pages are found, and the indexing run halts with nothing being indexed. Upon investigation, the 301 is bogus. There is no redirection. We thought maybe it is something with WordPress, but now doubt that is the case. We really don’t have a clue as to the cause. Our latest thought is MAYBE it is something done intentionally to ward off indexing by small potatoes, like Sphider?

If anyone out there knows the cause of these phony 301 errors being given to Sphider, let us know!

At any rate, those stubborn pages CAN be indexed by Sphider/Sphiderlite, using a hack. And a hack is exactly what it is … not something you would want as a normal part of Sphider. The hack can be found on the Sphider forum.

(There are other reasons for web sites that won’t index or won’t totally index, but that is for another post.)

EDIT: 7/15/2022
Found another possible cause of “fake” 301 errors! It may be that some websites do not like or recognize the User Agent string and block the crawl with a 301 error. Changing the User Agent string (in Settings) may help!

April 18, 2022 (updated February 6, 2024)

Sphider 4.1.0, SphiderLite 2.1.0 are coming soon!

The next releases for Sphider are just around the corner. There are two major changes in these releases.

The first change is the REMOVAL of the re-index restart ability. WHAT? WHY? Re-index restart was introduced because of an issue in which sometimes a re-index run gets interrupted. This was an attempt to be able to do another re-index, picking up where the last one stopped. The process worked, kind of, and not in all circumstances. The first issue was that the restart HAD to be the very next thing done, and certain steps HAD to be followed. So it wasn’t user friendly. Secondly, the restart HAD to be during the SAME session, a condition which was often not met and was totally out of the user’s control. The reason a re-index run stops is often because the session ended! In other words, IF the restart worked, it worked nicely. But when it didn’t work, which was often, it left a bigger mess than the original incomplete index.

For those who feel the process was something they can’t live without, Sphider 4.0.2 and SphiderLite 2.0.2 will remain available for download upon special request until such time as I can add instructions for duplicating the restart functionality to the Sphider MODS board at https://www.forum.worldspaceflight.com/.

The second change has to do with sitemaps. Sphider has had the ability to index using a sitemap, but with a caveat: the sitemap had to be a simple sitemap.xml with a list of links to pages. Many larger sites have a sitemap.xml which consists of links to other sitemaps. Sphider 4.1.0 and SphiderLite 2.1.0 can handle this. One thing to be aware of is that with a larger site, it might take a very significant amount of time for Sphider to digest these maps! Sphider may appear frozen for a while as it works in the background. Just watch in the browser tab for signs of activity.

February 28, 2022 (updated July 26, 2024)

Sphider: Indexing from sitemaps

Sphider can index using a sitemap, PROVIDED it is a traditional sitemap of URLs and not a sitemap directory listing additional sitemaps (which contain the URLs). The directory style is popular on larger websites.

Well, we have been playing with a mod that can change that! Initial tests show that it just might actually work! We have found one instance that can mess up the process and have disarmed it. The question is, are there other instances that can derail us? Only extensive testing will tell.

We will post the mod in the Sphider Help Forum, but will also provide it here.

In spiderfuncs.php, find the function getSiteMap(). Modify the function by adding the code between the START MOD and END MOD comments, as follows:

function getSiteMap($input_file)
{
    $links = '';
    $sitemap = simplexml_load_file($input_file);
    if ($sitemap != '') {
        $links = array ();
        foreach ($sitemap as $url) {
            // START MOD PART 1
            // For some reason, wlwmanifest.xml interferes with the recursion
            // Therefore, let's ignore it
            if (preg_match("/wlwmanifest\.xml$/i", $url->loc)) {
                continue;
            }
            // An entry ending in .xml is treated as a nested sitemap
            if (preg_match("/\.xml$/i", $url->loc)) {
                $submap = $url->loc;
                foreach ($submap as $input2) {
                    // Load the nested sitemap and collect its page links
                    $sitemap2 = simplexml_load_file($input2);
                    if ($sitemap2 != '') {
                        foreach ($sitemap2 as $url2) {
                            $links[] = ($url2->loc);
                        }
                    }
                }
            } else {
                // END MOD PART 1
                $links[] =($url->loc);
                // START MOD PART 2
            }
            // END MOD PART 2
        }
        // Convert the collected SimpleXML objects into plain strings
        $links = explode(",", (implode(",", $links)));
    }
    return $links;
}

Let us know if you try this, and ESPECIALLY if there are issues!
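For anyone who wants to experiment further, here is a separate sketch (NOT the mod above, and not Sphider code) of another way to tell the two kinds of sitemap apart: look at the name of the root element instead of the file extension of each link. In the sitemaps.org format a plain sitemap has a root of urlset, while a sitemap directory has a root of sitemapindex. The function name classifySitemap() is made up, and it takes the XML as a string so it can be tried without touching the network.

```php
<?php
// Hypothetical sketch, not Sphider's code: report the kind of sitemap and
// the <loc> addresses it lists.
function classifySitemap(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return ['type' => 'invalid', 'locs' => []];
    }

    $locs = [];
    foreach ($doc as $entry) {
        // Each <url> or <sitemap> entry should carry one <loc> child.
        if (isset($entry->loc)) {
            $locs[] = trim((string) $entry->loc);
        }
    }

    // The root element is "urlset" for pages, "sitemapindex" for a directory.
    return ['type' => $doc->getName(), 'locs' => $locs];
}
```

When the type comes back as sitemapindex, each address in the list is another sitemap to load. When it comes back as urlset, the addresses are the pages themselves.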