Sphider 3.1.0-MB and Sphider 2.4.2-PDO released

Sphider 3.1.0-MB is multibyte capable, like 3.0.0-MB. However, 3.1.0-MB does NOT require the PHP mbstring extension. Mbstring is recommended, but not required. If it is available, it will be used. If not, Sphider will emulate the multibyte character string functions. Also, 3.1.0-MB continues the improvements always being made to the original fork. Since there are no longer any special requirements other than the typical MySQLi/MySQLnd extensions, there is no longer a need for the 2.4.x line.
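To give a feel for what "emulating" multibyte string functions involves, here is a minimal sketch in Python (not Sphider's actual PHP code). It assumes the input is valid UTF-8 and that indexes are non-negative; the function names utf8_char_count and utf8_substr are made up for illustration. The idea is that a UTF-8 continuation byte always looks like 10xxxxxx, so any other byte marks the start of a character.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters in UTF-8 bytes without decoding them.

    Continuation bytes have the bit pattern 10xxxxxx, so every byte that
    does NOT match that pattern starts a new character.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def utf8_substr(data: bytes, start: int, length: int) -> bytes:
    """Return `length` characters beginning at character index `start`."""
    # Byte offsets where each character begins, plus the end of the data.
    starts = [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]
    starts.append(len(data))
    last = len(starts) - 1
    begin = starts[min(start, last)]
    end = starts[min(start + length, last)]
    return data[begin:end]


text = "naïve café".encode("utf-8")
byte_length = len(text)              # what a byte-oriented length reports
char_length = utf8_char_count(text)  # what a multibyte-aware length reports
last_word = utf8_substr(text, 6, 4).decode("utf-8")
```

A byte-oriented length and a character-oriented length disagree as soon as the text contains anything beyond ASCII, which is exactly where indexing and searching become unpredictable.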

Sphider 2.4.2-PDO provides a fix for a problem with 2.4.1-PDO which could cause some UTF-8 characters to be mistaken for ISO-8859-1 characters. The resulting “conversion” produced rubbish. The PDO fork will continue to be available and supported, but no further product enhancements are anticipated.
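For anyone curious what that kind of rubbish looks like, here is a small Python illustration (an assumption-laden sketch, not the code in Sphider). It shows UTF-8 bytes being read as ISO-8859-1, and one common way to avoid the mistake: try strict UTF-8 first and only fall back when the bytes are not valid UTF-8. The name decode_page is hypothetical.

```python
def decode_page(raw: bytes) -> str:
    """Decode as UTF-8 when the bytes are valid UTF-8, else as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


utf8_bytes = "é".encode("utf-8")           # two bytes: 0xC3 0xA9
garbled = utf8_bytes.decode("iso-8859-1")  # the wrong guess: two characters
correct = decode_page(utf8_bytes)          # the right guess: one character
legacy = decode_page(b"\xe9")              # a genuine ISO-8859-1 byte
```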

What’s next for Sphider?

Sphider 2.4.0 is barely out the door, and thoughts are already turning to — “What next?”

There actually are some plans well in the works. Sphider 2.4.1 will be pretty low impact. The “major” change will be in SQL error reporting when a statement preparation fails. At this point, an SQL statement should never fail, but on the off chance one ever does, better to have a meaningful error message! A second very minor change will improve UTF-8 text handling.
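As a rough idea of what "meaningful" means here, the sketch below wraps statement execution so that a failure reports both the database's message and the offending statement. It is written in Python against the built-in sqlite3 module purely as a stand-in; Sphider itself is PHP talking to MySQL, and the helper name run_statement is invented for this example.

```python
import sqlite3


def run_statement(conn: sqlite3.Connection, sql: str, params=()):
    """Execute a statement, turning any database error into a readable report."""
    try:
        return conn.execute(sql, params)
    except sqlite3.Error as exc:
        # Include the statement itself so the failure can be traced.
        raise RuntimeError(f"SQL statement failed: {exc} | statement: {sql}") from exc


conn = sqlite3.connect(":memory:")
good_row = run_statement(conn, "SELECT 1").fetchone()

try:
    run_statement(conn, "SELEC 1")  # deliberate typo
    report = ""
except RuntimeError as err:
    report = str(err)
```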

Thought has been given to the status of the PDO edition. The fact that many people, particularly those on shared hosting, are “forced” into using PDO dictates that the edition should continue to be available. At the same time, PDO users tend to be smaller in scope and less demanding in requirements than others (who tend to be either fully hosted or self hosted). With these thoughts in mind, the PDO edition will continue to be supported and there may even be minor updates from time to time, but major updates in functionality will be discontinued.

Now… the regular/classic/legacy edition will continue on. There is a NEW fork in the works, also. Sphider has been constantly improving with the use of Unicode (utf8 variety), but there is still one stumbling block. Unicode has multi-byte characters and character strings. Standard PHP string functions aren’t equipped to handle multi-byte characters/strings. The mbstring module for PHP is equipped… that’s what the “mb” part of the name means: “multi-byte”. The problem is, not every installation of PHP comes with mbstring.

For the time being, the “normal” Sphider will use standard PHP string handling functions, with the drawback that indexing and searching of multi-byte strings may be unpredictable. Sphider 3-MB has replaced all standard string handling with multi-byte string handling, with the drawback that it won’t work for all clients.

Once again, there will be two editions of Sphider — standard string handling, and multi-byte string handling. The eventual goal will be to merge the two so that if mbstring is available, it will be used. If mbstring isn’t available, some custom functions will try to achieve the same result.

Sphider 3-MB will require a MINIMUM of MySQL server 5.5.3. Recommended MySQL server is 5.6 or better. Utf8mb4 is NOT supported in MySQL server versions earlier than 5.5.3. Sphider 3-MB will also require that PHP have both MySQLnd and mbstring installed and available. Sphider 3-MB will be available by the end of April or early May. A test script will be provided so that support can be determined before installation.
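The version part of such a check is simple enough to sketch. The Python below is only an illustration of the comparison, not the test script itself. It assumes a plain MySQL version string such as “5.5.62-log” (MariaDB reports versions differently and is not handled), and the function name supports_utf8mb4 is hypothetical.

```python
def supports_utf8mb4(server_version: str) -> bool:
    """Return True when a MySQL version string is 5.5.3 or later."""
    # Drop any suffix such as "-log" or "-0ubuntu0.14.04.1".
    numeric = server_version.split("-", 1)[0]
    parts = [int(piece) for piece in numeric.split(".")[:3]]
    # Pad short versions like "5.6" to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= (5, 5, 3)


results = {v: supports_utf8mb4(v) for v in ("5.1.73-log", "5.5.2", "5.5.3", "5.6", "8.0.36")}
```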

Speaking of MySQL server and utf8mb4, Sphider 2.3+, both standard and PDO, use utf8mb4, so they too require MySQL server 5.5.3+. If you happen to have a lower version of MySQL server, and are unable to upgrade, we can provide an earlier version (2.2.0) of Sphider upon request. (Specify standard or PDO.)

Emojis revisited

Not very long ago, I wondered whether or not Sphider still needed to scan for, and remove, emojis. This came about because of a change in the database from 3-byte utf to 4-byte. Upon testing, the scan and removal of emojis will continue. Sphider, and more particularly, MySQL, just doesn’t like emojis. When trying to store any full text containing an emoji, an SQL exception is thrown and the page is not stored.

The issue reported earlier with the removeEmoji() function is due to the function operating at the UTF-8 level, and the probability of the input NOT being UTF-8. For future releases of Sphider, this function will be executing AFTER it is (nearly) guaranteed that the input will be UTF-8. (I say “nearly” because there are no guarantees in this world where code is involved.)

It was also noted that the function, as currently implemented, is somewhat outdated. While updating it is possible, the function would become a bit unwieldy. Leaving it alone is practical, however. This is because pages containing emojis are, while not rare, relatively uncommon. And within the pages that DO contain an emoji, the odds are that emoji is of the simpler, more common type. The kind an expanded filter would catch ARE rare in web pages, being more likely to occur in messaging applications used in smart phones and tablets. In other words, why add the complexity to Sphider to catch something that the vast majority of users are never going to encounter?
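To show how small a "simple" filter can stay, here is a deliberately crude Python sketch. It is NOT Sphider's removeEmoji() function. It assumes the text has already been decoded, and it removes every character outside the Basic Multilingual Plane (the ones that need 4 bytes in UTF-8). That catches most common emojis, but it also drops rare non-emoji characters from the same range and leaves 3-byte emojis alone. The name strip_four_byte_chars is made up.

```python
import re

# Code points U+10000 and above: the characters that take 4 bytes in UTF-8.
ASTRAL = re.compile("[\U00010000-\U0010FFFF]")


def strip_four_byte_chars(text: str) -> str:
    """Remove every character that would need 4 bytes in UTF-8."""
    return ASTRAL.sub("", text)


cleaned = strip_four_byte_chars("Hi \U0001F600 there")
untouched = strip_four_byte_chars("caf\u00e9 \u263a")
```

Because the filter works on decoded text, it only behaves sensibly once the input really is UTF-8, which is why the order of operations matters.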

Maybe someday I will once again update the database collation to use utf8mb4_unicode_ci as opposed to the current utf8mb4_general_ci, which should allow these emojis, but even if I do, there will probably be a setting to exclude them anyway.


The future of the PDO edition of Sphider…

Sphider comes in two editions, the legacy version and a PDO version. The legacy version is definitely the more stable, faster, easier to maintain version. The PDO version exists primarily for those who are restricted by their shared hosting providers.

Shared hosting has its advantages in that it is very cost effective (cheap) and very simple to use. It is great for personal use or for small businesses or organizations just getting started on the web.

But shared hosting has its downsides, too. It isn’t nearly as efficient, isn’t as secure, suffers from limited resources, and has limited functionality. One of the features commonly lacking in shared hosting is MySQLnd. Thus the need for PDO.

There are quite a few users of the PDO edition, and to simply drop PDO would be a great disservice. On the other hand, trying to keep the PDO edition in sync with the legacy edition is getting harder and requiring much time and effort.

The PDO version, as it stands, is quite usable. It is PHP 7.3 compliant, so it should be reasonably set for a while, as the majority of shared hosting plans are still at least a few versions behind 7.3!

The thought is that it is time for legacy and PDO to part ways, with most future effort going into the legacy edition. Because of the user base, PDO version 2.4.0 would remain and receive hot fixes as needed.

No decision has been made and feedback will be given consideration.

Emojis and Sphider

Quite some time back, Sphider had an indexing issue when emojis were encountered on a web page. The SQL errors would fly! The solution at that time was to filter out emojis before storing in the database. This solution was working just fine, but admittedly the filter has not been updated and there are ALWAYS new emojis making their appearance.

While even the new emojis themselves have not been an issue, there was a very curious case of an emoji-free site in which the filter was clearing the entire full text of pages and storing — NOTHING! Well, that isn’t good. The workaround for that site was to disable the emoji removal function. Not an ideal fix, but very doable. As to WHY the function has this effect on that particular site is still a mystery.

But now may be the time to revisit the need for the filter in the first place. At the time the filter was installed, Sphider used the default MySQL utf8 scheme, which is 3-byte. Some emojis are 3-byte, but the vast majority are 4-byte, with even a few 8-byte emojis. You see the problem, don’t you? MySQL is not going to be happy when you try to stick a 4-byte character into 3 bytes!

Since that time, however, Sphider has moved to utf8mb4, which IS 4-byte. This means that the troublesome 4-byte characters WILL fit into the database. As to those 8-byte emojis, well, they are commonly composed of TWO 4-byte characters, which means: NO PROBLEM!
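The byte counts are easy to check for yourself. This short Python snippet (an illustration only, with sample characters I picked) encodes a 3-byte emoji, a 4-byte emoji, and a flag made of two 4-byte characters, and records how many bytes each takes in UTF-8.

```python
samples = {
    "white smiling face": "\u263A",          # a single 3-byte character
    "grinning face": "\U0001F600",           # a single 4-byte character
    "US flag": "\U0001F1FA\U0001F1F8",       # two 4-byte characters side by side
}

# Number of bytes each sample occupies once encoded as UTF-8.
byte_lengths = {name: len(text.encode("utf-8")) for name, text in samples.items()}

# Number of separate characters (code points) in each sample.
char_counts = {name: len(text) for name, text in samples.items()}
```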

The next version of Sphider, 2.4, is VERY near release. The emoji filter remains in place. But after serious thought and consideration, and some testing, this filter may be removed in the following release. It is logical, but how will it test out?

What to expect in Sphider 2.4.0

Sphider 2.4.0 is on track for an April 10th release. For the user, the changes are focused on cosmetics. Up until this point, search results ALWAYS had a result number and, after the description, a text url to the page containing the search result. In 2.4.0, you will have the option to either display or not to display those items. Also, the option to display the page’s indexing date has been added.

As to search templates, what were probably seven of the crappiest, lamest templates to have ever seen the light of day have been scrapped. Seven NEW templates are being introduced. Depending on your tastes, you might consider some of them crappy, too, but at least they have a bit of style to them. The “newspaper” template was introduced in an earlier post. Here are the other six:

“black” template
“green” template
“grey” template
“simple” template
“terminal” template
“yellow” template

The “green” style is, well, VERY GREEN! The purpose isn’t so much for actual use as to demonstrate the ability and flexibility of CSS in creating your own templates, even using an image as a border.

The “yellow” template features a bit of simple artwork in the upper left corner. This artwork is “logo.png”, located in the templates/yellow directory. The size is 150×150 and has a transparent background. By creating your own similarly sized logo/picture/artwork, and replacing “logo.png”, this template can be customized for your website.

Since everyone has different tastes, different needs, and every website is somewhat unique, these templates can serve as guides in customizing your own templates. With all the above, the ONLY thing different is the CSS.  Start with a copy of the “standard” template and start tweaking away! The basic Sphider modules remain the same.

Additionally in Sphider 2.4.0, the ‘settings’ table has been completely reworked. While this change is transparent to the user, it will make life much easier on the developer as Sphider moves forward.

Besides some minor fixes and tweaks, the only other big change is in the word stemming process. While the majority of Sphider users probably never use word stemming, those who do will be pleased to learn that the algorithm (for English) has been updated to Porter2. Completely new is the ability to use stemming for ten other languages!

The next Sphider is in the pipeline

Sphider 2.3.1 is brand new, but work has already begun on 2.4.0.

Among the features already being implemented is the ability to hide the result number when displaying search results. Also, for the regular text search, the option to display the index date is being added. (This will not be available for the image or RSS searches.) The RSS and image searches will have the option to turn off the advanced search features.

A new template is being added. Unlike nearly all the current templates, this one has some class. Here is a screen shot:

The Newspaper template

In the sample above, in “settings” the result number is turned off, the index date is turned on, and the description length has been increased to 1000.

Probably the biggest change will be transparent to the user. The “settings” table is being reworked. As Sphider has changed, so has the table, with new columns being appended on a regular basis. Now, while the position of columns within a table is totally immaterial to functionality, after a while it can be really confusing for the developer having to bounce all over the place to gather data. This change will organize the data in a regular flow which will be much easier to maintain going forward.

Other improvements are also being considered, but whether or not they are implemented at this time is yet to be determined. No release date has been set.

When 2.4.0 is released, whenever that may be, the downloads for the SQLite and PostgreSQL versions will likely be removed due to lack of demand.

Also, earlier thoughts of adding audio (mp3, wav, ogg) indexing support to Sphider have been dropped, also due to lack of demand. The actual indexing algorithm has been proven and sketched out, but there is no rationale for implementing it other than “Gee, that’s a neat feature.”

Sphider 2.3.1 Released

Sphider 2.3.0 principally addressed security concerns, but it also was intended to bring Sphider into PHP 7.2 compliance by removing any use of the deprecated each() function. The function was used extensively, and the majority of the code replacement was very run-of-the-mill and straightforward. In four places, however, the usage was atypical. Substitute code was put in place and tested. It seemed all worked well, as many sites were indexed and searches performed as expected.

Well! It seems indexing and searching was being done properly — but only for words composed of Western characters. Words utilizing non-Western characters were not being indexed! And any searches for those words not only returned as “not found” (expected since they weren’t indexed), those searches also complained of gibberish characters/words being either too short or too common.

Investigation of the issue led to three of the four code segments replacing the non-standard usage of the deprecated each() function. The code replacements themselves have been replaced in 2.3.1. Testing on the problem sites now shows that all words are being indexed, those containing Western characters as well as those containing non-Western characters. The search anomalies are gone and searches in non-Western languages are yielding expected results. If a search word really IS too short or too common, it is reported as such, and not as gibberish. Sphider is now truly PHP 7.2 compliant.

Both the legacy and PDO editions of Sphider 2.3.1 are available for download on this blog’s download page, or from the Sphider Home page.