Emojis revisited

Not very long ago, I wondered whether or not Sphider still needed to scan for, and remove, emojis. This came about because of a change in the database from 3-byte utf to 4-byte. Upon testing, the scan and removal of emojis will continue. Sphider, and more particularly, MySQL, just doesn’t like emojis. When trying to store any full text containing an emoji, an SQL exception is thrown and the page is not stored.

The earlier issue with the function that was reported is due to the removeEmoji() function operating on an utf-8 level, and the probability of the input NOT being utf-8. For future releases of Sphider, this function will be executing AFTER it is (nearly) guaranteed that the input will be utf-8. (I say “nearly” because there are no guarantees in this world where code is involved.)

It was also noted that the function, as currently implemented, is somewhat outdated.  While updated it is possible, the function would become a bit  unwieldy.  Leaving it alone is practical, however. This is because pages containing emojis are, while not rare, relatively uncommon. And withing the pages that DO contain an emoji, the odds are that emoji is of the simpler, more common type. The kind an expanded filter would catch ARE rare in web pages, being more likely to occur in messaging applications used in smart phones and tablets. In other words, why add the complexity to Sphider to catch something that the vast majority of users are never going to encounter?

Maybe someday I will once again update the database collation to use utf8mb4_unicode_ci as opposed to the current utf8mb4_general_ci, which should allow these emojis, but even if I do, there will probably be a setting to exclude them anyway.


Content-Type meta tags and HTTP response headers

How many of us have used a meta tag to define content type and default character sets? The tag may appear something like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

But do we REALLY understand what is going on? This tag is important when a webpage is being opened locally. It instructs the browser as to what character encoding to use to display the page. This may override the platform default.

But what about when a page is being viewed by HTTP? Well, the tag is important if the HTTP response header(s) being sent fail to designate a default character encoding. What if the response header(s) DO include a default character set? AHHH! The the meta tag is (are you ready for this?)… IGNORED!

Let’s say you designate a page, via meta tag, to have a character set of UTF-8, but your web server is sending a response header setting the default as Windows-1252. Your page is going to display in Windows-1252!

And guess what? Your page, viewed over HTTP,  just may still appear correctly giving you the impression that the meta tag is working! Then you force your browser to actually display in UTF-8 and that beautiful page suddenly becomes what is referred to as “mojibake!”

There are at least a couple ways to get this all sorted out. If you are coding in PHP, one way would be to set the response header in the code for each page. Here is an example PHP header:
header('Content-Type: text/html; charset=utf-8');
This needs to appear in the PHP BEFORE a single bit of HTML is displayed.

Another way is if you have access to your server settings, you can specify a default character set.

Still another way, with Apache servers, is to specify a default character set in you .htaccess file.
AddDefaultCharset UTF-8

So…. knowing all this, just HOW do you go about confirming that the character set you want is the character set actually being set? With Firefox/Waterfox/SeaMonkey, bring up the page in question. Up in the url display area, to the left of the url, click on the little circle with the upside-down “!”. There will be information on whether or not the connection is secure, then a “>”. Click on that. Click on “More information”, the the “General” tab.  This will display the text-encoding AND the meta tags. If they don’t agree, the response header being sent isn’t what you want it to be. This applies to Waterfox in Linux, also.

Google Chrome USED to allow the option to see what the default character set REALLY is, but they removed it. Fortunately, there is an extension that does it for you. The extension is simply named “Charset”, and allows you to not only see what the actual character set is for a given page, allows you to change it. The results may be an eye opener. BTW, this applies to Linux Chromium as well.

What about IE/Edge? You’re on your own! I won’t touch those monstrosities! LOL!!

The future of the PDO edition of Sphider…

Sphider comes in two editions, the legacy version and a PDO version. The legacy version is definitely the more stable, faster, easier to maintain version. The PDO version exists primarily for those who are restricted by their shared hosting providers.

Shared hosting has its advantages in that it is very cost effective (cheap) and very simple to use. It is great for personal use or for small businesses or organizations just getting started on the web.

But shared hosting has its downsides, too. It isn’t nearly as efficient, isn’t as secure, suffers from limited resources, and has limited functionality. One of the features commonly lacking in shared hosting is MySQLnd. Thus the need for PDO.

The are quite a few users of the PDO edition, and to simply drop PDO would be a great disservice. On the other hand, trying to keep the PDO edition in sync with the legacy edition is getting harder and requiring much time and effort.

The PDO version, as it stands, is quite usable. It is PHP 7.3 compliant, so it should be reasonably set for awhile, as the majority of shared hosting plans are still at least a few versions behind 7.3!

The thought is that the time for legacy and PDO to part paths, with most future effort going into the legacy edition. Because of the user base, PDO version 2.4.0 would remain and receive hot fixes as needed.

No decision has been made and feedback will be given consideration.

Emojis and Sphider

Quite sometime back, Sphider had an indexing issue when emojis were encountered on a web page. The sql errors would fly! The solution at that time was to filter out emojis before storing in the database. This solution was working just fine, but admittedly the filter has not been updated and there are ALWAYS new emojis making their appearance.

While even the new emojis themselves have not been an issue, there was a very curious case of an emoji-free site in which the filter was clearing the entire full text of pages and storing — NOTHING! Well, that isn’t good. The workaround for that site was to disable the emoji removal function. Not an ideal fix, but very doable. As to WHY the function has this effect on that particular site is still a mystery.

But now may be the time to revisit the need for the filter in the first place. At the time the filter was installed, Sphider used the default MySQL utf8 scheme, which is 3-byte. Some emojis are 3-byte, but the vast majority are 4-byte, with even a few 8-byte emojis. You see the problem, don’t you? MySQL is not going to be happy when you try to stick a 4-byte character into 3 bytes!

Since that time, however, Sphider has moved to utf8_mb4, which IS 4-byte. This means that the troublesome 4-byte characters WILL fit into the database. As to those 8-byte emojis, well they are commonly composed of TWO 4 byte characters, which means — NO PROBLEM!

The next version of Sphider, 2.4, is VERY near release. The emoji filter remains in place. But after serious thought and consideration, and some testing, and this filter may be removed in the following release.  It is logical, but how will it test out?