Emojis and Sphider

Quite some time back, Sphider had an indexing issue when emojis were encountered on a web page. The SQL errors would fly! The solution at that time was to filter out emojis before storing the text in the database. This solution has worked just fine, but admittedly the filter has not been updated, and there are ALWAYS new emojis making their appearance.

While even the new emojis themselves have not been an issue, there was a very curious case of an emoji-free site in which the filter was clearing the entire full text of pages and storing... NOTHING! Well, that isn’t good. The workaround for that site was to disable the emoji removal function. Not an ideal fix, but very doable. WHY the function has this effect on that particular site is still a mystery.

But now may be the time to revisit the need for the filter in the first place. At the time the filter was installed, Sphider used the default MySQL utf8 character set, which stores at most 3 bytes per character. Some emojis are 3-byte, but the vast majority are 4-byte, and a few even take up 8 bytes. You see the problem, don’t you? MySQL is not going to be happy when you try to stick a 4-byte character into 3 bytes!
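To make the byte counts concrete, here is a minimal sketch in Python (used purely for illustration; Sphider itself is PHP). The helper names `utf8_len` and `fits_in_3_byte_utf8` are made up for this example and are not part of Sphider or MySQL.

```python
def utf8_len(text: str) -> int:
    """Return how many bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_3_byte_utf8(text: str) -> bool:
    """True if every character needs 3 bytes or fewer in UTF-8.

    This is only an approximation of what a 3-byte utf8 column can hold;
    it is an illustration, not a copy of MySQL's own validation.
    """
    return all(utf8_len(ch) <= 3 for ch in text)


samples = {
    "plain letter": "A",
    "sun symbol (3-byte emoji)": "\u2600",
    "grinning face (4-byte emoji)": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s), fits in 3-byte utf8: {fits_in_3_byte_utf8(ch)}")
```

A check along these lines flags exactly the characters that would have triggered the old SQL errors, which is the job the emoji filter was doing.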

Since that time, however, Sphider has moved to utf8mb4, which IS 4-byte. This means that the troublesome 4-byte characters WILL fit into the database. As to those 8-byte emojis, well, they are commonly composed of TWO 4-byte characters, which means: NO PROBLEM!
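The "8-byte emoji" case can be sketched the same way. The Python snippet below (again just an illustration, with a hypothetical helper named `code_point_sizes`) breaks a flag emoji into its individual characters and shows that each one needs no more than 4 bytes on its own.

```python
def code_point_sizes(text: str) -> list[tuple[str, int]]:
    """List each character as (code point label, UTF-8 byte count)."""
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


# The US flag emoji is two "regional indicator" characters side by side.
flag = "\U0001F1FA\U0001F1F8"

sizes = code_point_sizes(flag)
total_bytes = sum(size for _, size in sizes)

print(sizes)        # one entry per character in the sequence
print(total_bytes)  # bytes for the whole emoji as stored
```

Since a 4-byte character set looks at one character at a time, the 8-byte total is never a problem; only the size of each individual character matters.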

The next version of Sphider, 2.4, is VERY near release. The emoji filter remains in place. But after serious thought and consideration, and some testing, this filter may be removed in the following release. It is logical, but how will it test out?