Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Search - "iso-8859-8-i"
-
*wrestling commentator voice*
"In this weeks episode of encoding hell:
The iiiinnnfamous UTF-8 Byte Order Mark veeeersus PHP!"
For an online shop we developed, there is currently a CSV upload feature in review by our client. Before we developed this feature, we created together with the client a very precise specification, including the file format and encoding (UTF-8).
After the first test day, the client informed us, that there were invalid characters after processing the uploaded file.
We checked the code and compared the customer's file with our template.
The file was encoded in ISO-8859-1 and NOT as specified UTF-8.
But what ever, we had to add an encoding check, thus allowing both encodings from now on.
Well well well welly welly fucking well...
Test day 2: We receive an email from said client, that the CSV is not working, again.
This time: UTF-8 encoding, but some fields had more colums with different values than specified.
Fucking hell.
We tell the customer that.
(I was about to write a nice death threat novel to them, but my boss held me back)
Testing day 3, today:
"The uploading feature is not working with our file, please fix it."
I tried to debug it, but only got misleading errors. After about 30 minutes, at 20 stacks of hatered, I finally had an idea to check the file in a hex editor:
God fucking what!?!!?!11?!1!!!?2!!
The encoding was valid UTF-8, all columns and fields were correct, but this time the file contained somthing different.
Something the world does not need.
Something nearly as wasteful as driving a monster truck in first gear from NYC to LA.
It was the UTF-8 Byte Order Mark.
3 bytes of pure hell.
Fucking 0xEFBBBF.
The archenemy of PHP and sane people.
If the devil had sex with the ethernet port of a rusty Mac OS X Server, then 9 microseconds later a UTF-8 BOM would have been born.
OK, maybe if PHP would actually cope with these bytes of death without crashing, that would be great.3 -
Crazy... Hm, that could qualify for a *lot*.
Craziest. Probably misusage or rather "brain damaged" knowledge about HTTP.
I've seen a lot of wild things when devs start poking standards, but the tip of the iceberg was someone trying to use UTF-8 in headers...
You might have guessed it - German umlauts. :(
Coz yeah. Fucktard loved writing everything in german, so why not write custom header names in german.
The fun thing is: It *can* work, though the usual sane thing is to keep it in ASCII range for the obvious reason that using UTF-8 (or ISO-8859-1, which is *not* ASCII) is a gamble you gonna loose.
The fun game was that after putting in a much needed load balancer between services for monitoring / scaling etc suddenly *something* seemed off.
It took me 2 days and a lot of Wireshark hoola hooping to find out why, cause the header was used for device detection aka wether it's a bot or not. Or in the german term the dev used: "Geräte-Art".
As the fallback was to assume a bot, but only rate limit based on IP, only few managed to achieve the necessary rate limit to get blocked.
So when I say *something* seemed off, I really mean a spooky kind of "sometimes IP blocked for seemingly no reason at all".
Fun stuff. The dev btw germanized everything. Untangling the code base was a lot of non fun. -.-6 -
So for those of you keeping track, I've become a bit of a data munger of late, something that is both interesting and somewhat frustrating.
I work with a variety of enterprise data sources. Those of you who have done enterprise work will know what I mean. Forget lovely Web APIs with proper authentication and JSON fed by well-known open source libraries. No, I've got the output from an AS/400 to deal with (For the youngsters amongst you, AS/400 is a 1980s IBM mainframe-ish operating system that oriiganlly ran on 48-bit computers). I've got EDIFACT to deal with (for the youngsters amongst you: EDIFACT is the 1980s precursor to XML. It's all cryptic codes, + delimited fields and ' delimited lines) and I've got legacy databases to massage into newer formats, all for what is laughably called my "data warehouse".
But of course, the one system that actually gives me serious problems is the most modern one. It's web-based, on internal servers. It's got all the late-naughties buzzowrds in web development, such as AJAX and JQuery. And it now has a "Web Service" interface at the request of the bosses, that I have to use.
The programmers of this system have based it on that very well-known database: Intersystems Caché. This is an Object Database, and doesn't have an SQL driver by default, so I'm basically required to use this "Web Service".
Let's put aside the poor security. I basically pass a hard-coded human readable string as password in a password field in the GET parameters. This is a step up from no security, to be fair, though not much.
It's the fact that the thing lies. All the files it spits out start with that fateful string: '<?xml version="1.0" encoding="ISO-8859-1"?>' and it lies.
It's all UTF-8, which has made some of my parsers choke, when they're expecting latin-1.
But no, the real lie is the fact that IT IS NOT WELL-FORMED XML. Let alone Valid.
THERE IS NO ROOT ELEMENT!
So now, I have to waste my time writing a proxy for this "web service" that rewrites the XML encoding string on these files, and adds a root element, just so I can spit it at an XML parser. This means added infrastructure for my data munging, and more potential bugs introduced or points of failure.
Let's just say that the developers of this system don't really cope with people wanting to integrate with them. It's amazing that they manage to integrate with third parties at all...2 -
Writing hebrew in Latex using a template that doesn't support UTF-8 is the most archaic shit I've done lately. I feel like some sort of a caveman.
This fucking encoding inverts all letters so it can support right to left. 😓4 -
Who, more than I, totally HATE emoji?
lol I hate emoji after it caused so much problems with Microsoft Outlook and email backups from said program combined with emoji in subjects.
Wrote an subject filter in exim4 (took 3 days to debug and get working propely) that totally eradicate anything that isnt ISO-8859-1 from the subject line, then converts the rest to UTF-8 (because said IMAP client isnt following standards).
it also converts ISO-8859-1 characters in subjects to UTF-8 even if the original subject is declared to be UTF-8, because obviously some software (especially newsletter software) are transmitting ISO-8859-1 subjects that are declared to be in UTF-8 (but the opposite isn't true).
And also cuts subject to 100 chars, because too long subjects are a problem too. Same with date headers, I replace them with the server date/time because some software are sending Date: 1970 Jan 01 00:00:00, because some of these erronous headers are put by some mailing list software, aswell as causing problem in OEM clients like Samsung Mail.
Problem solved, all IMAP clients happy on internal network.7