Marketing Factory Digital GmbH

[Image: "Willkommen" in mehreren Sprachen ("Welcome" in several languages)]
  • Development
  • Tutorial
06.07.2020

Unicode, ISO-8859-1 and even more character salad


Today we want to look at the topic of content from a completely different perspective, namely its presentation. No, no, I don't mean which layout is used or which color scheme is chosen. I'm talking about script, i.e. the legible presentation of text, and, of course, about all the pitfalls that lurk here in IT and especially on the web.

Off to the good ol' days

Let's look back at the year 1993: Non-Latin scripts were still hardly common in our latitudes. IT (at least in Germany) was still called EDV (Elektronische Datenverarbeitung, i.e. electronic data processing), all employees looked at 80-character-wide text screens (whether they showed light gray, green or amber text on a black background was up to you), and the needs of the German data processing guild often ended at the Oder-Neisse line (translator's note: this is the border between Germany and Poland; German IT specialists often cared only about their own problems and paid little attention to other languages). With ASCII you could easily get through all situations - 7 or 8 bits per character, depending on the definition - and 640K ought to be enough for everyone anyway.

Desktop publishing (DTP) was already a thing, but it wasn't until the widespread introduction of graphical operating systems (Windows 3.11 sends its regards) that this technology became suitable for the masses. Now every school newspaper was designed on a PC and TrueType conquered the world. The western world spoke Latin-1 (ISO-8859-1 to be exact) throughout and everyone was happy. Until, at the turn of the millennium, the euro and its € were introduced and all of Europe had to deal with something like character sets.

Suddenly it was important that the corporate typeface was "€-capable", and the first hacks were hastily worked out. Who knows how long after that an "=" was diligently printed over a "C" because the systems and fonts in use did not yet know the character?
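The € problem is easy to reproduce today. The following is a minimal Python sketch, not taken from the original article: the helper name `can_encode` is made up for illustration, and it relies only on Python's built-in `ascii` and `latin-1` codecs.

```python
EURO = "\u20ac"      # the euro sign
E_ACUTE = "\u00e9"   # the letter é


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # é is not part of 7-bit ASCII, but Latin-1 (ISO-8859-1) knows it.
    print("é in ASCII:  ", can_encode(E_ACUTE, "ascii"))
    print("é in Latin-1:", can_encode(E_ACUTE, "latin-1"))
    # The euro sign has no slot in Latin-1 at all.
    print("€ in Latin-1:", can_encode(EURO, "latin-1"))
```

The point of the sketch is only that a fixed 256-slot table has no room for a character that was not planned for.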

Unicode to the rescue!

After a single new character had turned the digital scene upside down, it became obvious that a more sustainable solution was needed. That solution is Unicode. Since it was clear that the 256 values a single byte can take would never suffice to represent all the writing systems in the world, the standard thinks big from the start: it provides room for 1,114,112 theoretically possible code points, of which approximately 143,000 are currently assigned to characters. Put simply, Unicode takes all known characters and numbers them. Each character gets a number, its so-called code point, and you can look up each character in the big list, e.g. No. 129335 (U+1F937) for 🤷. For reasons of backwards compatibility, the first 128 entries are identical to ASCII.
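The "big numbered list" can be queried directly from Python, whose strings are sequences of Unicode code points. This is only an illustrative sketch; the helper `describe` is a made-up name, while `ord`, `chr` and `sys.maxunicode` are standard Python.

```python
import sys


def describe(char: str) -> str:
    """Return the code point of a single character in U+ notation and decimal."""
    number = ord(char)
    return f"U+{number:04X} (No. {number})"


# Total number of theoretically possible code points (0 .. 0x10FFFF).
TOTAL_CODE_POINTS = sys.maxunicode + 1

if __name__ == "__main__":
    print(describe("\U0001F937"))   # the shrug emoji from the text
    print(describe("\u20ac"))       # the euro sign
    print(describe("A"))            # ASCII letters keep their old numbers
    print(chr(129335))              # and back again: number -> character
    print(TOTAL_CODE_POINTS)
```

Note that this only deals with the numbering itself, not with how those numbers are stored as bytes.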

Lesson learned: As we already know from IPv4, things always get exciting in IT when someone says: "This will do." 🙂

At this point, this story could be over. However, there are still two problems.

Text and writing are not so simple, ...

First, let's start with the more complicated one: To do this, we must first briefly take off our Germanic glasses and ask ourselves: What exactly is a "character"?

The first answer is most likely: Well, probably a letter. But one look west of Alsace and the answer becomes more complicated. In addition to the "e", the French also know the "é" (with accent aigu) and the "è" (with accent grave). On a German keyboard you type these characters in two steps: first the accent and then the "e". Unicode can map this in a similar way (although there the accent follows the base letter), i.e. there is a code point for the "e" and separate code points for the accents, which belong to the so-called diacritical marks.

The slyboots might now say: "Wait, just a moment ago we were told that ASCII characters can still be used and an 'é' certainly already existed in DOS times".

That is also true. In fact, many characters can be found in Unicode several times and in several combinations, e.g. as in the example above, both as a complete package and as individual parts. The complete packages mostly exist for compatibility with older character sets such as ISO-8859-1. The individual parts exist because many writing systems, for example in Asia, combine characters in a great many ways. It is therefore worthwhile to represent diacritical marks separately, so that fewer combinations and thus fewer code points have to be defined. The print experts among you may forgive me for not going into ligatures etc. at this point, some of which also exist as separate Unicode code points. But that would really lead too far now.
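The two spellings of "é" can be compared with Python's standard `unicodedata` module. This is a small sketch for illustration only. The variable names are made up, and the normalization forms NFC (composed) and NFD (decomposed) are a Unicode concept that the text above does not name explicitly.

```python
import unicodedata

PRECOMPOSED = "\u00e9"   # é as one "complete package"
DECOMPOSED = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both into composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # Both variants look identical on screen ...
    print(PRECOMPOSED, DECOMPOSED)
    # ... but consist of a different number of code points,
    print(len(PRECOMPOSED), len(DECOMPOSED))
    # so a naive comparison fails while a normalized one succeeds.
    print(PRECOMPOSED == DECOMPOSED)
    print(same_text(PRECOMPOSED, DECOMPOSED))
```

Normalizing before comparing, searching or storing is one common way to keep the "several times in several combinations" problem in check.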

… but would you have thought they were so complex?

Another example of how complex all this is: If two Hindi speakers were to greet each other in a WhatsApp chat, they might do so with the word नमस्ते - better known to us in its Latinized form "namaste". Now the prize question: How many characters is that?

Hmm. We select the little word step by step with the mouse and possibly notice, as I do (Chrome on macOS), that the cursor "clicks" three times; so maybe three characters? Unfortunately wrong. The correct answer is, as so often: It depends 🙂

Namely, on whether by "characters" you mean Unicode code points (the indivisible units, the atoms of writing, so to speak): then the answer would be six. Or whether you would rather count so-called Unicode grapheme clusters. A grapheme cluster in terms of Unicode does not mean a pile of tabloid newspapers in the waste paper container, but a base character plus all the diacritical marks that complement it. In the end, this comes quite close to the familiar "letter". Counted this way, we arrive at four such clusters: "न", "म", "स्" and "ते".

So how do you arrive at three? The unsatisfying truth: by the rules just described, ideally not at all, because then three is wrong. Unicode support beyond Western languages is often still incomplete and prone to bugs. Hordes of developers have been driven to despair by the fact that a string like the one above consists of four characters (always think of the clusters when you say "characters"!), which are composed of six Unicode code points and then occupy 18 bytes when encoded in UTF-8.
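The three ways of counting can be reproduced with a few lines of Python. Please treat this as a rough sketch: the function `simple_clusters` is a made-up helper that merely attaches every combining mark (Unicode general category starting with "M") to the preceding base character. It follows the simplified description given above and is not a full implementation of the official Unicode text segmentation rules, which may group some scripts differently.

```python
import unicodedata

# नमस्ते, written with escapes so the individual code points are visible.
NAMASTE = "\u0928\u092e\u0938\u094d\u0924\u0947"


def simple_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplified assumption: any character whose general category starts
    with "M" (Mn, Mc, Me) belongs to the previous cluster.
    """
    clusters: list[str] = []
    for char in text:
        if clusters and unicodedata.category(char).startswith("M"):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


CODE_POINT_COUNT = len(NAMASTE)                  # Python counts code points
UTF8_BYTE_COUNT = len(NAMASTE.encode("utf-8"))   # bytes after UTF-8 encoding
CLUSTERS = simple_clusters(NAMASTE)

if __name__ == "__main__":
    print("code points:", CODE_POINT_COUNT)
    print("UTF-8 bytes:", UTF8_BYTE_COUNT)
    print("clusters:   ", len(CLUSTERS), CLUSTERS)
```

For production use, a library that implements the complete segmentation rules is the safer choice; the sketch only illustrates why "length of a string" has more than one answer.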

In the next blog post of this series, I'll bring the computer into play and tell you why every Unicode-plagued developer sooner or later needs a Schei�-Encoding-T-Shirt (translator's note: roughly "shitty encoding T-shirt") 😉

You wear such a t-shirt out of conviction? Then apply with us!

Christian Spoo

"Mr. Fix-It" likes to impose his will on software and hardware. Speaks fluent meme and picdump. Responsible for development and technical design at Marketing Factory.


All parts of this blog series

  1. Unicode, ISO-8859-1 and even more character salad
  2. UTF-8 and the question how Unicode gets into the computer


© Marketing Factory Digital GmbH

Picture Credits
  1. "Willkommen in mehreren Sprachen": Tumisu / License: Pixabay License (CC0 1.0)