Glyphs and bytes come out to play: Unicode presentation and AGM, York, 20th June 2015 – Edwina Simpson writes:

Always a pleasure to visit York, even on a slightly unsettling grey but warm day. However, it was good to be out of the drizzle, and by the time I arrived at the Friends Meeting House on Friargate, just along from Clifford’s Tower, fellow YTI’ers were already starting to tuck into the lunch goodies provided. A few of us eyed up the piano – which of the committee was going to do a musical turn?

First on the agenda for the afternoon was Chris Thompson’s presentation on Unicode and how it enables people around the world to use computers in any language. I imagine we have all wondered why the words we read on screen are sometimes rendered unintelligible by the appearance of unexpected symbols. This can be amusing if you’re in the right mood, but is obviously something to be avoided.

Chris talked us through how, in the early days of computing, confusion arose because there was no standard representation of letters or symbols. In 1963, an attempt was made with the ASCII code to pin down the letters and punctuation marks of English by allocating one number for each character. There was a bit of maths at this point (we cowered in our seats) to explain the binary numbering that computers use – for the record, each 1 or 0 in binary code is a bit; 8 bits make a byte.

The ASCII code of 1963 was based on the English alphabet and had 128 code numbers covering letters, numbers, punctuation and control characters (e.g. ‘carriage return’), but was soon extended to include accented letters with diacritical marks. In typographical terms, Chris informed us, the visible shapes of letters and other symbols are known as glyphs, whereas Unicode deals in characters: each entry describes the meaning of the character without describing the shape, size or colour. So A is described as Latin Capital Letter A, and ε is Greek Small Letter Epsilon, etc. Have a look when you’re next in Word; the description is towards the bottom left of the Symbols window.
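For anyone who would rather see this than take it on trust, here is a minimal sketch in Python, using only its standard `unicodedata` module. The function name `describe` is made up for this illustration; it simply looks up the code number and official name of a character, and the last line shows the code number for A written out as the 8 bits of one byte.

```python
import unicodedata


def describe(ch):
    """Return (code number, official Unicode name) for one character."""
    return ord(ch), unicodedata.name(ch)


# "\u03b5" is the Greek small letter epsilon, written as an escape
# so that the example does not depend on how this file is encoded.
print(describe("A"))        # (65, 'LATIN CAPITAL LETTER A')
print(describe("\u03b5"))   # (949, 'GREEK SMALL LETTER EPSILON')

# The code number for A as 8 binary digits (bits), i.e. one byte.
print(format(ord("A"), "08b"))   # 01000001
```

The names printed here are the same descriptions that appear in the Symbols window in Word.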

However, there were multiple encodings across systems such as DOS, Windows and Mac, with the same code number meaning different things from one system to the next, and so Unicode entered the fray as a single standard for all scripts, providing a unique number for every character irrespective of platform, program and language. The Unicode Consortium is a non-profit organisation that includes industry giants such as Adobe, Apple, Google, IBM, Microsoft, Oracle, SAP and Yahoo, but also government ministries, user groups and others.

What’s known as the Basic Multilingual Plane (BMP) offers 65,536 code numbers (enough for all the current major scripts of the world), and Chris was able to show us a rather wonderful, mind-boggling poster containing the current configuration.

In addition to the BMP, there are code allocations for other pictographs, historical scripts, musical notation etc. Proposals for new additions to the standard are submitted to the Unicode Technical Committee. Not everything is approved – apparently, a Klingon script was rejected, one reason being the ‘Lack of evidence of usage in published literature’!

Just as we were breathing more easily at the thought of this all-encompassing standard solution, Chris noted that there were variations: Russian, Chinese, Indian and Japanese standards all exist, some of which partly coincide with Unicode.

Also, a variety of encodings are available to deal with the practical difficulties arising from the sheer volume of characters that need to be represented. But Chris kindly advised us that we needn’t panic, and that the UTF-8 encoding is the one we are most likely to meet.
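To give a flavour of what UTF-8 does, the following Python sketch (an illustration only; the helper name `utf8_bytes` and the four sample characters are my own choices) encodes a plain letter, an accented letter, the euro sign and a musical symbol. UTF-8 uses between one and four bytes per character, so everyday English text stays compact while everything else still fits.

```python
def utf8_bytes(ch):
    """Encode one character as UTF-8 and return the bytes as hex text."""
    return ch.encode("utf-8").hex()


# A, e-acute, euro sign, musical G clef (written as escapes for safety).
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

for ch in samples:
    encoded = utf8_bytes(ch)
    # Two hex digits make one byte.
    print(f"U+{ord(ch):04X} -> {encoded} ({len(encoded) // 2} byte(s))")

# U+0041 -> 41 (1 byte(s))
# U+00E9 -> c3a9 (2 byte(s))
# U+20AC -> e282ac (3 byte(s))
# U+1D11E -> f09d849e (4 byte(s))
```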

After this, Chris gave us some practical examples of the trouble that arises when operating systems and fonts are not up to date with the most recent Unicode version or where there is incompatibility between encodings. These are the question marks, lozenges and other oddities that we have no doubt seen popping up in, say, a website, where the browser was expecting one encoding but was given another.
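The effect is easy to reproduce. The Python sketch below is an illustration under my own assumptions: the word "café", the choice of UTF-8 and Latin-1 as the two mismatched encodings, and the helper name `misread` are all made up for the example. It writes text in one encoding, reads it back as another, and also shows where the stray question mark and the lozenge (the Unicode replacement character, U+FFFD) come from.

```python
def misread(text, written_as="utf-8", read_as="latin-1"):
    """Encode text one way, then decode the same bytes another way."""
    return text.encode(written_as).decode(read_as)


original = "caf\u00e9"                      # café

# UTF-8 bytes read as Latin-1: the two bytes of é become two characters.
garbled = misread(original)                 # 'cafÃ©'

# Reversing the mistake recovers the text, provided no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")

# An encoding with no slot for é falls back to a question mark.
question_mark = original.encode("ascii", errors="replace")   # b'caf?'

# Latin-1 bytes read as UTF-8: the invalid byte becomes U+FFFD.
lozenge = b"caf\xe9".decode("utf-8", errors="replace")        # 'caf\ufffd'

print(repaired == original)   # True
```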

Chris also pointed out some particular downsides of a wrong character being used. If a character is used that looks like, say, a degree sign ( ° ), but is in fact slightly different ( º ), any future users of the text will have difficulty finding the character in a search at a later date. Even more problematically, those of malicious intent may take opportunities to fake websites by using lookalike characters in the domain name.
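A short Python sketch makes both problems visible. It assumes the two characters in the example are the degree sign (U+00B0) and the masculine ordinal indicator (U+00BA); the helper name `names` and the word "bank" with a Cyrillic letter swapped in are hypothetical examples of my own.

```python
import unicodedata


def names(text):
    """List the official Unicode name of every character in text."""
    return [unicodedata.name(ch) for ch in text]


degree, ordinal = "\u00b0", "\u00ba"
print(names(degree + ordinal))
# ['DEGREE SIGN', 'MASCULINE ORDINAL INDICATOR']

# Searching for a real degree sign misses the lookalike entirely.
print("20 \u00baC".find(degree))   # -1

# A lookalike word: the second letter is Cyrillic, not Latin.
latin, mixed = "bank", "b\u0430nk"
print(latin == mixed)              # False
print(names(mixed)[1])             # CYRILLIC SMALL LETTER A
```

To the eye the pairs are near enough identical; to the computer they are simply different code numbers.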

Somehow, Chris managed to distil his years of hard-won experience in grappling with transforming formats in databases into just an hour or so, and, no, I don’t think any of us did fall asleep despite the hints written into his text. We were also grateful for the cup of tea!

The AGM followed, with all the necessary procedural matters dispatched in customary fashion, and it was good to be reminded of all the varied events that the committee had organised during the year and some of the plans for upcoming activities. The finances are in reassuringly good shape and the committee was asked to look at ways of spending some of our surplus – always a difficult task! Finally, Catherine Greensmith was given a vote of thanks on retiring from the committee after her 20+ consecutive years of service!

Many thanks to our coordinators and committee for organising this event and for all their hard work helping the YTI group to flourish, and to Chris for his very entertaining and useful presentation.