Introduction
I’ve been doing quite a bit of work with character encoding in MARC records. While MARC records are serialized as MARC21 (if not XML), there is still a choice of character encoding between MARC8 and Unicode. So, what is MARC8?
MARC8 Encoding
Before there was Unicode, librarians already had to deal with the large number of non-English characters in their records. Out of this need, the MARC-8 encoding standard was created in 1968. It continues to be supported in library systems and is still maintained by the Library of Congress.
The first 128 standard ASCII characters remain unchanged by this character encoding, so most of the time the intricacies of the encoding scheme can be ignored. However, once diacritics appear in a field, a naive conversion to Unicode produces strange results. This is because MARC-8 uses ‘combining characters’. A combining character is placed before a standard character and indicates that a diacritic should be added to it. A single base character can be preceded by more than one combining character.
For example, the byte 0x6F (or 111) maps to “o” in ASCII and MARC8. If you preface it with a combining character like 0xE3 (or 227), then it maps to “ô”. Put the same combining character in front of “z” and you get “ẑ”.
Unfortunately, if you don’t plan for encoding translation, the combining character will be treated as an unreadable character, followed by a plain “z”. So, saving MARC-8 data to a Unicode system without converting it results in a scramble of letters.
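To make the reordering concrete, here is a minimal sketch in Python (not the C# library's code). It assumes a deliberately partial table of MARC-8 combining bytes, and the function name marc8_to_unicode is made up for illustration. MARC-8 puts the combining byte before the base letter, while Unicode puts the combining mark after it, so the marks are held back until the base character arrives.

```python
import unicodedata

# Partial table, for illustration only: a few MARC-8 combining bytes
# mapped to the matching Unicode combining marks.
MARC8_COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
}

def marc8_to_unicode(data: bytes) -> str:
    """Convert ASCII plus the combining bytes above into a Unicode string."""
    out = []
    pending = []  # combining marks waiting for their base character
    for b in data:
        if b in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[b])
        elif b < 0x80:
            out.append(chr(b))   # base character first ...
            out.extend(pending)  # ... then its marks, as Unicode expects
            pending.clear()
        else:
            raise ValueError(f"byte 0x{b:02X} is not covered by this sketch")
    if pending:
        raise ValueError("combining character with no base character")
    # NFC folds base + mark into a single precomposed character where one exists.
    return unicodedata.normalize("NFC", "".join(out))

print(marc8_to_unicode(bytes([0xE3, 0x6F])))  # circumflex + "o"
print(marc8_to_unicode(bytes([0xE3, 0x7A])))  # circumflex + "z"
```

By contrast, decoding the same two bytes with a single-byte code page such as Latin-1 turns 0xE3 into an unrelated letter and leaves the base letter bare, which is the scramble described above.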
The MARC-8 standard also allows different character sets to be referenced. These are essentially code pages which are loaded for quite different alphabets.
Encoding in a MARC21 Record
So, how do you know which type of character encoding a MARC21 record uses? By inspecting position 09 of the leader (positions are counted from zero, so this is the tenth character). If it is a space (often shown as a pound sign on sites which explain MARC21 leaders), the record uses MARC8 character encoding. If it is an ‘a’, the record is encoded as “UCS/Unicode”. This almost always means UTF-8 encoding.
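As a rough illustration, the leader check can be sketched in a few lines of Python. The function name character_coding_scheme and the sample leaders are made up for this example. Leader positions are zero-based, so position 09 is index 9.

```python
def character_coding_scheme(leader: str) -> str:
    """Report the character coding scheme named in a MARC21 leader."""
    if len(leader) != 24:
        raise ValueError("a MARC21 leader is exactly 24 characters long")
    code = leader[9]  # leader position 09, counted from zero
    if code == " ":
        return "MARC-8"
    if code == "a":
        return "UCS/Unicode"
    raise ValueError(f"unexpected value in leader position 09: {code!r}")

# Hypothetical sample leaders that differ only in position 09.
print(character_coding_scheme("00714cam a2200205 a 4500"))
print(character_coding_scheme("00714cam  2200205 a 4500"))
```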
If alternate character sets are used, then the sets are referenced in the 066 field within a MARC21 record.
SobekCM MARC Library (C#)
The newest version of the SobekCM MARC Library includes the most common combining characters and a mapping from MARC8 to Unicode/UTF-8, although alternate character sets are still not supported. Libraries which deal heavily in alternate character sets are more likely to be aware of character encoding issues and export in Unicode encoding. What is more common is to have just a smattering of characters which decode incorrectly.
A quick survey of some other solutions found that it is fairly common to have no handling for characters above the ASCII range at all, let alone any translation. That is because strings are Unicode strings in most modern languages. So, any solution which relies heavily on strings will decode these characters incorrectly, and can fail completely, because the MARC21 directory locates each field by byte offset and length, and those offsets no longer line up once multi-byte characters have been decoded into a string.
In this library, to add full support for character encoding, all strings and StreamReader classes had to be replaced with byte arrays (or MemoryStreams over bytes) and the BinaryReader class. Another interesting quirk of C# is that the BinaryReader.Read() method actually reads the underlying stream as characters! So, this still failed. It was only when the BinaryReader.ReadByte() method was used that the true byte stream (untouched by any encoding) was accessed.
My solution takes a mapping for combining characters and implements it in code, so incoming records which are MARC8 encoded are transformed, during the parsing process, into correctly encoded Unicode strings. This provides support for the combining-character portion of MARC8.
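The directory problem is easier to see in code. The following is a minimal Python sketch (again, not the C# library's code) that reads fields using the directory's byte offsets. The function name read_fields is made up for this example, and indicators and subfields are ignored. Each directory entry is 12 bytes: a 3-digit tag, a 4-digit field length, and a 5-digit starting position relative to the base address stored in leader positions 12 to 16.

```python
def read_fields(record: bytes):
    """Yield (tag, raw_field_bytes) pairs using the directory's byte offsets."""
    base = int(record[12:17])        # leader 12-16: base address of the data
    directory = record[24:base - 1]  # directory sits between leader and data
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])     # field length in BYTES
        start = int(entry[7:12])     # offset in BYTES from the base address
        data = record[base + start:base + start + length]
        yield tag, data.rstrip(b"\x1e")  # drop the field terminator
```

Because the lengths and offsets count bytes, the slicing has to happen on the raw byte array. If the whole record is decoded into a string first, every multi-byte character shifts the later fields, and the same offsets land in the wrong place.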
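The same trap exists outside C#. As a loose Python analogy (these are not the BinaryReader calls themselves), a text-oriented reader decodes the stream before you ever see it, while a byte-oriented read hands back the data untouched. The helper names below are made up for this sketch.

```python
import io

def read_as_text(raw: bytes) -> str:
    """Read through a decoding layer, as a character-oriented reader would."""
    reader = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors="replace")
    return reader.read()

def read_as_bytes(raw: bytes) -> bytes:
    """Read the stream directly, with no decoding applied."""
    return io.BytesIO(raw).read()

marc8 = bytes([0xE3, 0x6F])  # MARC-8: circumflex combining byte + "o"

# The lone 0xE3 is not valid UTF-8, so the decoding layer substitutes the
# replacement character U+FFFD and the original byte value is lost.
print(repr(read_as_text(marc8)))
# The byte-oriented read keeps 0xE3 intact for a later MARC-8 conversion.
print(read_as_bytes(marc8))
```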