Character Encoding in MARC Records (MARC-8 and Unicode)

Introduction

I’ve been doing quite a bit of work with character encoding in MARC records. While MARC records are serialized as MARC21 (if not as XML), there is generally a choice of character encoding between MARC-8 and Unicode. So, what is MARC-8?

MARC-8 Encoding

Before there was Unicode, librarians already had to deal with the large number of non-English characters in their records. Out of this need, the MARC-8 encoding standard was created in 1968. It continues to be supported in library systems and is still maintained by the Library of Congress.

The first 128 standard ASCII characters remain unchanged by this character encoding, so the intricacies of the encoding scheme can mostly be ignored. However, once diacritics appear in a field, a naive conversion maps them to Unicode quite strangely. This is because MARC-8 uses ‘combining characters’. A combining character is placed before a standard character and indicates that a diacritic should be added to it; more than one combining character may precede the same base character.

For example, the byte 0x6F (decimal 111) maps to “o” in both ASCII and MARC-8. If you prefix it with a combining character such as 0xE3 (decimal 227, the circumflex), the pair maps to “ô”. Put the same combining character in front of “z” and you get “ẑ”.
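The combining-character rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SobekCM library's code: the function name `marc8_to_unicode` is hypothetical, the table covers just a handful of MARC-8 diacritics, and alternate character sets are ignored. The key step is the reordering: MARC-8 puts the diacritic before the base letter, while Unicode puts the combining mark after it.

```python
import unicodedata

# Partial, illustrative table: MARC-8 combining byte -> Unicode combining mark.
MARC8_COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis (umlaut)
}


def marc8_to_unicode(data: bytes) -> str:
    """Convert ASCII plus the combining bytes above into a Unicode string."""
    out = []
    pending = []  # diacritics seen but not yet attached to a base character
    for b in data:
        if b in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[b])
        elif b < 0x80:
            out.append(chr(b))   # base character first...
            out.extend(pending)  # ...then the marks that preceded it in MARC-8
            pending.clear()
        else:
            raise ValueError(f"byte 0x{b:02X} is not covered by this sketch")
    if pending:
        raise ValueError("combining character with no base character")
    # NFC folds base + mark into a single precomposed character where one exists.
    return unicodedata.normalize("NFC", "".join(out))


print(marc8_to_unicode(bytes([0xE3, 0x6F])))  # circumflex + "o"
print(marc8_to_unicode(bytes([0xE3, 0x7A])))  # circumflex + "z"
```

By contrast, decoding the same two bytes 0xE3 0x7A as Latin-1, with no translation step, yields “ãz”.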

Unfortunately, if you don’t plan for encoding translation, the combining character will be treated as an unreadable character, followed by a plain “z”. So, saving MARC-8 data to Unicode systems without translation results in a scramble of letters.

The MARC-8 standard also allows different character sets to be referenced. These are essentially code pages which are loaded for quite different alphabets.

Encoding in a MARC21 Record

So, how do you know which character encoding a MARC21 record uses? By inspecting position 09 of the leader (positions are counted from zero, so this is the tenth character). If position 09 is a space (often represented as a pound sign on sites which explain MARC21 leaders), the record uses MARC-8 character encoding. If it is an ‘a’, the record is encoded as “UCS/Unicode”. This almost always means UTF-8 encoding.
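As a rough sketch of that check in Python (the function name `detect_encoding` and the sample leaders are made up for illustration), the leader is the first 24 bytes of the record and the flag sits at index 9 when counting from zero:

```python
def detect_encoding(record: bytes) -> str:
    """Report the character coding scheme named in leader position 09."""
    if len(record) < 24:
        raise ValueError("a MARC21 leader is 24 bytes long")
    flag = record[9:10]  # zero-based index 9
    if flag == b" ":
        return "MARC-8"
    if flag == b"a":
        return "UCS/Unicode"
    raise ValueError(f"unexpected coding scheme flag: {flag!r}")


# Two hypothetical leaders that differ only at position 09.
unicode_leader = b"00714cam a2200205 a 4500"
marc8_leader = b"00714cam  2200205 a 4500"

print(detect_encoding(unicode_leader))
print(detect_encoding(marc8_leader))
```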

If alternate character sets are used, then the sets are referenced in the 066 field within a MARC21 record.

SobekCM MARC Library (C#)

The newest version of the SobekCM MARC Library includes the most common combining characters and a mapping from MARC-8 to Unicode/UTF-8, although alternate character sets are still not supported. Libraries which deal heavily in alternate character sets are more likely to be aware of character encoding issues and to export in Unicode encoding. What is more common is to have just a smattering of characters which decode incorrectly.

A quick survey of some other solutions found that it is fairly common to have no handling for characters above the ASCII range at all, let alone any translation. That is because strings are Unicode strings in most modern languages. So, any solution which relies heavily on strings will decode the characters incorrectly, and may fail completely, because the directory in a MARC21 record addresses each field by byte offset and byte length.
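To make the directory problem concrete, here is a small Python sketch (the helper names `build_record` and `read_fields` are hypothetical, and the record is a toy with two fields). It follows the MARC21 layout: a 24-byte leader, 12-byte directory entries (tag, field length, starting position), and data addressed relative to the base address stored in leader positions 12 to 16. Slicing bytes and decoding afterwards works; decoding first and then applying the byte offsets to characters drifts as soon as a multi-byte character appears.

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator


def build_record(fields):
    """Assemble a minimal MARC21 record from (tag, body_bytes) pairs (demo helper)."""
    directory, data = b"", b""
    for tag, body in fields:
        body += FIELD_END
        # Directory entry: tag (3), field length in bytes (4), start offset in bytes (5).
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    directory += FIELD_END
    base = 24 + len(directory)    # base address of data
    total = base + len(data) + 1  # +1 for the record terminator
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + data + RECORD_END


def read_fields(record: bytes):
    """Return (tag, raw_field_bytes) pairs by following the directory."""
    base = int(record[12:17])
    directory = record[24:base - 1]  # exclude the directory's own terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        fields.append((tag, record[base + start:base + start + length - 1]))
    return fields


record = build_record([
    ("245", "10\x1faC\u00f4te".encode("utf-8")),  # "ô" is one character but two bytes
    ("260", "  \x1faParis".encode("utf-8")),
])

# Correct: slice bytes first, decode each field afterwards.
by_bytes = [(tag, raw.decode("utf-8")) for tag, raw in read_fields(record)]

# Incorrect: decode the whole record, then apply the byte offsets to characters.
text = record.decode("utf-8")
base = int(text[12:17])
entry = text[36:48]  # second directory entry
length, start = int(entry[3:7]), int(entry[7:12])
by_chars = text[base + start:base + start + length - 1]

print(by_bytes)
print(repr(by_chars))  # shifted by one character relative to the real field
```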

In this library, to add full support for character encoding, all strings and StreamReader classes had to be replaced with byte arrays (or byte-based MemoryStreams) and the BinaryReader class. Another interesting quirk of C# is that the BinaryReader.Read() method actually reads the underlying stream as characters! So, this still failed. It was only when the BinaryReader.ReadByte() method was used that the true byte stream (untouched by any encoding) was accessed.

My solution takes a mapping for combining characters and implements it in code, so incoming records which are MARC-8 encoded are transformed, during the parsing process, into correctly encoded Unicode strings. This provides support for the combining-character portion of MARC-8.

Posted in Uncategorized

C# Marc Library Released

I recently released the code used for working with MARC records within many of the open-source projects I have worked on. This library has been developed over the last six years and is used in several open-source projects and within the University of Florida Digital Library Center workflow applications. The term ‘SobekCM’ references the two open-source applications which have been previously released, but this library should be usable across many different applications. I am releasing it as open source here because we are planning to use this library more widely in other upcoming projects the UF Libraries are involved in.

This is a C# library which contains classes for working in memory with MARC records. Records can be read from MarcXML and Marc21 formats. Once in memory, any field or subfield can be edited, added, or deleted. Then the record can be queried or saved again in either MarcXML or Marc21 file format.

I plan to complete the Z39.50 portion of this library within the next couple of weeks and perhaps add a simple interface to the MARC record which allows you to query for simple Dublin Core values (e.g., title, creator) without a full understanding of the MARC record format. I would also like to add better error handling.

I find that this library works best with either MarcXML or Marc21 that is encoded with Unicode character encoding (rather than MARC-8 character encoding).

This library can be downloaded at: https://sourceforge.net/projects/marclibrary/.

Other Implementations

As I was preparing this for release, I became aware of another C# implementation of the MARC record (available at http://sourceforge.net/projects/csharpmarc/). Comparing the two implementations showed that there is really only one natural way to model a MARC record, as our data objects are quite similar. I encourage you to look at the other implementation, as well as this one.

There are several differences in our current implementations:

  • In some ways I like his data model of separating the control and data fields into separate classes which both inherit from an abstract field class. I actually began my implementation this way, but decided against the constant casting and having to look in two different collections which this entailed.
  • The other implementation has better error handling and retains a collection of warnings discovered during decoding. I like this approach, and may add it to my implementation, which currently throws exceptions when an error is caught.
  • The other implementation has better unit testing built in.
  • Memory management in this implementation is superior, as more of the reading is handled as streams, rather than loading the entire Marc21 file into memory and working on all the records from a list. This probably also influenced the way errors and warnings are handled, since in this implementation you generally step through the file record by record.
  • This implementation uses LINQ, both as LINQ to XML and for querying within records.
  • Memory management and speed are enhanced in this implementation by more widespread use of the StringBuilder class, rather than concatenating separate strings, which is a fairly costly procedure when performed repeatedly.
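The record-by-record streaming mentioned in the list above can be illustrated with a short Python sketch. It is not the code of either C# library; the function name `iter_records` is hypothetical. It relies only on the fact that every Marc21 record begins with its own total length as five ASCII digits, so a reader can pull one record at a time without loading the whole file.

```python
import io
from typing import BinaryIO, Iterator


def iter_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield raw MARC21 records one at a time from a binary stream."""
    while True:
        head = stream.read(5)  # record length: five ASCII digits
        if not head:
            return             # clean end of file
        if len(head) < 5 or not head.isdigit():
            raise ValueError(f"bad record length field: {head!r}")
        length = int(head)
        rest = stream.read(length - 5)
        if len(rest) != length - 5:
            raise ValueError("truncated record")
        yield head + rest


# Two toy "records": only the length prefix and terminator are realistic here.
first = b"00010abcd\x1d"
second = b"00008ab\x1d"
for raw in iter_records(io.BytesIO(first + second)):
    print(len(raw), raw)
```

Because the function is a generator, a caller that hits a bad record can report it and decide whether to stop, which fits the step-through style of error handling described above.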

Again, I encourage you to look at the other implementation.

Posted in Uncategorized

Handling time in stored procedures

Below are a couple of time-related features I have recently been using in SQL Server 2008.

Execution duration
There is nothing better than looking at the execution time of stored procedures, and of portions of stored procedures, to determine what could use some refactoring. I keep having to search for the same posts about timing a chunk of SQL code, so I have included an example below.

Waitfor command
Also, the waitfor command is a new one I wasn’t familiar with. You can wait for a particular delay, or until an actual point in time. It is particularly useful for long-running procedures which need to pause and allow other processes access. It pauses the current transaction and allows others to grab some CPU cycles. I recently used it in a large procedure to change the search index database design. I had a cursor stepping through 320,000 items and then a cursor stepping through all the metadata associated with each of those items, for more than 10 million discrete metadata pairs. It worked wonders, indexing everything in five hours while not blocking any requests.

Example:

-- Declare and record the start time
-- (datetime rather than time, so the difference stays correct across midnight)
declare @starttime datetime;
set @starttime = GETDATE();

-- Let's just wait for five seconds
waitfor delay '00:00:05';

-- Declare and record the end time
declare @endtime datetime;
set @endtime = GETDATE();

-- Output the difference
select DATEDIFF( ms, @starttime, @endtime) as [duration in milliseconds];

Posted in SQL Tricks