Posted on 2007-05-15 07:41:47-07 by pdi
Warning: Malformed UTF-8 character(s)
This is possibly an odd case, but I would certainly appreciate any help.

I prepare an ARGFILE, in utf-8, with XMP tags like -xmp-dc:title=. Exiftool writes the data correctly to jpegs.

I then (1) change the tags to the corresponding IPTC ones, e.g. -xmp-dc:title= to -iptc:ObjectName=, and (2) use iconv to convert the ARGFILE encoding from utf-8 to iso-8859-7 (Greek), as most programs do not read utf-8 IPTC. The resulting file is read correctly by text editors.

However, when exiftool tries to write the data to jpegs it returns "Warning: Malformed UTF-8 character(s)". The cause seems to be the Greek characters.
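For readers following along, here is a minimal Python sketch of why the warning appears. It is not ExifTool code, and the sample word is an assumption chosen only for illustration. It mimics the iconv step (utf-8 to iso-8859-7) and then checks whether the resulting bytes can still be read as UTF-8.

```python
# Sample Greek title (hypothetical value, used only for illustration).
title = "Αθήνα"

# The original ARGFILE value, encoded as UTF-8.
utf8_bytes = title.encode("utf-8")

# Equivalent of: iconv -f utf-8 -t iso-8859-7
greek_bytes = title.encode("iso-8859-7")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(utf8_bytes))   # the UTF-8 ARGFILE value
print(is_valid_utf8(greek_bytes))  # the converted ARGFILE value
```

The single-byte Greek codes are high bytes that do not form valid UTF-8 sequences, so any program that expects UTF-8 input will report them as malformed.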

Both exiftool and iconv are well established, so perhaps I do something out of place. But if not, is there a way exiftool can accept the iconv output? Or is there another standard encoding conversion tool that exiftool is happy with?

Many thanks in advance,
pdi
Direct Responses: 5135
Posted on 2007-05-15 10:53:20-07 by pdi in response to 5133
Re: Warning: Malformed UTF-8 character(s)
After some trials I found to my surprise that, irrespective of iconv, the same error occurred with both entirely new txt iso-8859-7 files and old ARGFILES in the same encoding from about a year ago which worked perfectly.

Preliminary findings point to an important change in exiftool in ver. 6.70 in the treatment of encoded characters. I am still trying to understand its implications. From a first reading it seems to cater either for utf-8 or cp1252. What about cp1253 (iso-8859-7)?

Regards,
pdi
Direct Responses: 5136
Posted on 2007-05-15 11:31:15-07 by exiftool in response to 5135
Re: Warning: Malformed UTF-8 character(s)
Yes, ExifTool now translates coded characters for IPTC. See FAQ #10 for details.

You can use the -L option when writing IPTC if you want to disable translation of special characters.

- Phil
Direct Responses: 5138
Posted on 2007-05-15 12:04:31-07 by pdi in response to 5136
Re: Warning: Malformed UTF-8 character(s)
Phil,

Thank you for your reply. I was confused by the mention only of cp1252, but when I tried the -L option the result was correct. I'm not sure I understand it, but I'm glad it works.

Regards,
pdi
Direct Responses: 5140
Posted on 2007-05-15 12:16:10-07 by exiftool in response to 5138
Re: Warning: Malformed UTF-8 character(s)
This works because 1) ExifTool assumes IPTC in the file is coded in Latin1 unless the recorded CodedCharacterSet is "ESC % G" (UTF8), and 2) the -L option specifies the external character set as Latin1.

When the recorded character set is the same as the external character set, no translation is performed.

I hope this makes a bit more sense now. :)

- Phil
Direct Responses: 5141
Posted on 2007-05-15 12:46:37-07 by pdi in response to 5140
Re: Warning: Malformed UTF-8 character(s)
Phil,

I'm afraid I was not very clear about what I don't understand. Encodings and translations are terrain only partly familiar to me. So I wonder how it all works when -L declares the txt file character set to be Latin1 (cp1252) while the file's actual character set is Greek (cp1253). To be more exact, various text editors recognize the file as ANSI, but the underlying code page in Windows for Greek is cp1253. So exiftool is told to write cp1252 and writes in fluent cp1253 :-) It suits me fine, but I'd rather understand it than not :-)

Regards,
pdi
Direct Responses: 5143
Posted on 2007-05-15 13:06:24-07 by exiftool in response to 5141
Re: Warning: Malformed UTF-8 character(s)
I understood your confusion, but I guess you didn't understand my explanation.

It is really fairly simple. You give ExifTool a string of bytes and tell it what character encoding was used. As long as ExifTool thinks that the internal and external character sets are the same, then no translation is performed and the bytes are passed through unchanged. (This is the behaviour of older ExifTool versions for IPTC information.)

As long as ExifTool is not translating the text, it is totally irrelevant what character set is actually used since the bytes are passed through unchanged. So as long as ExifTool believes there is no need to translate the text, you are free to use whatever character set you like.
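The pass-through rule described above can be sketched in Python roughly as follows. This is a hypothetical model, not ExifTool's actual implementation: the function name `translate`, its parameters, and the sample word are all assumptions made for illustration, and Python's `cp1252` codec stands in for Latin1.

```python
import codecs


def translate(data: bytes, external: str, internal: str) -> bytes:
    """Hypothetical model of the rule above; not ExifTool's code."""
    # Normalise codec names so aliases such as "utf8" and "UTF-8" compare equal.
    if codecs.lookup(external).name == codecs.lookup(internal).name:
        # Same declared character set on both sides: bytes pass through untouched.
        return data
    # Different declared character sets: decode as external, re-encode as internal.
    return data.decode(external).encode(internal)


# Greek text stored as single-byte iso-8859-7 (sample word is illustrative).
greek = "Αθήνα".encode("iso-8859-7")

# Both sides declared Latin1 (the -L case): the Greek bytes survive unchanged,
# even though they are not really Latin1.
passed_through = translate(greek, "cp1252", "cp1252")

# External side declared UTF-8: translation is attempted and decoding fails.
try:
    translate(greek, "utf-8", "cp1252")
    failed = False
except UnicodeDecodeError:
    failed = True

print(passed_through == greek, failed)
```

In this model the declared character sets only decide whether translation happens at all; when they match, the actual content of the bytes is never inspected.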

I can see how this could be confusing.

If possible, it is best to use UTF8 to avoid this confusion.

- Phil
Direct Responses: 5146
Posted on 2007-05-15 16:45:16-07 by pdi in response to 5143
Re: Warning: Malformed UTF-8 character(s)
Phil,

I appreciate your patience with my dim wits :-) All is much clearer now.

"As long as ExifTool is not translating the text, it is totally irrelevant what character set is actually used since the bytes are passed through unchanged."

Perhaps you might include some similar note in FAQ #10, to make it clearer we are not limited only to cp1252.

I am writing IPTC data to a jpg which has no previous IPTC data, only XMP; so my guess is that ExifTool handles the case of no internal data the same as if IPTC data already existed in the same character set as the external data.

Unfortunately, many IPTC tools cannot handle the notorious "ESC % G" sequence and fail to display utf-8 properly. I was very surprised to see the change in the default behaviour of ExifTool, but I am sure you had very sound reasons for it. It must be that the tide is turning :-)

Regards,
pdi
Direct Responses: 5147
Posted on 2007-05-15 17:43:00-07 by exiftool in response to 5146
Re: Warning: Malformed UTF-8 character(s)
I'm glad it makes a bit more sense now.

When writing information, ExifTool uses the value of CodedCharacterSet to determine how to encode the text. If CodedCharacterSet is being written at the same time as text, the new character set is used. If no CodedCharacterSet exists and none is written, then Latin1 is assumed.
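As a rough illustration of that rule, here is a small Python sketch. It is an assumed model rather than ExifTool's real logic: the function name `iptc_encoding` and its parameters are hypothetical, `cp1252` stands in for Latin1, and any CodedCharacterSet value other than "ESC % G" is treated as Latin1, following the earlier description in this thread.

```python
from typing import Optional

# "ESC % G" is the ISO 2022 escape sequence that marks UTF-8.
UTF8_MARKER = "\x1b%G"


def iptc_encoding(recorded: Optional[str], written: Optional[str]) -> str:
    """Pick the encoding for IPTC text (hypothetical model, not ExifTool's code).

    recorded: CodedCharacterSet already in the file, or None if absent.
    written:  CodedCharacterSet being written together with the text, or None.
    """
    # A CodedCharacterSet written at the same time takes precedence.
    effective = written if written is not None else recorded
    # Only the UTF-8 marker selects UTF-8; everything else, including
    # a missing value, falls back to Latin1.
    return "utf-8" if effective == UTF8_MARKER else "cp1252"


print(iptc_encoding(None, None))         # no CodedCharacterSet anywhere
print(iptc_encoding(None, UTF8_MARKER))  # written together with the text
print(iptc_encoding(UTF8_MARKER, None))  # already recorded in the file
```

The first call corresponds to the case discussed here: a jpg with no existing IPTC and no CodedCharacterSet being written, so Latin1 is assumed.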

The special character handling in IPTC is a real mess. The way ExifTool originally handled it (by never translating) was simplest, but it seems that other applications most commonly assume Latin1 characters (contrary to the actual IPTC specification) so ExifTool was displaying special characters written by these applications incorrectly. This is the reason for the change.

If enough people have problems with this, I am open to changing it back again.

It is a pity that not many applications support UTF8 in IPTC, because this is the best solution. The original IPTC specification used ISO 2022, which is a real can of worms and hence isn't well supported either, but UTF8 support was added as a revision to the IPTC specification (I believe), and is a much better solution.

- Phil
Direct Responses: 5233
Posted on 2007-05-25 19:18:08-07 by exiftool in response to 5147
Re: Warning: Malformed UTF-8 character(s)
For reference, here is the thread which prompted the change in handling of special characters in IPTC.

- Phil