Discussion:
Converting broken German characters from ??? to ä, ö, ü etc.
Tuxedo
2013-03-25 23:18:10 UTC
I use wget to fetch html files which always include some German characters,
but after fetching, they display incorrectly in various applications that
normally display German characters fine. The broken characters represent:
ä a umlaut, small
Ä A umlaut, capital
ö o umlaut, small
Ö O umlaut, capital
ü u umlaut, small
Ü U umlaut, capital
ß sharp s

I'm not sure what happens, but depending on the application they're opened
in afterwards they may show as Ã¶, Â, Ã¼, Ã¤Ã etc. These broken characters
have surely been created via a web browser, possibly in a UTF-8 mode.

Normally when I've encountered similar problems I've worked around them with
a process like:
perl -pi -e "s/%C3%BC/ü/g;" *.html

However, this doesn't match in this case and I'm not sure what to convert
the characters from. While the fetched files are fully readable in a UTF-8
editor such as Yudit, they are not in other editors that normally handle
German characters but do not support the full UTF-8 charset. Any ideas how
to replace the Ã¶, Â, Ã¼ or whatever they are with the more widely adopted
extended German ASCII charset?

Many thanks for any tips!
Tuxedo
Janis Papanagnou
2013-03-25 23:43:39 UTC
Post by Tuxedo
I use wget to fetch html files which always include some German characters,
but after fetching, they display incorrectly in various applications that
ä a umlaut, small
Ä A umlaut, capital
ö o umlaut, small
Ö O umlaut, capital
ü u umlaut, small
Ü U umlaut, capital
ß sharp s
I'm not sure what happens, but depending on the application they're opened
in afterwards they may show as Ã¶, Â, Ã¼, Ã¤Ã etc. These broken characters
have surely been created via a web browser, possibly in a UTF-8 mode.
Normally when I've encountered similar problems I've worked around them with a
perl -pi -e "s/%C3%BC/ü/g;" *.html
However, this doesn't match in this case and I'm not sure what to convert
the characters from. While the fetched files are fully readable in a UTF-8
editor such as Yudit, they are not in other editors that normally handle
German characters but do not support the full UTF-8 charset. Any ideas how
to replace the Ã¶, Â, Ã¼ or whatever they are with the more widely adopted
extended German ASCII charset?
Either inspect the HTML source to see what character encoding it declares,
or pass the byte sequence of the selected characters into od(1) to see
where it may come from. Associate the bytes with the expected German
characters to find out the respective mapping.
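
For example, assuming a utf-8 locale, the bytes behind a character can be
dumped like this:

printf 'ü' | od -An -tx1

which prints c3 bc, the utf-8 encoding of ü; piping the suspicious
characters from your file through the same command reveals their actual
byte sequences.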

Besides: what do you think an "extended German ASCII charset" is?
Hint: there are many extensions to ASCII that contain German characters.

Then use iconv(1) to convert between well-defined character sets.

Janis
Post by Tuxedo
Many thanks for any tips!
Tuxedo
Tuxedo
2013-03-26 06:52:27 UTC
Janis Papanagnou wrote:

[...]
Post by Janis Papanagnou
Either inspect the HTML source to see what character encoding it declares,
or pass the byte sequence of the selected characters into od(1) to see
where it may come from. Associate the bytes with the expected German
characters to find out the respective mapping.
I'm not quite sure how to use od and find the mapping, but in the Yudit
editor, the German ü character is reported to have Glyph Info 00FC [0075
0308], in case that means anything.
Post by Janis Papanagnou
Besides: what do you think an "extended German ASCII charset" is?
Hint: there are many extensions to ASCII that contain German characters.
I guess ISO 8859-1 is one suitable extension.
Post by Janis Papanagnou
Then use iconv(1) to convert between well-defined character sets.
This seems to work:

iconv -c -f utf-8 -t 8859_1 oldfile.html > newfile.html

The -c is there because there appear to be alien characters in the files
that display in no known environment and which serve no intended or useful
purpose.

Many thanks for the iconv tip - I did not know this fine utility before.

Tuxedo
Tuxedo
2013-03-26 15:21:28 UTC
[...]
Post by Janis Papanagnou
Then use iconv(1) to convert between well-defined character sets.
[...]

If original.txt contains various characters such as ä, Ä, ö, Ö, ü, Ü and ß
in utf-8 encoding, which do not necessarily require utf-8, the following
will convert them to iso-8859-1:

iconv -c -f utf-8 -t iso-8859-1 original.txt > converted.txt

If, however, the same procedure is thereafter run on the converted file,
for example:

iconv -c -f utf-8 -t iso-8859-1 converted.txt > second_conversion.txt

... the particular ä, Ä, ö, Ö, ü, Ü, ß characters are stripped altogether
from the resulting second_conversion.txt file that has been processed a
second time. Why is this, and how can it be prevented in case the procedure
is run against an already converted file?

Many thanks for any tips.
Tuxedo
Hermann Peifer
2013-03-26 16:02:00 UTC
Post by Tuxedo
[...]
Post by Janis Papanagnou
Then use iconv(1) to convert between well-defined character sets.
[...]
If original.txt contains various characters such as ä, Ä, ö, Ö, ü, Ü and ß
in utf-8 encoding, which do not necessarily require utf-8, the following
iconv -c -f utf-8 -t iso-8859-1 original.txt > converted.txt
If, however, the same procedure is thereafter run on the converted file, for
iconv -c -f utf-8 -t iso-8859-1 converted.txt > second_conversion.txt
... the particular ä, Ä, ö, Ö, ü, Ü, ß characters are stripped altogether
from the resulting second_conversion.txt file that has been processed a
second time. Why is this, and how can it be prevented in case the procedure
is run against an already converted file?
You know that converted.txt is iso-8859-1 encoded, but you tell iconv
that the file is utf-8 encoded and that it should silently discard
characters that cannot be converted (by using -c). This is why the
umlauts silently disappear.
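
A quick sanity check before converting, assuming GNU file(1) is available,
is:

file -bi converted.txt

which reports something like text/plain; charset=iso-8859-1, i.e. the
encoding that iconv should be told to read.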

Hermann
Tuxedo
2013-03-26 18:11:48 UTC
Post by Hermann Peifer
Post by Tuxedo
[...]
Post by Janis Papanagnou
Then use iconv(1) to convert between well-defined character sets.
[...]
If original.txt contains various characters such as ä, Ä, ö, Ö, ü, Ü and
ß in utf-8 encoding, which do not necessarily require utf-8, the
iconv -c -f utf-8 -t iso-8859-1 original.txt > converted.txt
If, however, the same procedure is thereafter run on the converted file,
iconv -c -f utf-8 -t iso-8859-1 converted.txt > second_conversion.txt
... the particular ä, Ä, ö, Ö, ü, Ü, ß characters are stripped
altogether from the resulting second_conversion.txt file that has been
processed a second time. Why is this, and how can it be prevented in case
the procedure is run against an already converted file?
You know that converted.txt is iso-8859-1 encoded, but you tell iconv
that the file is utf-8 encoded and that it should silently discard
characters that cannot be converted (by using -c). This is why the
umlauts silently disappear.
Hermann
Thanks for pointing this out.

The problem I found is that unless -c is used in the first conversion an
error such as "illegal input sequence at position 2997" occurs.

I guess it's because some odd character at that position is not understood.
So when I don't use -c, the conversion process stops and the resulting file
is truncated at that byte position, in case of a file that happens to
contain something odd. When using -c in the first conversion, the process
appears to work, in that all instances of the umlauts are converted as
intended.

What I don't understand is why, in the second conversion, the umlauts are
simply blanked out, especially as they are no longer utf-8, having already
been converted to iso-8859-1 in the first step. Why would the second
process affect only those characters but not others? Only the converted
umlauts appear to be blanked, although perhaps some other odd characters I
missed were too. Ideally, any non-recognisable input should be discarded,
not reconverted or blanked.
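
The behaviour can be reproduced on a single byte (a sketch, assuming a
printf that understands \x escapes): 0xFC is ü in iso-8859-1 but is not a
valid utf-8 sequence, so with -c iconv silently drops it:

printf '\xfc\n' | iconv -c -f utf-8 -t iso-8859-1

prints nothing but the newline.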

Maybe it would be better to identify the individual ä, Ä, ö, Ö, ü, Ü and ß
and run a replacement procedure as a separate step for each, using a perl
snippet or tr.

Does anyone have a procedure to identify and convert these specific
characters to iso-8859-1 if they have been saved in utf-8 form?
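
A possible sketch, operating purely on bytes so that no encoding detection
is involved (each utf-8 sequence maps to its iso-8859-1 byte, covering
ä Ä ö Ö ü Ü ß in that order):

perl -pi -e 's/\xC3\xA4/\xE4/g; s/\xC3\x84/\xC4/g; s/\xC3\xB6/\xF6/g;
             s/\xC3\x96/\xD6/g; s/\xC3\xBC/\xFC/g; s/\xC3\x9C/\xDC/g;
             s/\xC3\x9F/\xDF/g' *.html

Unlike iconv -c, nothing unmatched is discarded, so other bytes pass
through untouched.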

Many thanks,
Tuxedo
Tuxedo
2013-03-26 18:49:31 UTC
Post by Tuxedo
The problem I found is that unless -c is used in the first conversion an
error such as "illegal input sequence at position 2997" occurs.
The odd character sits between some random html tags and doesn't display as
anything in any recent browser, while in an older browser it shows up as a
little square representing something illegible. Also, in the Yudit text
editor it shows as the same kind of unknown-square representation, and the
glyph info is reported as FEFF. If I try to copy it into this window the
result is: \ufeff. Maybe it's the only problem my iconv procedure would
stumble upon. Does anyone know how to regex and remove these from files?
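
Since U+FEFF is encoded in utf-8 as the three bytes EF BB BF, one
byte-level option is a substitution such as (a sketch):

perl -pi -e 's/\xEF\xBB\xBF//g' *.html

which removes the character wherever it occurs, whether as a leading BOM or
stray in the middle of a file.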

Many thanks,
Tuxedo
Janis Papanagnou
2013-03-26 19:21:06 UTC
Post by Tuxedo
Post by Tuxedo
The problem I found is that unless -c is used in the first conversion an
error such as "illegal input sequence at position 2997" occurs.
The odd character sits between some random html tags and doesn't display as
anything in any recent browser, while in an older browser it shows up as a
little square representing something illegible. Also, in the Yudit text
editor it shows as the same kind of unknown-square representation, and the
glyph info is reported as FEFF.
At the start of a file, such a byte sequence may be a BOM (byte order mark;
see http://en.wikipedia.org/wiki/Byte_order_mark for details).

Janis
Post by Tuxedo
If I try to copy it into this window the
result is: \ufeff. Maybe it's the only problem my iconv procedure would
stumble upon. Does anyone know how to regex and remove these from files?
Many thanks,
Tuxedo
Thomas 'PointedEars' Lahn
2013-03-26 21:25:19 UTC
Post by Janis Papanagnou
Post by Tuxedo
Post by Tuxedo
The problem I found is that unless -c is used in the first conversion an
error such as "illegal input sequence at position 2997" occurs.
The odd character sits between some random html tags and doesn't display as
anything in any recent browser, while in an older browser it shows up as a
little square representing something illegible. Also, in the Yudit text
editor it shows as the same kind of unknown-square representation, and the
glyph info is reported as FEFF.
At the start of a file, such a byte sequence may be a BOM (byte order mark;
see http://en.wikipedia.org/wiki/Byte_order_mark for details).
For crying out loud …

0. US-ASCII contains German characters already, as the German alphabet
shares *all* letters with the English alphabet (both use the modern Latin
alphabet). Characters with diacritics are _not_ specific to the German
language; the sharp-s ligature („Eszett“), however, is (except in Swiss
Standard German; but it is not part of the German alphabet in any case).

1. The garbled characters now displayed are Windows-1252 (“ISO-8859-1”)
representations of a UTF-8 code unit each for Unicode characters beyond
U+007F (including umlauts and the sharp-s), usually two code units (and
so, garbled characters) per encoded (real) Unicode character; see the
first demo after this list.

2. The actual problem here is most certainly that the original resources
were served with a “Content-Type: text/html; charset=UTF-8” HTTP header
field that takes precedence. When viewed locally (file://), that header
field is missing, and inline declarations of character encoding, if
present, take precedence: XML declaration, “meta” element, or BOM. In
that order (AFAIK). If all of those are missing, applications are left
to do non-standardized heuristics (sophisticated guesswork), which would
explain inconsistent results.

3. It is not necessary (and usually it is not successful) to convert Unicode
characters to a smaller character set like that of Windows-1252 (although
iconv(1) can do transliteration partially, that does not result in
proper German – e.g., the first names „Hans-Rüdiger“ and „Hansruedi“ are
pronounced differently; see the second demo after this list). Web
browsers and editors are fully capable of handling UTF-8 these days.

4. The encoding declaration of the resource should be fixed, either by using
a local Web server that sends the proper header field, or by putting a
proper encoding declaration in the files (which is now probably either
missing, or “Windows-1252” or “ISO-8859-1”, wrong of course).
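
A demonstration of points 1 and 3 (a sketch, assuming a utf-8 terminal and
GNU iconv). Re-encoding the two utf-8 bytes of ü as if they were iso-8859-1
characters produces exactly the kind of garbage quoted in the first post:

printf '\xc3\xbc' | iconv -f iso-8859-1 -t utf-8

prints Ã¼. And transliteration into a smaller set is lossy:

printf 'ü' | iconv -f utf-8 -t ascii//TRANSLIT

prints, depending on the locale, u or ue (or ? where no transliteration
exists).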
--
PointedEars, Web- and Unicode-savvy, born German

Twitter: @PointedEars2
Please do not Cc: me. / Bitte keine Kopien per E-Mail.
Cal Dershowitz
2013-03-29 08:22:41 UTC
[snip]
Post by Thomas 'PointedEars' Lahn
Post by Janis Papanagnou
At the start of a file, such a byte sequence may be a BOM (byte order mark;
see http://en.wikipedia.org/wiki/Byte_order_mark for details).
For crying out loud …
0. US-ASCII contains German characters already, as the German alphabet
shares *all* letters with the English alphabet (both use the modern Latin
alphabet). Characters with diacritics are _not_ specific to the German
language; the sharp-s ligature („Eszett“), however, is (except in Swiss
Standard German; but it is not part of the German alphabet in any case).
Also, you're welcome. BoM would mean something else to Americans.

Best Ishtar greetings,
--
Cal
Tuxedo
2013-03-26 21:25:29 UTC
Janis Papanagnou wrote:

[...]
Post by Janis Papanagnou
At the start of a file, such a byte sequence may be a BOM (byte order mark;
see http://en.wikipedia.org/wiki/Byte_order_mark for details).
It may be something generated by, and possibly used by, the CMS where it
originates. A totally useless character on my end that would best be
removed.

Thanks for the info.

Tuxedo

[...]
Martin Τrautmann
2013-03-26 06:08:31 UTC
Post by Tuxedo
I use wget to fetch html files which always include some German characters,
but after fetching, they display incorrectly in various applications that
ä a umlaut, small
Ä A umlaut, capital
ö o umlaut, small
Ö O umlaut, capital
ü u umlaut, small
Ü U umlaut, capital
ß sharp s
I'm not sure what happens, but depending on the application they're opened
in afterwards they may show as Ã¶, Â, Ã¼, Ã¤Ã etc. These broken characters
have surely been created via a web browser, possibly in a UTF-8 mode.
Why should it? When it is an html file, either the characters should be
html-encoded or a charset must be given. So the first question is: why is
the charset definition for an html file lost?

One answer might be that no charset was given, but the web server told the
web browser about the encoding. This is typically used for text files,
which do not include a charset definition of their own.
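
Whether the server supplies the encoding can be checked by letting wget
print the response headers (a sketch; the URL is a placeholder):

wget -S -O page.html http://example.com/page.html

The Content-Type line in the header output, e.g. 'Content-Type: text/html;
charset=UTF-8', shows what the browser was told.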
Post by Tuxedo
However, this doesn't match in this case and I'm not sure what to convert
the characters from. While the fetched files are fully readable in a UTF-8
editor such as Yudit, they are not in other editors that normally handle
German characters but do not support the full UTF-8 charset. Any ideas how
to replace the Ã¶, Â, Ã¼ or whatever they are with the more widely adopted
extended German ASCII charset?
What you read as Ã¶ actually is from an extended German charset already.
I guess what you are looking for is one very specific kind of extension,
such as iso-8859-15. What is the language setup of your system?

Personally, I use
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
LC_CTYPE=UTF-8

... and yours is DE.Latin-1?
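
The settings in effect can be listed by simply running:

locale

which prints LANG, LC_CTYPE and the other LC_* variables.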

Apart from iconv, you might be interested in recode:
recode latin9 $file

The editor vim is pretty good at guessing what the charset might be and
performs automatic conversion on edit.
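
Its detection order can be steered via the 'fileencodings' option, for
instance (a sketch):

:set fileencodings=ucs-bom,utf-8,latin1

so a file that is not valid utf-8 automatically falls back to latin1.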

- Martin
Cal Dershowitz
2013-03-29 07:53:21 UTC
Post by Tuxedo
I use wget to fetch html files which always include some German characters,
but after fetching, they display incorrectly in various applications that
ä a umlaut, small
Ä A umlaut, capital
ö o umlaut, small
Ö O umlaut, capital
ü u umlaut, small
Ü U umlaut, capital
ß sharp s
I'm not sure what happens, but depending on the application they're opened
in afterwards they may show as Ã¶, Â, Ã¼, Ã¤Ã etc. These broken characters
have surely been created via a web browser, possibly in a UTF-8 mode.
Normally when I've encountered similar problems I've worked around them with a
perl -pi -e "s/%C3%BC/ü/g;" *.html
However, this doesn't match in this case and I'm not sure what to convert
the characters from. While the fetched files are fully readable in a UTF-8
editor such as Yudit, they are not in other editors that normally handle
German characters but do not support the full UTF-8 charset. Any ideas how
to replace the Ã¶, Â, Ã¼ or whatever they are with the more widely adopted
extended German ASCII charset?
Many thanks for any tips!
Tuxedo
Beer here, beer here, or I'll fall over.
--
Cal

"soll das bier ..."