Perl detect encoding of file

There is no way to reliably detect the character set of a plain text file; you can only guess. That said, some tools guess very well: vim, for instance, does a great job, and I can't remember it ever misdetecting a file. Note that Windows-1252 is a superset of ISO-8859-1: HTML5, for instance, treats content labeled as ISO-8859-1 as Windows-1252 by default, to cope with incorrectly labeled Windows content. ISO-8859-1 assigns control characters to the 0x80-0x9F range (the C1 controls), but I have never seen them used, and I would have thought the existing 32 ASCII control characters were sufficient for any purpose.
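Since exact detection is impossible, the usual Perl approach is to guess from a list of suspect encodings, for example with the Encode::Guess module. A minimal sketch (the sample bytes are made up; with no extra suspects the module checks its defaults, such as ASCII and UTF-8):

```perl
use strict;
use warnings;
use Encode::Guess;   # layers guessing on top of the Encode module

my $octets = "caf\xc3\xa9";   # made-up sample: valid UTF-8 bytes

# guess_encoding returns a decoder object on success,
# or an error string if the data is ambiguous or unrecognized.
my $decoder = guess_encoding($octets);
if (ref $decoder) {
    my $chars = $decoder->decode($octets);
    print "guessed: ", $decoder->name, "\n";
} else {
    warn "could not guess: $decoder\n";
}
```

Keep the suspect list short and non-overlapping; since every byte sequence is valid Latin-1, adding Latin-1 as a suspect usually makes the guess ambiguous.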

Comments on this answer: "Perhaps some words about how this works?" / "Added some comments to the code." / "Comments only obfuscate code; a description in a parallel text document is far more useful." / "It works for me: it reads the file as raw bytes and converts them to the given character set."

Anything else is less reliable, but possible. I don't know what encodings you might be using, but this is the general idea: decode the raw bytes explicitly with the Encode module. With a CHECK value such as Encode::FB_QUIET, the data argument is overwritten with everything after the point where decoding stopped; that is, with the unprocessed portion of the data. This is handy when you have to call decode repeatedly because your source data may contain partial multi-byte character sequences, for example when you are reading with a fixed-width buffer.
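A minimal sketch of that repeated-decode pattern, assuming a UTF-8 stream that arrives in fixed-width chunks (the sample chunks are made up):

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# "café au lait" in UTF-8, split mid-character across two reads
my @chunks = ("caf\xc3", "\xa9 au lait");

my $buffer = '';
my $text   = '';
for my $chunk (@chunks) {
    $buffer .= $chunk;
    # FB_QUIET: decode as much as possible; the undecodable tail
    # (a partial multi-byte sequence) is left in $buffer for next time.
    $text .= decode('UTF-8', $buffer, FB_QUIET);
}
# $text is now the character string "café au lait"
```

After the first chunk, the lone `\xc3` stays in `$buffer`; the second chunk completes the sequence and decoding resumes cleanly.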

Here's some sample code to do exactly that. The CHECK argument controls how decoding errors are handled, which is handy when you are debugging. These modes are all actually set via a bitmask, so you can combine them with bitwise OR; for instance, if you don't want the source string modified in place, OR in the Encode::LEAVE_SRC bit (available since Encode version 2). Alternatively, CHECK can be a fallback code reference: a fallback for decode must return a decoded string (a sequence of characters) and takes ordinal values as its arguments. Instead of an encoding name you can also pass an encoding object; it should provide the interface described in Encode::Encoding.
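A sketch of the common CHECK modes, under the assumption of deliberately broken input (the sample bytes are made up):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK FB_DEFAULT LEAVE_SRC);

my $octets = "caf\xc3\xa9 \xff";   # valid UTF-8 followed by an invalid byte

# FB_DEFAULT: invalid bytes become U+FFFD REPLACEMENT CHARACTER
my $lenient = decode('UTF-8', $octets, FB_DEFAULT);

# FB_CROAK | LEAVE_SRC: die on invalid input, but leave $octets untouched
my $strict = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
print defined $strict ? "decoded\n" : "invalid input: $@";

# A fallback code reference: called with the ordinal value of the
# offending byte; the character string it returns is spliced in.
my $custom = decode('UTF-8', $octets, sub { sprintf '<%02X>', shift });
```

FB_DEFAULT is the forgiving mode; FB_CROAK is what you want when silent corruption is worse than an exception.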

Before the introduction of Unicode support in Perl, the eq operator just compared the byte strings represented by two scalars. Beginning with Perl 5.8, eq compares two strings as sequences of characters. To explain why it was made so, I quote from Programming Perl, 3rd ed.:

- Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.
- Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.
- Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.
- Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.

When Programming Perl, 3rd ed. was written, these goals had not yet been fully realized; Perl 5.8 largely achieved them. You can think of there being two fundamentally different kinds of strings and string operations in Perl: a byte-oriented mode for when the internal UTF8 flag is off, and a character-oriented mode for when the internal UTF8 flag is on.
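The difference between the two modes is easy to observe through length(); a minimal sketch:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xc3\xa9";          # the two UTF-8 octets for "é"
print length($bytes), "\n";      # 2: byte-oriented, counts octets

my $chars = decode('UTF-8', $bytes);
print length($chars), "\n";      # 1: character-oriented, counts characters
```

Same data, two answers: the scalar's mode, not its contents, determines what length() counts.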

This UTF8 flag is not visible in Perl scripts, for exactly the same reason you cannot (or rather, don't have to) see whether a scalar contains a string, an integer, or a floating-point number. But you can still peek and poke at it if you want; see the next section. The following API uses parts of Perl's internals in its current implementation; as such, it is efficient, but may change in a future release. The functions in question return true if successful, false otherwise, and are typically only necessary for debugging and testing.
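One such peek is Encode::is_utf8, which reports the state of the internal flag; a minimal sketch:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "caf\xc3\xa9";       # raw octets
print Encode::is_utf8($bytes) ? "on\n" : "off\n";   # off: byte string

my $chars = decode('UTF-8', $bytes);
print Encode::is_utf8($chars) ? "on\n" : "off\n";   # on: character string
```

As the surrounding text warns, treat this as a debugging aid, not something application logic should branch on.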

But beware: don't do this for package and module names; it might not work well there. Also consider that not everybody has a keyboard that allows easy typing of non-ASCII characters, so you make maintenance of your code much harder if you use them in your code.

You can use the following short script to test your terminal, locales, and fonts. It is very Euro-centric, but you should be able to modify it to use the character encodings that are normally used where you live. If you run this program in a terminal, only one line will be displayed correctly, and its first column is the character encoding of your terminal.
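The script itself is missing from this copy; a minimal sketch of what it might look like (the test string and the encoding list are assumptions, chosen for the European focus described above):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical test string with characters beyond ASCII
my $str = "R\x{e9}sum\x{e9} na\x{ef}ve \x{20ac}3";   # Résumé naïve €3

for my $enc (qw(UTF-8 ISO-8859-1 ISO-8859-15 CP1252)) {
    # Encode the same string in each encoding; only the line matching
    # your terminal's encoding will display correctly.
    my $octets = eval { encode($enc, $str, Encode::FB_CROAK) };
    printf "%-12s %s\n", $enc, defined $octets ? $octets : "(cannot encode)";
}
```

Note that the euro sign makes ISO-8859-1 fail outright, which is itself a useful diagnostic.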

A related warning is "Wide character in print". It means that you tried to use decoded string data in a context where only binary data makes sense, in this case printing it. You can make the warning go away by using an appropriate output layer, or by piping the offending string through Encode::encode first. Sometimes you want to check whether a string from an unknown source has already been decoded. Since Perl has no separate data types for binary strings and decoded strings, you cannot do that reliably.
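A minimal sketch of the two fixes, using a made-up string containing a code point above 255:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "Happy face: \x{263A}";   # WHITE SMILING FACE, U+263A

# Option 1: set an output layer once, then print characters directly
binmode STDOUT, ':encoding(UTF-8)';
print $str, "\n";                   # no "Wide character in print" warning

# Option 2: encode explicitly and print the resulting octets
# print encode('UTF-8', $str), "\n";
```

Pick one approach per filehandle; mixing an output layer with explicit encoding double-encodes the data.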

But there is a way to guess the answer, using the module Devel::Peek. There is a big caveat, though: just because the UTF8 flag isn't present doesn't mean the string hasn't been decoded. Perl uses either Latin-1 or UTF-8 internally to store strings, and the presence of this flag indicates which of the two is used. That also implies that if your program is written in pure Perl and has no XS components, it is almost certainly an error to rely on the presence or absence of that flag.
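A minimal sketch of that inspection (Dump writes its report to STDERR; look for UTF8 in the FLAGS line):

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode qw(decode);

my $bytes = "caf\xc3\xa9";
Dump($bytes);    # FLAGS do not include UTF8: stored as raw bytes

my $chars = decode('UTF-8', $bytes);
Dump($chars);    # FLAGS include UTF8: stored in Perl's internal UTF-8 form
```

The caveat above applies: the flag tells you how Perl happens to store the scalar, not whether your program has decoded it.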

You shouldn't care how perl stores its strings anyway. A common source of errors is buggy modules. The pragma encoding looks very tempting, but it is a frequent source of subtle bugs and has since been deprecated; avoid it. When you write a CGI script you have to choose a character encoding, print all your data in that encoding, and declare it in the HTTP headers. For most applications, UTF-8 is a good choice, since you can encode arbitrary Unicode code points with it, while English text (and that of most other European languages) is still encoded efficiently.
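A minimal sketch of a CGI-style response that keeps header and body consistent (the body text is a made-up example):

```perl
use strict;
use warnings;

my $body = "<html><body><p>Caf\x{e9} men\x{fc}</p></body></html>";

# Declare the charset in the header first (the header itself is ASCII)...
print "Content-Type: text/html; charset=UTF-8\r\n\r\n";

# ...then make sure the body is actually sent in that encoding.
binmode STDOUT, ':encoding(UTF-8)';
print $body;
```

The bug to avoid is declaring one encoding in the header and printing the body in another; browsers will trust the header.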

HTTP offers the Accept-Charset header, in which the client can tell the server which character encodings it can handle.

But if you stick to common encodings like UTF-8 or Latin-1, nearly all user agents will understand them, so it isn't really necessary to check that header. Older versions of CGI.pm (prior to the 3.x series) handled character encodings poorly, so you should not rely on its charset routine; explicitly decode the parameter strings yourself.
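A minimal sketch of decoding a parameter yourself, assuming the form was submitted as UTF-8 (the parameter name is made up):

```perl
use strict;
use warnings;
use CGI;
use Encode qw(decode);

# In a real script CGI->new reads the request; passing a query string
# here just makes the example self-contained.
my $q    = CGI->new('name=caf%C3%A9');
my $name = decode('UTF-8', scalar $q->param('name'));
# $name is now the character string "café"
```

The point is that param() hands you octets; decoding them is your job, done once, at the boundary.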

If you use a template system, you should take care to choose one that knows how to handle character encodings.

There are a plethora of Perl modules out there that handle text, so here are only a few notable ones, and what you have to do to make them Unicode-aware. With LWP, call decoded_content on the response object; that way the character encoding information sent in the HTTP response header is used to decode the body of the response. DBI leaves handling of character encodings to the DBD:: driver modules, so what you have to do depends on which database backend you are using.
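A minimal sketch of the decoded_content behaviour, using a hand-built HTTP::Response so no network access is needed (a real response would come from LWP::UserAgent):

```perl
use strict;
use warnings;
use HTTP::Response;

my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=ISO-8859-1' ],
    "caf\xe9",                      # "café" as Latin-1 octets
);

# decoded_content() reads the charset from the Content-Type header
# and returns a decoded character string.
my $text = $res->decoded_content;   # "café" as characters
```

Compare this with $res->content, which would hand you the raw Latin-1 octets unchanged.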

What most of them have in common is that UTF-8 is better supported than other encodings. With basic charset and Perl knowledge you can get quite far. For example, you can make a web application "Unicode safe", i.e. one that handles all possible user input correctly. But that's not all there is to know on the topic. For example, the Unicode standard allows different ways to compose some characters, so you need to "normalize" strings before you can compare them.
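A minimal sketch of the normalization problem and its fix with the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $composed   = "caf\x{e9}";    # é as a single code point
my $decomposed = "cafe\x{301}";  # e followed by COMBINING ACUTE ACCENT

# Visually identical, but eq compares code points:
print $composed eq $decomposed           ? "equal\n" : "different\n";
# Normalizing both to NFC makes the comparison behave as users expect:
print NFC($composed) eq NFC($decomposed) ? "equal\n" : "different\n";
```

NFC (canonical composition) is the usual choice for storage and comparison; NFD is its decomposed counterpart.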

You can read more about that in the Unicode normalization FAQ. To implement country-specific behaviour in programs, you should take a look at the locales system. Many programmers who are confronted with encoding issues first react with "But shouldn't it just work?". Yes, it should just work. But too many systems are broken by design regarding character sets and encodings.

A classical example is Internet Relay Chat (IRC), which specifies that a character is one byte, but not which character encoding is used. This worked well in the Latin-1 days, but was bound to fail as soon as people from different continents started to use it. Currently, many IRC clients try to autodetect character encodings and recode the text to what the user has configured.
