This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec and error handling lookup process.
It defines the following functions:
register(search_function)
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a CodecInfo object having the following attributes:
name The name of the encoding;
encode The stateless encoding function;
decode The stateless decoding function;
incrementalencoder An incremental encoder class or factory function;
incrementaldecoder An incremental decoder class or factory function;
streamwriter A stream writer class or factory function;
streamreader A stream reader class or factory function.
The various functions or classes take the following arguments:
encode and decode: These must be functions or methods
which have the same interface as the
encode()/decode() methods of Codec instances (see
Codec Interface). The functions/methods are expected to work in a
stateless mode.
incrementalencoder and incrementaldecoder: These have to be
factory functions providing the following interface:
factory(errors='strict')
The factory functions must return objects providing the interfaces
defined by the base classes IncrementalEncoder and
IncrementalDecoder, respectively. Incremental codecs can maintain
state.
streamreader and streamwriter: These have to be
factory functions providing the following interface:
factory(stream, errors='strict')
The factory functions must return objects providing the interfaces
defined by the base classes StreamReader and
StreamWriter, respectively. Stream codecs can maintain
state.
Possible values for errors are 'strict' (raise an exception
in case of an encoding error), 'replace' (replace malformed
data with a suitable replacement marker, such as "?"),
'ignore' (ignore malformed data and continue without further
notice), 'xmlcharrefreplace' (replace with the appropriate XML
character reference (for encoding only)) and 'backslashreplace'
(replace with backslashed escape sequences (for encoding only)) as
well as any other error handling name defined via
register_error().
In case a search function cannot find a given encoding, it should
return None.
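The registration contract above can be sketched as follows. This is a minimal illustration, assuming a current Python 3 interpreter (where encoders take text and return bytes); the encoding name demo_reversed and the helper functions are hypothetical and exist only for this example.

```python
import codecs

CODEC_NAME = 'demo_reversed'  # hypothetical encoding name, not a real codec

def demo_encode(text, errors='strict'):
    # Stateless encoder: returns (output, number of input items consumed).
    return text[::-1].encode('utf-8', errors), len(text)

def demo_decode(data, errors='strict'):
    # Stateless decoder with the mirror-image interface.
    return bytes(data).decode('utf-8', errors)[::-1], len(data)

def demo_search(encoding):
    # Return a CodecInfo for the one name we know, None for anything else.
    if encoding == CODEC_NAME:
        return codecs.CodecInfo(demo_encode, demo_decode, name=CODEC_NAME)
    return None

codecs.register(demo_search)

info = codecs.lookup(CODEC_NAME)
encoded, consumed = info.encode('abc')
decoded, _ = info.decode(encoded)
```

Only the stateless functions are supplied here; a fuller codec would also pass the incremental and stream factories to CodecInfo.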
lookup(encoding)
Looks up the codec info in the Python codec registry and returns a
CodecInfo object as defined above.
Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no CodecInfo
object is found, a LookupError is raised. Otherwise, the
CodecInfo object is stored in the cache and returned to the caller.
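A short sketch of the lookup behaviour described above, assuming a current Python 3 interpreter. The name demo_no_such_codec is a made-up encoding that is assumed not to be registered.

```python
import codecs

# Aliases and case differences are resolved to the same codec.
info = codecs.lookup('UTF8')
canonical_name = info.name

try:
    codecs.lookup('demo_no_such_codec')  # hypothetical, unregistered name
    lookup_failed = False
except LookupError:
    lookup_failed = True
```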
To simplify access to the various codecs, the module provides these
additional functions which use lookup() for the codec
lookup:
getencoder(encoding)
Look up the codec for the given encoding and return its encoder
function.
Raises a LookupError in case the encoding cannot be found.
getdecoder(encoding)
Look up the codec for the given encoding and return its decoder
function.
Raises a LookupError in case the encoding cannot be found.
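As an illustration of the two stateless accessors, the following sketch assumes a current Python 3 interpreter, where the encoder takes text and returns bytes. Each function returns a pair of the converted data and the amount of input consumed.

```python
import codecs

encode = codecs.getencoder('utf-8')
decode = codecs.getdecoder('utf-8')

# Both functions return (output, length of input consumed).
data, chars_consumed = encode('caf\u00e9')
text, bytes_consumed = decode(data)
```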
getincrementalencoder(encoding)
Look up the codec for the given encoding and return its incremental encoder
class or factory function.
Raises a LookupError in case the encoding cannot be found or the
codec doesn't support an incremental encoder.
New in version 2.5.
getincrementaldecoder(encoding)
Look up the codec for the given encoding and return its incremental decoder
class or factory function.
Raises a LookupError in case the encoding cannot be found or the
codec doesn't support an incremental decoder.
New in version 2.5.
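The point of the incremental interface is that the codec object keeps state between calls. A minimal sketch, assuming a current Python 3 interpreter, feeds the three UTF-8 bytes of the euro sign to a decoder one at a time:

```python
import codecs

euro = b'\xe2\x82\xac'  # UTF-8 encoding of U+20AC

# The factory returns an object that buffers incomplete input.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [
    decoder.decode(euro[:1]),
    decoder.decode(euro[1:2]),
    decoder.decode(euro[2:], final=True),
]

encoder = codecs.getincrementalencoder('utf-8')(errors='strict')
encoded = encoder.encode('\u20ac', final=True)
```

The first two calls return empty strings because the character is not yet complete; only the last call produces output.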
getreader(encoding)
Look up the codec for the given encoding and return its StreamReader
class or factory function.
Raises a LookupError in case the encoding cannot be found.
getwriter(encoding)
Look up the codec for the given encoding and return its StreamWriter
class or factory function.
Raises a LookupError in case the encoding cannot be found.
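The stream factories wrap an existing binary stream. The sketch below assumes a current Python 3 interpreter and uses an in-memory io.BytesIO object as a stand-in for a real file.

```python
import codecs
import io

raw = io.BytesIO()  # stands in for any binary stream

# factory(stream, errors='strict') returns a StreamWriter instance.
writer = codecs.getwriter('utf-8')(raw, errors='strict')
writer.write('na\u00efve\n')  # text in, encoded bytes out
stored = raw.getvalue()

# Rewind and read the same bytes back through a StreamReader.
raw.seek(0)
reader = codecs.getreader('utf-8')(raw)
text = reader.read()
```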
register_error(name, error_handler)
Register the error handling function error_handler under the
name name. error_handler will be called during encoding
and decoding in case of an error, when name is specified as the
errors parameter.
For encoding error_handler will be called with a
UnicodeEncodeError instance, which contains information about
the location of the error. The error handler must either raise this or
a different exception or return a tuple with a replacement for the
unencodable part of the input and a position where encoding should
continue. The encoder will encode the replacement and continue encoding
the original input at the specified position. Negative position values
will be treated as being relative to the end of the input string. If the
resulting position is out of bounds, an IndexError will be raised.
Decoding and translating work similarly, except that UnicodeDecodeError
or UnicodeTranslateError will be passed to the handler and
the replacement from the error handler will be put into the output
directly.
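A handler following the contract above might look like the sketch below. It assumes a current Python 3 interpreter; the handler name demo_underscore and the function itself are hypothetical. It handles encoding errors only and re-raises anything else.

```python
import codecs

def demo_underscore_errors(exc):
    # Replace each unencodable character with '_' and resume after it.
    if isinstance(exc, UnicodeEncodeError):
        bad_length = exc.end - exc.start
        return '_' * bad_length, exc.end
    # Decoding and translating errors are not handled in this sketch.
    raise exc

codecs.register_error('demo_underscore', demo_underscore_errors)

encoded = 'caf\u00e9 \u20ac5'.encode('ascii', 'demo_underscore')
```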
lookup_error(name)
Return the error handler previously registered under the name name.
Raises a LookupError in case the handler cannot be found.
strict_errors(exception)
Implements the strict error handling.
replace_errors(exception)
Implements the replace error handling.
ignore_errors(exception)
Implements the ignore error handling.
xmlcharrefreplace_errors(exception)
Implements the xmlcharrefreplace error handling.
backslashreplace_errors(exception)
Implements the backslashreplace error handling.
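To compare the built-in handlers side by side, the following sketch encodes one string to ASCII under each error name and also calls a looked-up handler directly. It assumes a current Python 3 interpreter; the exception passed to the handler is constructed by hand purely for illustration.

```python
import codecs

text = 'caf\u00e9'  # the last character cannot be encoded as ASCII

results = {
    name: text.encode('ascii', name)
    for name in ('replace', 'ignore', 'xmlcharrefreplace', 'backslashreplace')
}

try:
    text.encode('ascii', 'strict')
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# A handler can also be fetched by name and called with an exception object.
handler = codecs.lookup_error('replace')
exc = UnicodeEncodeError('ascii', text, 3, 4, 'illustrative error')
replacement, resume_at = handler(exc)
```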
To simplify working with encoded files or streams, the module
also defines these utility functions:
open(filename, mode[, encoding[, errors[, buffering]]])
Open an encoded file using the given mode and return
a wrapped version providing transparent encoding/decoding.
Note:
The wrapped version will only accept the object format
defined by the codecs, i.e. Unicode objects for most built-in
codecs. Output is also codec-dependent and will usually be Unicode as
well.
encoding specifies the encoding which is to be used for the
file.
errors may be given to define the error handling. It defaults
to 'strict' which causes a ValueError to be raised
in case an encoding error occurs.
buffering has the same meaning as for the built-in
open() function. It defaults to line buffered.
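A round trip through an encoded file could look like the sketch below. It assumes a current Python 3 interpreter, where the wrapped file accepts and returns text; the temporary path is created only for this example. On recent Python 3 releases the built-in open() with an encoding argument is generally the preferred tool, and codecs.open() may emit a deprecation warning there.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # throwaway location

# Text written to the wrapper is encoded transparently.
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write('gr\u00fc\u00df\n')

# Inspect the bytes that actually reached the disk.
with open(path, 'rb') as f:
    raw = f.read()

# Reading through the wrapper decodes them again.
with codecs.open(path, 'r', encoding='utf-8') as f:
    text = f.read()
```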
EncodedFile(file, input[, output[, errors]])
Return a wrapped version of file which provides transparent
encoding translation.
Strings written to the wrapped file are interpreted according to the
given input encoding and then written to the original file as
strings using the output encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.
If output is not given, it defaults to input.
errors may be given to define the error handling. It defaults to
'strict', which causes ValueError to be raised in case
an encoding error occurs.
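The translation performed by the wrapper can be sketched as follows, assuming a current Python 3 interpreter, where the caller-side data is bytes. Here the caller works in Latin-1 while the underlying in-memory stream holds UTF-8.

```python
import codecs
import io

raw = io.BytesIO()  # the "original file", stored as UTF-8

# Caller-side data is Latin-1; the file-side encoding is UTF-8.
wrapped = codecs.EncodedFile(raw, 'latin-1', 'utf-8')
wrapped.write(b'caf\xe9')
stored = raw.getvalue()

# Reading through a fresh wrapper translates back to Latin-1.
reread = codecs.EncodedFile(io.BytesIO(stored), 'latin-1', 'utf-8').read()
```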
iterencode(iterable, encoding[, errors])
Uses an incremental encoder to iteratively encode the input provided by
iterable. This function is a generator. errors (as well as
any other keyword argument) is passed through to the incremental encoder.
New in version 2.5.
iterdecode(iterable, encoding[, errors])
Uses an incremental decoder to iteratively decode the input provided by
iterable. This function is a generator. errors (as well as
any other keyword argument) is passed through to the incremental decoder.
New in version 2.5.
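Both generators can be exercised with small lists of chunks. The sketch assumes a current Python 3 interpreter; note how the decoder copes with a multi-byte character split across two input chunks.

```python
import codecs

text_chunks = ['caf', '\u00e9', ' \u20ac']
byte_chunks = list(codecs.iterencode(text_chunks, 'utf-8'))

# The euro sign is split across the first two chunks.
split_input = [b'\xe2', b'\x82\xac', b'!']
decoded_chunks = list(codecs.iterdecode(split_input, 'utf-8'))
```

Chunks that produce no output, such as the lone first byte of the euro sign, are simply not yielded.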
The module also provides the following constants which are useful
for reading and writing to platform dependent files:
- BOM
- BOM_BE
- BOM_LE
- BOM_UTF8
- BOM_UTF16
- BOM_UTF16_BE
- BOM_UTF16_LE
- BOM_UTF32
- BOM_UTF32_BE
- BOM_UTF32_LE
These constants define various encodings of the Unicode byte order mark
(BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order
used in the stream or file and in UTF-8 as a Unicode signature.
BOM_UTF16 is either BOM_UTF16_BE or
BOM_UTF16_LE, depending on the platform's native byte order.
BOM is an alias for BOM_UTF16, BOM_LE
for BOM_UTF16_LE, and BOM_BE for BOM_UTF16_BE.
The others represent the BOM in UTF-8 and UTF-32 encodings.
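One common use of these constants is guessing an encoding from the first bytes of a file. The helper below is a hypothetical sketch (sniff_bom is not part of the module), assuming a current Python 3 interpreter. The UTF-32 marks are tested first because the little-endian UTF-32 BOM begins with the same two bytes as the little-endian UTF-16 BOM.

```python
import codecs
import sys

# Longer marks first, so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(data):
    # Return (encoding name or None, data with any leading BOM removed).
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

# The native-order constant matches one of the explicit ones.
if sys.byteorder == 'little':
    native_utf16 = codecs.BOM_UTF16_LE
else:
    native_utf16 = codecs.BOM_UTF16_BE
```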
Release 2.5.2, documentation updated on 21st February, 2008.