parse

This module defines a standard interface to break Uniform Resource Locator (URL) strings up into components (addressing scheme, network location, path, etc.).

Direct copy from: parse.

parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')

Parse a query given as a string argument.

Parameters:

  • qs - percent-encoded query string to be parsed.

  • keep_blank_values - flag indicating whether blank values in percent-encoded queries should be treated as blank strings. A true value indicates that blanks should be retained as blank strings. The default false value indicates that blank values are to be ignored and treated as if they were not included.

  • strict_parsing - flag indicating what to do with parsing errors. If false (the default), errors are silently ignored. If true, errors raise a ValueError exception.

  • encoding and errors (optional) - specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method.

  • max_num_fields - int. If set, a ValueError is raised when more than max_num_fields fields are read.

  • separator - str. The symbol to use for separating the query arguments. Defaults to &.

Returns: dictionary mapping each unique query variable name to a list of its values.
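The flags described above can be sketched as follows. This is a minimal illustration that assumes this module behaves like the standard library's urllib.parse (imported here under the name parse); the query strings are made up for the example.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

qs = "a=1&a=2&b=&c=hello%20world"

# Blank values are dropped by default and kept with keep_blank_values=True.
default = parse.parse_qs(qs)
with_blanks = parse.parse_qs(qs, keep_blank_values=True)

# A different separator can be chosen for legacy ';'-separated queries.
semicolons = parse.parse_qs("x=1;y=2", separator=";")

# max_num_fields guards against oversized queries by raising ValueError.
try:
    parse.parse_qs("a=1&b=2&c=3", max_num_fields=2)
    too_many = False
except ValueError:
    too_many = True

print(default)      # {'a': ['1', '2'], 'c': ['hello world']}
print(with_blanks)  # {'a': ['1', '2'], 'b': [''], 'c': ['hello world']}
print(semicolons)   # {'x': ['1'], 'y': ['2']}
print(too_many)     # True
```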

parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')

Parse a query given as a string argument.

Parameters:

  • qs - percent-encoded query string to be parsed.

  • keep_blank_values - flag indicating whether blank values in percent-encoded queries should be treated as blank strings. A true value indicates that blanks should be retained as blank strings. The default false value indicates that blank values are to be ignored and treated as if they were not included.

  • strict_parsing - flag indicating what to do with parsing errors. If false (the default), errors are silently ignored. If true, errors raise a ValueError exception.

  • encoding and errors (optional) - specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method.

  • max_num_fields - int. If set, a ValueError is raised when more than max_num_fields fields are read.

  • separator - str. The symbol to use for separating the query arguments. Defaults to &.

Returns: list of (name, value) pairs, in the order they appear in the query string.
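A short sketch of how the list form preserves order and repeated names, and of what strict_parsing changes. It assumes the module behaves like the standard library's urllib.parse; the inputs are illustrative only.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

# Order and duplicates are preserved, unlike the dictionary from parse_qs().
pairs = parse.parse_qsl("b=2&a=1&a=3")

# A field without '=' is silently skipped by default...
lenient = parse.parse_qsl("a=1&novalue")

# ...but raises ValueError when strict_parsing is true.
try:
    parse.parse_qsl("a=1&novalue", strict_parsing=True)
    strict_failed = False
except ValueError:
    strict_failed = True

print(pairs)          # [('b', '2'), ('a', '1'), ('a', '3')]
print(lenient)        # [('a', '1')]
print(strict_failed)  # True
```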

parse.quote(string, safe='/', encoding=None, errors=None)

Each part of a URL, e.g. the path info, the query, etc., has a different set of reserved characters that must be quoted. The quote function offers a cautious (not minimal) way to quote a string for most of these parts.

RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists the following (un)reserved characters.

  • unreserved = ALPHA / DIGIT / “-” / “.” / “_” / “~”

  • reserved = gen-delims / sub-delims

  • gen-delims = “:” / “/” / “?” / “#” / “[” / “]” / “@”

  • sub-delims = “!” / “$” / “&” / “'” / “(” / “)” / “*” / “+” / “,” / “;” / “=”

Each of the reserved characters is reserved in some component of a URL, but not necessarily in all of them. The quote function %-escapes all characters that are neither in the unreserved chars (“always safe”) nor the additional chars set via the safe arg.

Example:

quote('abc def')
abc%20def

The default for the safe arg is '/'. This character is reserved, but in typical usage quote() is called on a path whose existing slash characters are to be preserved. Since Python 3.7, quote() follows RFC 3986 rather than RFC 2396 when quoting URL strings, so “~” is included in the set of unreserved characters.

string and safe may be either str or bytes objects. encoding and errors must not be specified if string is a bytes object.

The optional encoding and errors parameters specify how to deal with non-ASCII characters, as accepted by the str.encode method.

By default, encoding='utf-8' (characters are encoded with UTF-8), and errors='strict' (unsupported characters raise a UnicodeEncodeError).

Parameters:

  • string - may be either a str or a bytes object.

  • safe - optional parameter specifies additional ASCII characters that should not be quoted - its default value is '/'

  • encoding - specify how to deal with non-ASCII characters, as accepted by the str.encode() method. defaults to 'utf-8'.

  • errors - defaults to ‘strict’, meaning unsupported characters raise a UnicodeEncodeError.

Returns: quoted string.
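The effect of safe and encoding can be sketched as below, assuming the module behaves like the standard library's urllib.parse. The sample strings are illustrative.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

# '/' is kept by default; spaces and non-ASCII characters are %-escaped.
path = parse.quote("/path with spaces/ü")

# With safe='' the slash is escaped too, e.g. for a single path segment.
segment = parse.quote("a/b", safe="")

# A different encoding changes the bytes that get escaped.
latin = parse.quote("ü", encoding="latin-1")

# '~' is unreserved under RFC 3986 and is left alone (Python 3.7+).
tilde = parse.quote("~user")

print(path)     # /path%20with%20spaces/%C3%BC
print(segment)  # a%2Fb
print(latin)    # %FC
print(tilde)    # ~user
```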

parse.quote_from_bytes(bs, safe='/')

Like quote(), but accepts a bytes object rather than a str, and does not perform string-to-bytes encoding. It always returns an ASCII string.

Example:

quote_from_bytes(b'abc def\x3f')
abc%20def%3F

Parameters:

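  • bs - bytes to quote.

  • safe - optional parameter specifies additional ASCII characters that should not be quoted - its default value is '/'.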

Returns: quoted ASCII string.

parse.quote_plus(string, safe='', encoding=None, errors=None)

Like quote(), but also replaces spaces with ‘+’, as required for quoting HTML form values. Plus signs in the original string are escaped unless they are included in safe. Unlike quote(), safe does not default to ‘/’.

Parameters:

  • string - may be either a str or a bytes object.

  • safe - optional parameter specifies additional ASCII characters that should not be quoted - its default value is '' (the empty string).

  • encoding - specify how to deal with non-ASCII characters, as accepted by the str.encode() method. defaults to ‘utf-8’.

  • errors - defaults to ‘strict’, meaning unsupported characters raise a UnicodeEncodeError.

Returns: quoted string.
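A brief sketch of how quote_plus() differs from quote(), assuming the module behaves like the standard library's urllib.parse. The input string is illustrative.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

value = "a b/c+d"

# Spaces become '+', while '/' and literal '+' are escaped by default.
form_value = parse.quote_plus(value)

# Characters listed in safe are left untouched.
relaxed = parse.quote_plus(value, safe="/+")

# For comparison, quote() uses %20 and keeps '/' by default.
plain = parse.quote(value)

print(form_value)  # a+b%2Fc%2Bd
print(relaxed)     # a+b/c+d
print(plain)       # a%20b/c%2Bd
```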

parse.unquote(string, encoding='utf-8', errors='replace')

Replace %xx escapes by their single-character equivalent. The optional encoding and errors parameters specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method. By default, percent-encoded sequences are decoded with UTF-8, and invalid sequences are replaced by a placeholder character.

Example:

unquote('abc%20def')
abc def

Parameters:

  • string - may be either a str or a bytes object.

  • encoding - specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method. defaults to ‘utf-8’.

  • errors - defaults to ‘replace’, meaning invalid sequences are replaced by a placeholder character.

Returns: unquoted string.
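The decoding defaults can be sketched as follows, assuming the module behaves like the standard library's urllib.parse. The escaped strings are illustrative.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

# UTF-8 is assumed by default.
utf8 = parse.unquote("caf%C3%A9")

# Another encoding can be requested explicitly.
latin = parse.unquote("caf%E9", encoding="latin-1")

# Invalid UTF-8 is replaced by U+FFFD under the default errors='replace'.
replaced = parse.unquote("%FF")

# With errors='strict' the same input raises UnicodeDecodeError.
try:
    parse.unquote("%FF", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Note that unquote() leaves '+' alone; it does not turn it into a space.
plus = parse.unquote("a+b%20c")

print(utf8, latin, plus)  # café café a+b c
print(strict_failed)      # True
```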

parse.unwrap(url)

Extract the url from a wrapped URL (that is, a string formatted as <URL:scheme://host/path>, <scheme://host/path>, URL:scheme://host/path or scheme://host/path). If url is not a wrapped URL, it is returned without changes.
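A minimal sketch of the unwrapping behaviour, assuming the module behaves like the standard library's urllib.parse. The example.com URLs are placeholders.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

# Angle brackets and the 'URL:' prefix are stripped when present.
bracketed = parse.unwrap("<URL:http://example.com/path>")
prefixed = parse.unwrap("URL:http://example.com/path")

# A plain URL is returned unchanged.
plain = parse.unwrap("http://example.com/path")

print(bracketed)  # http://example.com/path
print(prefixed)   # http://example.com/path
print(plain)      # http://example.com/path
```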

parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=<function quote_plus>)

Convert a mapping object or a sequence of two-element tuples, which may contain str or bytes objects, to a percent-encoded ASCII text string. If the resulting string is to be used as data for a POST operation with the urlopen() function, it must be encoded to bytes; otherwise a TypeError is raised.

The resulting string is a series of key=value pairs separated by ‘&’ characters, where both key and value are quoted using the quote_via function. By default, quote_plus() is used to quote the values, which means spaces are quoted as a ‘+’ character and ‘/’ characters are encoded as %2F, which follows the standard for GET requests (application/x-www-form-urlencoded). An alternate function that can be passed as quote_via is quote(), which will encode spaces as %20 and not encode ‘/’ characters. For maximum control of what is quoted, use quote and specify a value for safe.

When a sequence of two-element tuples is used as the query argument, the first element of each tuple is a key and the second is a value. The value element in itself can be a sequence and in that case, if the optional parameter doseq evaluates to True, individual key=value pairs separated by ‘&’ are generated for each element of the value sequence for the key. The order of parameters in the encoded string will match the order of parameter tuples in the sequence.

The safe, encoding, and errors parameters are passed down to quote_via (the encoding and errors parameters are only passed when a query element is a str).

To reverse this encoding process, parse_qs() and parse_qsl() are provided in this module to parse query strings into Python data structures. Refer to urllib examples to find out how the urllib.parse.urlencode() method can be used for generating the query string of a URL or data for a POST request.
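The options described in this section can be sketched together as below. The example assumes the module behaves like the standard library's urllib.parse; the keys and values are made up, and safe='/' is passed explicitly in the quote() variant so that slash handling does not depend on defaults.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

params = {"q": "a b", "path": "/x"}

# Default quote_plus(): spaces become '+', '/' becomes %2F.
form = parse.urlencode(params)

# quote() with an explicit safe set: spaces become %20, '/' is kept.
percent = parse.urlencode(params, quote_via=parse.quote, safe="/")

# doseq=True expands sequence values into one key=value pair per element.
multi = parse.urlencode([("tag", ["a", "b"]), ("n", 1)], doseq=True)

# Data for a POST request has to be bytes, not str.
post_data = form.encode("ascii")

# parse_qsl() reverses the encoding.
round_trip = parse.parse_qsl(form)

print(form)        # q=a+b&path=%2Fx
print(percent)     # q=a%20b&path=/x
print(multi)       # tag=a&tag=b&n=1
print(round_trip)  # [('q', 'a b'), ('path', '/x')]
```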

parse.urlparse(url, scheme='', allow_fragments=True)

Parse a URL into six components, returning a 6-item named tuple. This corresponds to the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each tuple item is a string, possibly empty. The components are not broken up into smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters as shown above are not part of the result, except for a leading slash in the path component, which is retained if present.

Example:

urlparse("scheme://netloc/path;parameters?query#fragment")
ParseResult(scheme='scheme', netloc='netloc', path='/path;parameters', params='',
            query='query', fragment='fragment')
o = urlparse("http://docs.python.org:80/3/library/urllib.parse.html?"
             "highlight=params#url-parsing")
o
ParseResult(scheme='http', netloc='docs.python.org:80',
            path='/3/library/urllib.parse.html', params='',
            query='highlight=params', fragment='url-parsing')
o.scheme
http
o.netloc
docs.python.org:80
o.hostname
docs.python.org
o.port
80
o._replace(fragment="").geturl()
http://docs.python.org:80/3/library/urllib.parse.html?highlight=params

Parameters:

  • url - url string to parse.

  • scheme - default addressing scheme, to be used only if the URL does not specify one.

  • allow_fragments - if false, fragment identifiers are not recognized. Instead, they are parsed as part of the path, parameters or query component, and fragment is set to the empty string in the return value.

Returns: named tuple, which means that its items can be accessed by index or as named attributes.
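The scheme and allow_fragments parameters can be sketched as follows, assuming the module behaves like the standard library's urllib.parse. The example.com URLs are placeholders.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

# The default scheme is used only when the URL does not carry one.
with_default = parse.urlparse("//example.com/path", scheme="https")

# With allow_fragments=False, '#...' stays in the preceding component.
no_frag = parse.urlparse("http://example.com/p?q=1#frag", allow_fragments=False)

# Without a leading '//', the host is treated as part of the path.
bare = parse.urlparse("www.example.com/index.html")

print(with_default.scheme, with_default.netloc)  # https example.com
print(no_frag.query, repr(no_frag.fragment))     # q=1#frag ''
print(repr(bare.netloc), bare.path)              # '' www.example.com/index.html
print(with_default[0] == with_default.scheme)    # True (index or attribute)
```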

parse.urlsplit(url, scheme='', allow_fragments=True)

This is similar to urlparse(), but does not split the params from the URL. This should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted. A separate function is needed to separate the path segments and parameters.

Parameters:

  • url - url string to parse.

  • scheme - default addressing scheme, to be used only if the URL does not specify one.

  • allow_fragments - if false, fragment identifiers are not recognized. Instead, they are parsed as part of the path, parameters or query component, and fragment is set to the empty string in the return value.

Returns: 5-item named tuple (scheme, netloc, path, query, fragment), whose items can be accessed by index or as named attributes.
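The difference from urlparse() shows up when path segments carry parameters. A small sketch, assuming the module behaves like the standard library's urllib.parse and using a made-up URL:

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

url = "http://example.com/a;x=1/b;y=2?q=1#top"

# urlsplit() leaves all ';' parameters inside the path.
split = parse.urlsplit(url)

# urlparse() splits off only the parameters of the last path segment.
parsed = parse.urlparse(url)

print(split.path)                  # /a;x=1/b;y=2
print(parsed.path, parsed.params)  # /a;x=1/b y=2
print(len(split), len(parsed))     # 5 6
```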

parse.urlunparse(components)

Put a parsed URL back together again. This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had redundant delimiters (for example, a ? with an empty query; the RFC states that these are equivalent).

Parameters:

  • components - url components, as a six-item iterable such as the tuple returned by urlparse().

Returns: URL string.
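A minimal sketch of rebuilding a URL from six components, assuming the module behaves like the standard library's urllib.parse. The component values are illustrative.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

# Order: scheme, netloc, path, params, query, fragment.
components = ("http", "example.com", "/p", "x=1", "q=2", "f")
rebuilt = parse.urlunparse(components)

# A parse/unparse round trip returns the same URL when nothing is redundant.
original = "http://example.com/p;x=1?q=2#f"
same = parse.urlunparse(parse.urlparse(original)) == original

print(rebuilt)  # http://example.com/p;x=1?q=2#f
print(same)     # True
```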

parse.urlunsplit(components)

Combine the elements of a tuple as returned by urlsplit() into a complete URL as a string. This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had unnecessary delimiters (for example, a ? with an empty query; the RFC states that these are equivalent).

Parameters:

  • components - url components, can be any five-item iterable.

Returns: URL string.
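The "slightly different, but equivalent" case can be sketched like this, assuming the module behaves like the standard library's urllib.parse with its default arguments. The URLs are placeholders.

```python
from urllib import parse  # assumption: this module mirrors urllib.parse

# Any five-item iterable works: scheme, netloc, path, query, fragment.
built = parse.urlunsplit(["https", "example.com", "/path", "q=1", "frag"])

# A trailing '?' with an empty query is dropped on the round trip.
normalized = parse.urlunsplit(parse.urlsplit("http://example.com/?"))

print(built)       # https://example.com/path?q=1#frag
print(normalized)  # http://example.com/
```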
