wget2 2.1.0
Data Structures | |
struct | wget_iri_st |
struct | iri_scheme |
Functions | |
const char * | wget_iri_scheme_get_name (wget_iri_scheme scheme) |
bool | wget_iri_supported (const wget_iri *iri) |
bool | wget_iri_isgendelim (char c) |
bool | wget_iri_issubdelim (char c) |
bool | wget_iri_isreserved (char c) |
bool | wget_iri_isunreserved (char c) |
char * | wget_iri_unescape_inline (char *src) |
char * | wget_iri_unescape_url_inline (char *src) |
void | wget_iri_free_content (wget_iri *iri) |
void | wget_iri_free (wget_iri **iri) |
wget_iri * | wget_iri_parse (const char *url, const char *encoding) |
wget_iri * | wget_iri_clone (const wget_iri *iri) |
const char * | wget_iri_get_connection_part (const wget_iri *iri, wget_buffer *buf) |
const char * | wget_iri_relative_to_abs (const wget_iri *base, const char *val, size_t len, wget_buffer *buf) |
wget_iri * | wget_iri_parse_base (const wget_iri *base, const char *url, const char *encoding) |
int | wget_iri_compare (const wget_iri *iri1, const wget_iri *iri2) |
const char * | wget_iri_escape (const char *src, wget_buffer *buf) |
const char * | wget_iri_escape_path (const char *src, wget_buffer *buf) |
const char * | wget_iri_escape_query (const char *src, wget_buffer *buf) |
const char * | wget_iri_get_escaped_host (const wget_iri *iri, wget_buffer *buf) |
const char * | wget_iri_get_escaped_resource (const wget_iri *iri, wget_buffer *buf) |
char * | wget_iri_get_path (const wget_iri *iri, wget_buffer *buf, const char *encoding) |
char * | wget_iri_get_query_as_filename (const wget_iri *iri, wget_buffer *buf, const char *encoding) |
char * | wget_iri_get_basename (const wget_iri *iri, wget_buffer *buf, const char *encoding, int flags) |
void | wget_iri_set_defaultpage (const char *page) |
int | wget_iri_set_defaultport (wget_iri_scheme scheme, uint16_t port) |
wget_iri_scheme | wget_iri_set_scheme (wget_iri *iri, wget_iri_scheme scheme) |
URI/IRI parsing and manipulation functions.
IRIs are processed according to RFC 3987. Functions that escape certain characters (such as wget_iri_escape()) work according to RFC 3986.
The wget_iri structure represents an IRI. You generate one from a string with wget_iri_parse() or wget_iri_parse_base(). You can use wget_iri_clone() to generate another identical wget_iri.
You can access each field of a wget_iri (such as path) independently. The getters documented here return escaped versions of those parts or offer other conveniences (e.g. wget_iri_get_escaped_host(), wget_iri_get_escaped_resource(), etc.).
URIs/IRIs are all treated internally as UTF-8. The parsing functions that generate a wget_iri structure (wget_iri_parse() and wget_iri_parse_base()) thus convert the input string to UTF-8 before anything else. These functions take an encoding parameter that names the original encoding of that string.
Conversely, the getters (for example, wget_iri_get_path()) can convert the output string from UTF-8 to an encoding of choice. The desired encoding is also specified in the encoding parameter.
The encoding parameter, in all functions that accept it, is a string with the name of a character set supported by GNU libiconv. You can find such a list elsewhere, but popular examples are "utf-8", "utf-16" or "iso-8859-1".
const char * wget_iri_scheme_get_name | ( | wget_iri_scheme | scheme | ) |
[in] | scheme | Scheme to get name for |
Returns: the name of scheme (e.g. "http" or "https"), or NULL if the scheme is not supported.
Maps scheme to its string representation.
bool wget_iri_supported | ( | const wget_iri * | iri | ) |
[in] | iri | An IRI |
Tells whether the IRI's scheme is supported or not.
bool wget_iri_isgendelim | ( | char | c | ) |
[in] | c | A character |
Returns: 1 if c is a generic delimiter, 0 if not.
Tests whether c is a generic delimiter (gen-delim), according to RFC 3986, sect. 2.2.
bool wget_iri_issubdelim | ( | char | c | ) |
[in] | c | A character |
Returns: 1 if c is a subcomponent delimiter, 0 if not.
Tests whether c is a subcomponent delimiter (sub-delim), according to RFC 3986, sect. 2.2.
bool wget_iri_isreserved | ( | char | c | ) |
[in] | c | A character |
Returns: 1 if c is a reserved character, 0 if not.
Tests whether c is a reserved character.
According to RFC 3986, sect. 2.2, the set of reserved characters is formed by the generic delimiters (gen-delims, wget_iri_isgendelim()) and the subcomponent delimiters (sub-delims, wget_iri_issubdelim()).
This function is thus equivalent to:
return wget_iri_isgendelim(c) || wget_iri_issubdelim(c);
bool wget_iri_isunreserved | ( | char | c | ) |
[in] | c | A character |
Returns: 1 if c is an unreserved character, 0 if not.
Tests whether c is an unreserved character.
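The four character classes above can be illustrated with a small standalone sketch. It follows the grammar in RFC 3986, sect. 2.2 and 2.3, and does not reproduce libwget's implementation; the sketch_ names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if c is a non-NUL member of set. */
static bool sketch_in_set(char c, const char *set)
{
	assert(set != NULL);
	return c != '\0' && strchr(set, c) != NULL;
}

/* gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" */
bool sketch_isgendelim(char c)
{
	return sketch_in_set(c, ":/?#[]@");
}

/* sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" */
bool sketch_issubdelim(char c)
{
	return sketch_in_set(c, "!$&'()*+,;=");
}

/* reserved = gen-delims / sub-delims */
bool sketch_isreserved(char c)
{
	return sketch_isgendelim(c) || sketch_issubdelim(c);
}

/* unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" */
bool sketch_isunreserved(char c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
	       (c >= '0' && c <= '9') || sketch_in_set(c, "-._~");
}
```

Note that some characters, such as '%' or a space, are neither reserved nor unreserved, so the two predicates are not complements of each other.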
char * wget_iri_unescape_inline | ( | char * | src | ) |
[in] | src | A string |
Returns: a pointer to src, after the transformation is done.
Unescape a string. All the percent-encoded characters (%XX) are converted back to their original form.
The transformation is done inline, so src will be modified after this function returns. If no percent-encoded characters are found, the string is left untouched.
char * wget_iri_unescape_url_inline | ( | char * | src | ) |
[in] | src | A string |
Returns: a pointer to src, after the transformation is done.
Unescape a string, except for escaped generic delimiters and escaped '%'. All other percent-encoded characters (%XX) are converted back to their original form.
This variant of unescaping is helpful before a URL is parsed, so that the parser recognizes e.g. 'http%3A//' as a relative URL (path) and not as a scheme.
The transformation is done inline, so src will be modified after this function returns. If no characters were unescaped, the string is left untouched.
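The two in-place unescaping variants can be sketched with one standalone function. This is an illustration of the technique under stated assumptions, not libwget's code: sketch_unescape_inline and its keep_delims flag are hypothetical, and leaving %00 encoded (so the result stays a valid C string) is a choice made for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Value of a hex digit, or -1 if c is not one. */
static int sketch_hexval(char c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

/*
 * Decode %XX sequences in place and return src.
 * If keep_delims is true, sequences that decode to a generic delimiter
 * or to '%' stay encoded. Malformed sequences are copied unchanged.
 */
char *sketch_unescape_inline(char *src, bool keep_delims)
{
	assert(src != NULL);

	char *in = src, *out = src;

	while (*in) {
		int hi = -1, lo = -1;

		if (in[0] == '%' && (hi = sketch_hexval(in[1])) >= 0 &&
		    (lo = sketch_hexval(in[2])) >= 0) {
			char c = (char) (hi * 16 + lo);
			/* Sketch-only choice: never decode %00. */
			bool skip = c == '\0' ||
				(keep_delims && strchr(":/?#[]@%", c) != NULL);

			if (!skip) {
				*out++ = c;
				in += 3;
				continue;
			}
		}
		*out++ = *in++; /* literal byte, or a sequence we keep encoded */
	}
	*out = '\0';
	return src;
}
```

Because decoding never lengthens the string, the write position can never overtake the read position, which is what makes the in-place rewrite safe.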
void wget_iri_free_content | ( | wget_iri * | iri | ) |
[in] | iri | An IRI |
Free the heap-allocated content of the provided IRI, but leave the rest of the fields.
This function frees the following fields of wget_iri:
host
path
query
fragment
connection_part
void wget_iri_free | ( | wget_iri ** | iri | ) |
wget_iri * wget_iri_parse | ( | const char * | url, |
const char * | encoding | ||
) |
[in] | url | A URL/IRI |
[in] | encoding | Original encoding of url |
Returns: the parsed IRI (wget_iri).
The host, path, query and fragment parts will be converted to UTF-8 from the encoding given in the parameter encoding. GNU libiconv is used to perform the conversion, so this value should be the name of a valid character set supported by that library, such as "utf-8" or "iso-8859-1".
wget_iri * wget_iri_clone | ( | const wget_iri * | iri | ) |
[in] | iri | An IRI |
Clone the provided IRI.
const char * wget_iri_get_connection_part | ( | const wget_iri * | iri, |
wget_buffer * | buf | ||
) |
[in] | iri | An IRI |
[in] | buf | A buffer, where the resulting string will be put |
Returns: the contents of the buffer buf.
Append the connection part of the IRI iri to buf.
The connection part is formed by the scheme, the hostname, and optionally the port. For example:
https://localhost:8080
https://www.example.com
It has the form https://example.com:8080 if the port was provided when creating the IRI, and the form https://example.com otherwise.
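The rule for building the connection part can be sketched as follows. The struct sketch_conn_iri and its fields are hypothetical stand-ins for the relevant wget_iri fields, and a plain char array replaces wget_buffer to keep the example standalone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the relevant wget_iri fields. */
struct sketch_conn_iri {
	const char *scheme;  /* e.g. "https" */
	const char *host;    /* e.g. "example.com" */
	uint16_t port;       /* resolved port number */
	bool port_given;     /* true if the URL named a port explicitly */
};

/*
 * Write "scheme://host[:port]" to out and return out,
 * or return NULL if out (of capacity size) might be too small.
 */
const char *sketch_connection_part(const struct sketch_conn_iri *iri,
	char *out, size_t size)
{
	assert(iri != NULL && out != NULL);

	/* Worst case: scheme + "://" + host + ":65535" + NUL. */
	size_t need = strlen(iri->scheme) + 3 + strlen(iri->host) + 6 + 1;

	if (need > size)
		return NULL;

	if (iri->port_given)
		snprintf(out, size, "%s://%s:%u", iri->scheme, iri->host,
			(unsigned) iri->port);
	else
		snprintf(out, size, "%s://%s", iri->scheme, iri->host);

	return out;
}
```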
const char * wget_iri_relative_to_abs | ( | const wget_iri * | base, |
const char * | val, | ||
size_t | len, | ||
wget_buffer * | buf | ||
) |
[in] | base | A base IRI |
[in] | val | A path, or another URI |
[in] | len | Length of the string val or -1 |
[in] | buf | Destination buffer, where the result will be copied. |
Returns: a new URI (string) which is based on the base IRI base provided, or NULL in case of error.
Calculates a new URI which is based on the provided IRI base.
Taking the IRI base as a starting point, a new URI is created with the path val, which may be a relative or absolute path, or even a whole URI. The result is returned as a string, and if the buffer buf is provided, it is also placed there.
If val is an absolute path (it begins with a /), it is normalized first. Then the provided IRI's path is replaced by that new path. If it's a relative path, the file name of the base IRI's path is replaced by that path. Finally, if val begins with a scheme (such as https://), that string is returned untouched, and placed in the buffer if provided.
If base is NULL, then val must itself be an absolute URI. Likewise, if buf is NULL, then val must also be an absolute URI.
If len is -1, the length of val will be the result from strlen(val).
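The three cases described above (absolute URI, absolute path, relative path) can be sketched in standalone C. This is a simplified illustration: it performs no path normalization, takes the base as two hypothetical strings (conn for the connection part, base_path for the path without its leading slash) instead of a wget_iri, and writes into a plain char array instead of a wget_buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* True if val starts with "<scheme>://", i.e. is itself an absolute URI. */
static int sketch_has_scheme(const char *val)
{
	size_t n = strcspn(val, ":/?#");

	return n > 0 && strncmp(val + n, "://", 3) == 0;
}

/*
 * conn:      connection part of the base, e.g. "https://example.com"
 * base_path: path of the base without leading slash, e.g. "dir/file.html"
 * Returns out, or NULL if the result does not fit into size bytes.
 */
const char *sketch_relative_to_abs(const char *conn, const char *base_path,
	const char *val, char *out, size_t size)
{
	assert(conn != NULL && base_path != NULL && val != NULL && out != NULL);

	int n;

	if (sketch_has_scheme(val)) {
		/* A whole URI is copied untouched. */
		n = snprintf(out, size, "%s", val);
	} else if (val[0] == '/') {
		/* An absolute path replaces the base path. */
		n = snprintf(out, size, "%s%s", conn, val);
	} else {
		/* A relative path replaces the file name of the base path. */
		const char *slash = strrchr(base_path, '/');
		int dirlen = slash ? (int) (slash - base_path) + 1 : 0;

		n = snprintf(out, size, "%s/%.*s%s", conn, dirlen, base_path, val);
	}

	return (n >= 0 && (size_t) n < size) ? out : NULL;
}
```

With the base https://example.com/dir/file.html, the value other.html resolves next to file.html, while /top/x replaces the whole path.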
wget_iri * wget_iri_parse_base | ( | const wget_iri * | base, |
const char * | url, | ||
const char * | encoding | ||
) |
[in] | base | The base IRI |
[in] | url | A relative/absolute path (or a URI) to be appended to base |
[in] | encoding | The encoding of url (e.g. "utf-8" or "iso-8859-1") |
Generate a new IRI by using the provided IRI base as a base and the path url.
This is equivalent to:
wget_iri *iri = wget_iri_parse(wget_iri_relative_to_abs(base, url, strlen(url), NULL), encoding);
return iri;
As such, url can be a relative or absolute path, or another URI.
If base is NULL, then the parameter url must itself be an absolute URI.
int wget_iri_compare | ( | const wget_iri * | iri1, |
const wget_iri * | iri2 | ||
) |
[in] | iri1 | An IRI |
[in] | iri2 | Another IRI |
Compare two IRIs.
Comparison is performed according to RFC 2616, sect. 3.2.3.
This function uses wget_strcasecmp() to compare the various parts of the IRIs, so a negative return value indicates that iri1 is less than iri2, whereas a positive value indicates that iri1 is greater than iri2.
const char * wget_iri_escape | ( | const char * | src, |
wget_buffer * | buf | ||
) |
[in] | src | A string, whose reserved characters are to be percent-encoded |
[in] | buf | A buffer where the result will be copied. |
Returns: the contents of the buffer buf after src has been encoded.
Escapes (using percent-encoding) all the reserved characters in the string src.
If src is NULL, the contents of the buffer buf are returned. buf cannot be NULL.
const char * wget_iri_escape_path | ( | const char * | src, |
wget_buffer * | buf | ||
) |
[in] | src | A string, whose reserved characters are to be percent-encoded |
[in] | buf | A buffer where the result will be copied. |
Returns: the contents of the buffer buf after src has been encoded as described in https://datatracker.ietf.org/doc/html/rfc7230#section-5.3.1.
Escapes the path part of the URI so that it is suitable for GET/POST requests (origin-form).
origin-form = absolute-path [ "?" query ]
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment-nz = 1*pchar
segment = *pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
const char * wget_iri_escape_query | ( | const char * | src, |
wget_buffer * | buf | ||
) |
[in] | src | A string, whose reserved characters are to be percent-encoded |
[in] | buf | A buffer where the result will be copied. |
Returns: the contents of the buffer buf after src has been encoded.
Escapes (using percent-encoding) all the reserved characters in the string src (just like wget_iri_escape()), but excluding the equal sign = and the ampersand &. This function is thus ideally suited for query parts of URIs.
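The difference between general escaping and query escaping can be sketched with one standalone function that takes a set of characters to leave alone. This is an assumption-laden illustration, not libwget's implementation: sketch_escape and its keep parameter are hypothetical, a plain char array replaces wget_buffer, and the sketch encodes every byte that is not unreserved (RFC 3986, sect. 2.3), which may be a wider set than the library actually escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" (RFC 3986, sect. 2.3) */
static int sketch_is_unreserved(unsigned char c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
	       (c >= '0' && c <= '9') ||
	       c == '-' || c == '.' || c == '_' || c == '~';
}

/*
 * Percent-encode src into out (capacity size).
 * Unreserved bytes and bytes listed in keep are copied as-is.
 * Returns out, or NULL if out is too small.
 */
const char *sketch_escape(const char *src, const char *keep,
	char *out, size_t size)
{
	static const char hex[] = "0123456789ABCDEF";

	assert(src != NULL && keep != NULL && out != NULL);

	size_t pos = 0;

	for (; *src; src++) {
		unsigned char c = (unsigned char) *src;

		if (sketch_is_unreserved(c) || strchr(keep, c) != NULL) {
			if (pos + 1 >= size)
				return NULL; /* no room for byte + NUL */
			out[pos++] = (char) c;
		} else {
			if (pos + 3 >= size)
				return NULL; /* no room for %XX + NUL */
			out[pos++] = '%';
			out[pos++] = hex[c >> 4];
			out[pos++] = hex[c & 15];
		}
	}

	if (pos >= size)
		return NULL;
	out[pos] = '\0';
	return out;
}
```

Calling the sketch with an empty keep string mimics general escaping, while passing "=&" mimics the query variant: the input a b&c=d becomes a%20b%26c%3Dd in the first case and a%20b&c=d in the second.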
const char * wget_iri_get_escaped_host | ( | const wget_iri * | iri, |
wget_buffer * | buf | ||
) |
[in] | iri | An IRI |
[in] | buf | A buffer, where the resulting string will be put |
Returns: the contents of the buffer buf.
Return the host part of the provided IRI. It is placed in the buffer buf and also returned as a const char *.
The host is escaped using wget_iri_escape().
const char * wget_iri_get_escaped_resource | ( | const wget_iri * | iri, |
wget_buffer * | buf | ||
) |
[in] | iri | An IRI |
[in] | buf | A buffer, where the resulting string will be put |
Returns: the contents of the buffer buf.
Return the resource string, suitable for use in HTTP requests. Details:
https://datatracker.ietf.org/doc/html/rfc7230#section-3.1.1
https://datatracker.ietf.org/doc/html/rfc7230#section-2.7
https://datatracker.ietf.org/doc/html/rfc3986#section-3.3
The resource string consists of the path, plus the query part, if present. Example:
/foo/bar/?param_1=one&param_2=two
The path and the query are escaped using wget_iri_escape_path() and wget_iri_escape_query(), respectively.
The resulting string is placed in the buffer buf and also returned as a const char *.
char * wget_iri_get_path | ( | const wget_iri * | iri, |
wget_buffer * | buf, | ||
const char * | encoding | ||
) |
[in] | iri | An IRI |
[in] | buf | A buffer, where the resulting string will be put |
[in] | encoding | Character set the string should be converted to |
Returns: the contents of the buffer buf.
Get the path part of the provided IRI.
The path is appended to buf. If buf is non-empty and does not end with a path separator (/), then one is added before the path is appended to buf.
If encoding is provided, this function will try to convert the path (which is originally in UTF-8) to that encoding.
char * wget_iri_get_query_as_filename | ( | const wget_iri * | iri, |
wget_buffer * | buf, | ||
const char * | encoding | ||
) |
[in] | iri | An IRI |
[in] | buf | A buffer, where the resulting string will be put |
[in] | encoding | Character set the string should be converted to |
Returns: the contents of the buffer buf.
Take the query part, and escape the path separators (/), so that it can be used as part of a filename.
The resulting string is placed in the buffer buf and also returned. If the provided IRI has no query part, then the original contents of buf are returned and buf is kept untouched.
If encoding is provided, this function will try to convert the query (which is originally in UTF-8) to that encoding.
char * wget_iri_get_basename | ( | const wget_iri * | iri, |
wget_buffer * | buf, | ||
const char * | encoding, | ||
int | flags | ||
) |
[in] | iri | An IRI |
[in] | buf | A buffer, where the resulting string will be put |
[in] | encoding | Character set the string should be converted to |
Returns: the contents of the buffer buf.
Get the filename of the path of the provided IRI.
This is similar to wget_iri_get_path(), but instead of returning the whole path it only returns the substring after the last occurrence of /. In other words, the filename of the path.
This is also known as the "basename" in the UNIX world, and the output of this function would be equivalent to the output of the basename(1) tool.
If buf is empty, the filename is copied into it. If buf is not empty, the filename is appended to it after a path separator (/).
If encoding is provided, this function will try to convert the path (which is originally in UTF-8) to that encoding.
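The basename rule (keep only what follows the last /, and join it to a non-empty buffer with a separator) can be sketched in standalone C. The function sketch_append_basename is hypothetical; it uses a NUL-terminated char array in place of wget_buffer and ignores the encoding and flags parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Append the part of path after its last '/' to buf, a NUL-terminated
 * string with total capacity size. A '/' is inserted first if buf is
 * not empty. Returns buf, or NULL if the result does not fit.
 */
char *sketch_append_basename(const char *path, char *buf, size_t size)
{
	assert(path != NULL && buf != NULL);

	const char *slash = strrchr(path, '/');
	const char *name = slash ? slash + 1 : path;
	size_t used = strlen(buf);

	if (used >= size)
		return NULL;

	int n = snprintf(buf + used, size - used, "%s%s", used ? "/" : "", name);

	return (n >= 0 && (size_t) n < size - used) ? buf : NULL;
}
```

For the path dir/sub/file.html and an empty buffer this yields file.html; with a buffer already holding out, the result is out/file.html.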
int wget_iri_set_defaultport | ( | wget_iri_scheme | scheme, |
uint16_t | port | ||
) |
scheme | The scheme for the new default port |
port | The new default port value for the given scheme |
Set the default port for the given scheme.
wget_iri_scheme wget_iri_set_scheme | ( | wget_iri * | iri, |
wget_iri_scheme | scheme | ||
) |
[in] | iri | An IRI |
[in] | scheme | A scheme, such as http or https. |
Set the scheme of the provided IRI. The IRI's original scheme is replaced by the new one.
If the IRI was using a default port (such as 80 for HTTP or 443 for HTTPS) that port is modified as well to match the default port of the new scheme. Otherwise the port is left untouched.