wget2  2.0.0
URIs/IRIs

Data Structures

struct  wget_iri_st
 
struct  iri_scheme
 

Functions

const char * wget_iri_scheme_get_name (wget_iri_scheme scheme)
 
bool wget_iri_supported (const wget_iri *iri)
 
bool wget_iri_isgendelim (char c)
 
bool wget_iri_issubdelim (char c)
 
bool wget_iri_isreserved (char c)
 
bool wget_iri_isunreserved (char c)
 
char * wget_iri_unescape_inline (char *src)
 
char * wget_iri_unescape_url_inline (char *src)
 
void wget_iri_free_content (wget_iri *iri)
 
void wget_iri_free (wget_iri **iri)
 
wget_iri * wget_iri_parse (const char *url, const char *encoding)
 
wget_iri * wget_iri_clone (const wget_iri *iri)
 
const char * wget_iri_get_connection_part (const wget_iri *iri, wget_buffer *buf)
 
const char * wget_iri_relative_to_abs (const wget_iri *base, const char *val, size_t len, wget_buffer *buf)
 
wget_iri * wget_iri_parse_base (const wget_iri *base, const char *url, const char *encoding)
 
int wget_iri_compare (wget_iri *iri1, wget_iri *iri2)
 
const char * wget_iri_escape (const char *src, wget_buffer *buf)
 
const char * wget_iri_escape_path (const char *src, wget_buffer *buf)
 
const char * wget_iri_escape_query (const char *src, wget_buffer *buf)
 
const char * wget_iri_get_escaped_host (const wget_iri *iri, wget_buffer *buf)
 
const char * wget_iri_get_escaped_resource (const wget_iri *iri, wget_buffer *buf)
 
char * wget_iri_get_path (const wget_iri *iri, wget_buffer *buf, const char *encoding)
 
char * wget_iri_get_query_as_filename (const wget_iri *iri, wget_buffer *buf, const char *encoding)
 
char * wget_iri_get_basename (const wget_iri *iri, wget_buffer *buf, const char *encoding, int flags)
 
void wget_iri_set_defaultpage (const char *page)
 
int wget_iri_set_defaultport (wget_iri_scheme scheme, uint16_t port)
 
wget_iri_scheme wget_iri_set_scheme (wget_iri *iri, wget_iri_scheme scheme)
 

Detailed Description

URI/IRI parsing and manipulation functions.

IRIs are processed according to RFC 3987. Functions that escape certain characters (such as wget_iri_escape()) work according to RFC 3986.

The wget_iri structure represents an IRI. You generate one from a string with wget_iri_parse() or wget_iri_parse_base(). You can use wget_iri_clone() to generate another identical wget_iri.

You can access each field of a wget_iri (such as path) independently. The getters documented here (e.g. wget_iri_get_escaped_host(), wget_iri_get_escaped_resource()) return those parts in escaped form, or simply offer a more convenient way to obtain them.

URIs/IRIs are handled internally as UTF-8. The parsing functions that generate a wget_iri structure (wget_iri_parse() and wget_iri_parse_base()) therefore convert the input string to UTF-8 before doing anything else. These functions take an encoding parameter that names the original encoding of that string.

Conversely, the getters (for example, wget_iri_get_path()) can convert the output string from UTF-8 to an encoding of choice. The desired encoding is also specified in the encoding parameter.

The encoding parameter, in all functions that accept it, is a string with the name of a character set supported by GNU libiconv. Popular examples are "utf-8", "utf-16" and "iso-8859-1".

Function Documentation

◆ wget_iri_scheme_get_name()

const char* wget_iri_scheme_get_name ( wget_iri_scheme  scheme)
Parameters
[in]  scheme  Scheme to get name for
Returns
Name of scheme (e.g. "http" or "https") or NULL if not supported

Maps scheme to its string representation.

◆ wget_iri_supported()

bool wget_iri_supported ( const wget_iri *  iri)
Parameters
[in]  iri  An IRI
Returns
1 if the scheme is supported, 0 if not

Tells whether the IRI's scheme is supported or not.

◆ wget_iri_isgendelim()

bool wget_iri_isgendelim ( char  c)
Parameters
[in]  c  A character
Returns
1 if c is a generic delimiter, 0 if not

Tests whether c is a generic delimiter (gen-delim), according to RFC 3986, sect. 2.2.

◆ wget_iri_issubdelim()

bool wget_iri_issubdelim ( char  c)
Parameters
[in]  c  A character
Returns
1 if c is a subcomponent delimiter, 0 if not

Tests whether c is a subcomponent delimiter (sub-delim) according to RFC 3986, sect. 2.2.

◆ wget_iri_isreserved()

bool wget_iri_isreserved ( char  c)
Parameters
[in]  c  A character
Returns
1 if c is a reserved character, 0 if not

Tests whether c is a reserved character.

According to RFC 3986, sect. 2.2, the set of reserved characters is formed by the generic delimiters (gen-delims, wget_iri_isgendelim()) and the subcomponent delimiters (sub-delims, wget_iri_issubdelim()).

This function is thus equivalent to:

return wget_iri_isgendelim(c) || wget_iri_issubdelim(c);
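For illustration, the character classes can be written out in a few lines. This is a minimal standalone sketch built from the gen-delims and sub-delims sets in RFC 3986, sect. 2.2. The sketch_* functions are hypothetical stand-ins and not libwget's implementation.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the libwget code. */

/* gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" */
bool sketch_isgendelim(char c)
{
    /* strchr() would also match the terminating NUL, so exclude it */
    return c != '\0' && strchr(":/?#[]@", c) != NULL;
}

/* sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" */
bool sketch_issubdelim(char c)
{
    return c != '\0' && strchr("!$&'()*+,;=", c) != NULL;
}

/* reserved = gen-delims / sub-delims */
bool sketch_isreserved(char c)
{
    return sketch_isgendelim(c) || sketch_issubdelim(c);
}
```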

◆ wget_iri_isunreserved()

bool wget_iri_isunreserved ( char  c)
Parameters
[in]  c  A character
Returns
1 if c is an unreserved character, 0 if not

Tests whether c is an unreserved character.
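As a point of reference, RFC 3986, sect. 2.3 defines the unreserved set as letters, digits, and the four characters "-", ".", "_" and "~". The following standalone sketch tests for that set. The name sketch_isunreserved is hypothetical, and the explicit ASCII ranges are a choice made here to avoid locale-dependent behaviour.

```c
#include <stdbool.h>

/* Hypothetical stand-in for illustration; not the libwget code.
 * unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" */
bool sketch_isunreserved(char c)
{
    return (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}
```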

◆ wget_iri_unescape_inline()

char* wget_iri_unescape_inline ( char *  src)
Parameters
[in]  src  A string
Returns
A pointer to src, after the transformation is done

Unescape a string. All the percent-encoded characters (%XX) are converted back to their original form.

The transformation is done inline, so src will be modified after this function returns. If no percent-encoded characters are found, the string is left untouched.
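In-place decoding works because the decoded string is never longer than the input, so a read pointer and a write pointer can share one buffer. The sketch below illustrates the idea. sketch_unescape_inline and hex_value are hypothetical names, and copying malformed sequences (such as a trailing "%") through literally is an assumption of this sketch, not documented libwget behaviour.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical helper: decodes %XX sequences in place and returns src. */
char *sketch_unescape_inline(char *src)
{
    char *in = src, *out = src;

    if (!src)
        return NULL;

    while (*in) {
        if (in[0] == '%' && isxdigit((unsigned char) in[1]) && isxdigit((unsigned char) in[2])) {
            /* combine the two hex digits into one byte */
            *out++ = (char) ((hex_value((unsigned char) in[1]) << 4) |
                             hex_value((unsigned char) in[2]));
            in += 3;
        } else {
            *out++ = *in++; /* ordinary or malformed input: copy as is */
        }
    }
    *out = '\0';

    return src;
}
```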

◆ wget_iri_unescape_url_inline()

char* wget_iri_unescape_url_inline ( char *  src)
Parameters
[in]  src  A string
Returns
A pointer to src, after the transformation is done

Unescape a string, except for escaped generic delimiters and the escaped percent sign ('%'). All other percent-encoded characters (%XX) are converted back to their original form.

This variant of unescaping is helpful before a URL is parsed, so that the parser recognizes e.g. 'http%3A//' as a relative URL (path) and not as a scheme.

The transformation is done inline, so src will be modified after this function returns. If no characters were unescaped, the string is left untouched.
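The selective behaviour can be sketched as follows: decode each %XX sequence, then keep it escaped if the decoded byte is a generic delimiter or '%'. sketch_unescape_url_inline is a hypothetical name. Leaving "%00" escaped is an extra assumption of this sketch and is not taken from the libwget documentation.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: like plain unescaping, but %XX stays escaped when it
 * encodes a gen-delim (":/?#[]@") or '%' itself. */
char *sketch_unescape_url_inline(char *src)
{
    char *in = src, *out = src;

    if (!src)
        return NULL;

    while (*in) {
        if (in[0] == '%' && isxdigit((unsigned char) in[1]) && isxdigit((unsigned char) in[2])) {
            char hex[3] = { in[1], in[2], '\0' };
            char c = (char) strtol(hex, NULL, 16);

            /* assumption of this sketch: "%00" is also left escaped */
            if (c != '\0' && c != '%' && !strchr(":/?#[]@", c)) {
                *out++ = c; /* safe to decode */
                in += 3;
                continue;
            }
        }
        *out++ = *in++; /* copy literally, including escapes that are kept */
    }
    *out = '\0';

    return src;
}
```

With this sketch, "http%3A//a%20b" becomes "http%3A//a b": the space is decoded, the colon is not.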

◆ wget_iri_free_content()

void wget_iri_free_content ( wget_iri *  iri)
Parameters
[in]  iri  An IRI

Free the heap-allocated content of the provided IRI, but leave the rest of the fields.

This function frees the following fields of wget_iri:

  • host
  • path
  • query
  • fragment
  • connection_part

◆ wget_iri_free()

void wget_iri_free ( wget_iri **  iri)
Parameters
[in]  iri  A pointer to a pointer to an IRI (a wget_iri)

Destroy a wget_iri structure.

The provided pointer is set to NULL.
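Taking a pointer to a pointer lets the function reset the caller's variable, which guards against accidental reuse and makes a second call harmless. The idiom is sketched below with a hypothetical struct sketch_obj. It does not reproduce the layout of wget_iri or the internals of wget_iri_free().

```c
#include <stdlib.h>

/* Hypothetical object with one heap-allocated member. */
struct sketch_obj {
    char *host;
};

/* Frees the object and sets the caller's pointer to NULL.
 * Calling it again on the same variable is a no-op. */
void sketch_free(struct sketch_obj **obj)
{
    if (obj && *obj) {
        free((*obj)->host);
        free(*obj);
        *obj = NULL;
    }
}
```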

◆ wget_iri_parse()

wget_iri* wget_iri_parse ( const char *  url,
const char *  encoding 
)
Parameters
[in]  url  A URL/IRI
[in]  encoding  Original encoding of url
Returns
A libwget IRI (wget_iri)

The host, path, query and fragment parts will be converted to UTF-8 from the encoding given in the parameter encoding. GNU libiconv is used to perform the conversion, so this value should be the name of a valid character set supported by that library, such as "utf-8" or "iso-8859-1".

◆ wget_iri_clone()

wget_iri* wget_iri_clone ( const wget_iri *  iri)
Parameters
[in]  iri  An IRI
Returns
A new IRI, with the exact same contents as the provided one.

Clone the provided IRI.

◆ wget_iri_get_connection_part()

const char* wget_iri_get_connection_part ( const wget_iri *  iri,
wget_buffer *  buf
)
Parameters
[in]  iri  An IRI
[in]  buf  A buffer, where the resulting string will be put
Returns
The contents of the buffer buf

Append the connection part of the IRI iri to buf.

The connection part is formed by the scheme, the hostname, and optionally the port. For example:

https://localhost:8080
https://www.example.com

It may be of the form https://example.com:8080 if the port was provided when creating the IRI or of the form https://example.com otherwise.
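The two forms can be illustrated with a small formatting routine. The struct sketch_conn below is a hypothetical, simplified view of the fields involved, and the sketch writes into a plain char array rather than a wget_buffer.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified view of the fields involved. */
struct sketch_conn {
    const char *scheme; /* e.g. "https" */
    const char *host;   /* e.g. "example.com" */
    uint16_t port;
    bool port_given;    /* true if the port appeared in the original URL */
};

/* Writes "scheme://host[:port]" into out and returns out. */
const char *sketch_connection_part(const struct sketch_conn *c, char *out, size_t size)
{
    if (c->port_given)
        snprintf(out, size, "%s://%s:%u", c->scheme, c->host, (unsigned) c->port);
    else
        snprintf(out, size, "%s://%s", c->scheme, c->host);

    return out;
}
```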

◆ wget_iri_relative_to_abs()

const char* wget_iri_relative_to_abs ( const wget_iri *  base,
const char *  val,
size_t  len,
wget_buffer *  buf
)
Parameters
[in]  base  A base IRI
[in]  val  A path, or another URI
[in]  len  Length of the string val or -1
[in]  buf  Destination buffer, where the result will be copied.
Returns
A new URI (string) which is based on the base IRI base provided, or NULL in case of error.

Calculates a new URI which is based on the provided IRI base.

Taking the IRI base as a starting point, a new URI is created with the path val, which may be a relative or absolute path, or even a whole URI. The result is returned as a string, and if the buffer buf is provided, it is also placed there.

If val is an absolute path (it begins with a /), it is normalized first. Then the provided IRI's path is replaced by that new path. If it's a relative path, the file name of the base IRI's path is replaced by that path. Finally, if val begins with a scheme (such as https://) that string is returned untouched, and placed in the buffer if provided.

If base is NULL, then val must itself be an absolute URI. Likewise, if buf is NULL, then val must also be an absolute URI.

If len is -1, the length of val will be the result from strlen(val).
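The three cases described above (full URI, absolute path, relative path) can be sketched on plain strings. Everything here is hypothetical: sketch_relative_to_abs takes the connection part and the base path as separate strings, detects a scheme with a crude "://" search, and performs no path normalization.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper working on plain strings.
 * conn is "scheme://host[:port]"; base_path is the base path without its
 * leading slash (e.g. "a/b/c.html"). "." and ".." are not resolved. */
const char *sketch_relative_to_abs(const char *conn, const char *base_path,
                                   const char *val, char *out, size_t size)
{
    if (strstr(val, "://")) {
        /* val already carries a scheme: return it untouched */
        snprintf(out, size, "%s", val);
    } else if (val[0] == '/') {
        /* absolute path: replaces the whole base path */
        snprintf(out, size, "%s%s", conn, val);
    } else {
        /* relative path: replaces only the file name of the base path */
        const char *slash = strrchr(base_path, '/');
        int dirlen = slash ? (int) (slash - base_path + 1) : 0;

        snprintf(out, size, "%s/%.*s%s", conn, dirlen, base_path, val);
    }

    return out;
}
```

For example, with the base "https://example.com" and "a/b/c.html", the relative value "d.html" yields "https://example.com/a/b/d.html".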

◆ wget_iri_parse_base()

wget_iri* wget_iri_parse_base ( const wget_iri *  base,
const char *  url,
const char *  encoding
)
Parameters
[in]  base  The base IRI
[in]  url  A relative/absolute path (or a URI) to be appended to base
[in]  encoding  The encoding of url (e.g. "utf-8" or "iso-8859-1")
Returns
A new IRI

Generate a new IRI by using the provided IRI base as a base and the path url.

This is equivalent to:

wget_iri *iri = wget_iri_parse(wget_iri_relative_to_abs(base, url, strlen(url), NULL), encoding);
return iri;

As such, url can be a relative or absolute path, or another URI.

If base is NULL, then the parameter url must itself be an absolute URI.

◆ wget_iri_compare()

int wget_iri_compare ( wget_iri *  iri1,
wget_iri *  iri2
)
Parameters
[in]  iri1  An IRI
[in]  iri2  Another IRI
Returns
0 if both IRIs are equal according to RFC 2616 or a non-zero value otherwise

Compare two IRIs.

Comparison is performed according to RFC 2616, sect. 3.2.3.

This function uses wget_strcasecmp() to compare the various parts of the IRIs, so a negative return value indicates that iri1 is less than iri2, whereas a positive value indicates that iri1 is greater than iri2.

◆ wget_iri_escape()

const char* wget_iri_escape ( const char *  src,
wget_buffer *  buf
)
Parameters
[in]  src  A string, whose reserved characters are to be percent-encoded
[in]  buf  A buffer where the result will be copied.
Returns
The contents of the buffer buf after src has been encoded.

Escapes (using percent-encoding) all the reserved characters in the string src.

If src is NULL, the contents of the buffer buf are returned. buf cannot be NULL.

◆ wget_iri_escape_path()

const char* wget_iri_escape_path ( const char *  src,
wget_buffer *  buf
)
Parameters
[in]  src  A string, whose reserved characters are to be percent-encoded
[in]  buf  A buffer where the result will be copied.
Returns
The contents of the buffer buf after src has been encoded as described in https://datatracker.ietf.org/doc/html/rfc7230#section-5.3.1.

Escapes the path part of the URI so that it is suitable for GET/POST requests (origin-form):

origin-form   = absolute-path [ "?" query ]
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment-nz    = 1*pchar
segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

◆ wget_iri_escape_query()

const char* wget_iri_escape_query ( const char *  src,
wget_buffer *  buf
)
Parameters
[in]  src  A string, whose reserved characters are to be percent-encoded
[in]  buf  A buffer where the result will be copied.
Returns
The contents of the buffer buf after src has been encoded.

Escapes (using percent-encoding) all the reserved characters in the string src (just like wget_iri_escape()), but excluding the equal sign = and the ampersand &. This function is thus ideally suited for query parts of URIs.
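The difference between general escaping and query escaping comes down to a set of characters that are allowed through unencoded. The sketch below makes that set a parameter. sketch_escape_except is a hypothetical helper writing into a plain char array. It encodes every byte outside the RFC 3986 unreserved set, which covers the reserved characters but also others such as the space; whether libwget encodes exactly the same set is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of libwget. Copies src to out and
 * percent-encodes every byte that is neither unreserved nor listed in keep.
 * Returns out, or NULL if out is too small. */
char *sketch_escape_except(const char *src, const char *keep, char *out, size_t size)
{
    static const char unreserved[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    size_t n = 0;

    if (size == 0)
        return NULL;

    for (; *src; src++) {
        int literal = strchr(unreserved, *src) || strchr(keep, *src);
        size_t need = literal ? 1 : 3;

        if (n + need >= size)
            return NULL; /* no room for the data plus the terminating NUL */

        if (literal)
            out[n++] = *src;
        else
            n += (size_t) snprintf(out + n, size - n, "%%%02X",
                                   (unsigned) (unsigned char) *src);
    }
    out[n] = '\0';

    return out;
}
```

Passing "=&" as keep mimics the query variant, so "a=1&b=x y" becomes "a=1&b=x%20y". Passing an empty keep string encodes the equal sign as well.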

◆ wget_iri_get_escaped_host()

const char* wget_iri_get_escaped_host ( const wget_iri *  iri,
wget_buffer *  buf
)
Parameters
[in]  iri  An IRI
[in]  buf  A buffer, where the resulting string will be put
Returns
The contents of the buffer buf

Return the host part of the provided IRI. It is placed in the buffer buf and also returned as a const char *.

The host is escaped using wget_iri_escape().

◆ wget_iri_get_escaped_resource()

const char* wget_iri_get_escaped_resource ( const wget_iri *  iri,
wget_buffer *  buf
)
Parameters
[in]  iri  An IRI
[in]  buf  A buffer, where the resulting string will be put
Returns
The contents of the buffer buf

Return the resource string, suitable for use in HTTP requests. Details:

  • https://datatracker.ietf.org/doc/html/rfc7230#section-3.1.1
  • https://datatracker.ietf.org/doc/html/rfc7230#section-2.7
  • https://datatracker.ietf.org/doc/html/rfc3986#section-3.3

The resource string consists of the path plus the query part, if present. Example:

/foo/bar/?param_1=one&param_2=two

Both the path and the query are escaped using wget_iri_escape_path() and wget_iri_escape_query(), respectively.

The resulting string is placed in the buffer buf and also returned as a const char *.

◆ wget_iri_get_path()

char* wget_iri_get_path ( const wget_iri *  iri,
wget_buffer *  buf,
const char *  encoding
)
Parameters
[in]  iri  An IRI
[in]  buf  A buffer, where the resulting string will be put
[in]  encoding  Character set the string should be converted to
Returns
The contents of the buffer buf

Get the path part of the provided IRI.

The path is appended to buf. If buf is non-empty and does not end with a path separator (/), then one is added before the path is appended to buf.

If encoding is provided, this function will try to convert the path (which is originally in UTF-8) to that encoding.
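The separator rule for appending to a non-empty buffer can be sketched on a plain NUL-terminated char array. sketch_append_path is a hypothetical helper. It ignores character set conversion and does not use wget_buffer.

```c
#include <string.h>

/* Hypothetical helper: appends path to the string already in buf, inserting
 * '/' first if buf is non-empty and does not end with one.
 * Returns buf, or NULL if the result would not fit in size bytes. */
char *sketch_append_path(char *buf, size_t size, const char *path)
{
    size_t len = strlen(buf);
    int need_sep = len > 0 && buf[len - 1] != '/';

    if (len + (size_t) need_sep + strlen(path) >= size)
        return NULL;

    if (need_sep)
        buf[len++] = '/';
    strcpy(buf + len, path);

    return buf;
}
```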

◆ wget_iri_get_query_as_filename()

char* wget_iri_get_query_as_filename ( const wget_iri *  iri,
wget_buffer *  buf,
const char *  encoding
)
Parameters
[in]  iri  An IRI
[in]  buf  A buffer, where the resulting string will be put
[in]  encoding  Character set the string should be converted to
Returns
The contents of the buffer buf

Take the query part, and escape the path separators (/), so that it can be used as part of a filename.

The resulting string will be placed in the buffer buf and also returned as a const char *. If the provided IRI has no query part, then the original contents of buf are returned and buf is kept untouched.

If encoding is provided, this function will try to convert the query (which is originally in UTF-8) to that encoding.
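Escaping the separators keeps a query such as "a=1/2" from being read as a directory plus a file name. The sketch below replaces each "/" with "%2F". The choice of "%2F" as the escaped form is an assumption, as is the hypothetical helper sketch_query_as_filename, which writes to a plain char array and skips character set conversion.

```c
#include <stddef.h>

/* Hypothetical helper: copies query to out, writing "%2F" for every '/'.
 * Returns out, or NULL if out is too small. */
char *sketch_query_as_filename(const char *query, char *out, size_t size)
{
    size_t n = 0;

    for (; *query; query++) {
        if (*query == '/') {
            if (n + 3 >= size)
                return NULL;
            out[n++] = '%';
            out[n++] = '2';
            out[n++] = 'F';
        } else {
            if (n + 1 >= size)
                return NULL;
            out[n++] = *query;
        }
    }

    if (n >= size)
        return NULL; /* no room for the terminating NUL */
    out[n] = '\0';

    return out;
}
```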

◆ wget_iri_get_basename()

char* wget_iri_get_basename ( const wget_iri *  iri,
wget_buffer *  buf,
const char *  encoding,
int  flags
)
Parameters
[in]  iri  An IRI
[in]  buf  A buffer, where the resulting string will be put
[in]  encoding  Character set the string should be converted to
Returns
The contents of the buffer buf

Get the filename of the path of the provided IRI.

This is similar to wget_iri_get_path(), but instead of returning the whole path it only returns the substring after the last occurrence of /. In other words, the filename of the path.

This is also known as the "basename" in the UNIX world, and the output of this function would be equivalent to the output of the basename(1) tool.

The basename is copied into buf if buf is empty. Otherwise it is appended to buf after a path separator (/).

If encoding is provided, this function will try to convert the path (which is originally in UTF-8) to that encoding.
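Extracting the substring after the last "/" is a one-line operation in C. The hypothetical helper below shows it on a plain string. Unlike the basename(1) tool, this sketch returns an empty string for a path that ends in "/".

```c
#include <string.h>

/* Hypothetical helper: returns a pointer to the part of path that follows
 * the last '/', or path itself if it contains no '/'. */
const char *sketch_basename(const char *path)
{
    const char *slash = strrchr(path, '/');

    return slash ? slash + 1 : path;
}
```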

◆ wget_iri_set_defaultport()

int wget_iri_set_defaultport ( wget_iri_scheme  scheme,
uint16_t  port 
)
Parameters
scheme  The scheme for the new default port
port  The new default port value for the given scheme
Returns
0 on success, -1 if the scheme is unknown

Set the default port for the given scheme.
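One way to picture this function is a small table of ports indexed by scheme, with a range check that yields the -1 result. The sketch below is purely illustrative: enum demo_scheme and the table demo_default_ports are hypothetical and do not reflect how libwget stores its defaults.

```c
#include <stdint.h>

/* Hypothetical scheme identifiers and default port table. */
enum demo_scheme { DEMO_HTTP, DEMO_HTTPS, DEMO_SCHEME_COUNT };

uint16_t demo_default_ports[DEMO_SCHEME_COUNT] = { 80, 443 };

/* Returns 0 on success, -1 if scheme is not a known identifier. */
int demo_set_defaultport(int scheme, uint16_t port)
{
    if (scheme < 0 || scheme >= DEMO_SCHEME_COUNT)
        return -1;

    demo_default_ports[scheme] = port;
    return 0;
}
```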

◆ wget_iri_set_scheme()

wget_iri_scheme wget_iri_set_scheme ( wget_iri *  iri,
wget_iri_scheme  scheme
)
Parameters
[in]  iri  An IRI
[in]  scheme  A scheme, such as http or https.
Returns
The original scheme of the IRI (i.e. before the replacement)

Set the scheme of the provided IRI. The IRI's original scheme is replaced by the new one.

If the IRI was using a default port (such as 80 for HTTP or 443 for HTTPS) that port is modified as well to match the default port of the new scheme. Otherwise the port is left untouched.
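The port rule can be sketched with a hypothetical two-field struct. The names sketch_iri, sketch_scheme and default_port are illustrative only, and the default ports 80 and 443 are hard-coded here rather than configurable.

```c
#include <stdint.h>

/* Hypothetical, simplified types for illustration. */
enum sketch_scheme { SKETCH_HTTP, SKETCH_HTTPS };

struct sketch_iri {
    enum sketch_scheme scheme;
    uint16_t port;
};

static uint16_t default_port(enum sketch_scheme s)
{
    return s == SKETCH_HTTPS ? 443 : 80;
}

/* Replaces the scheme and returns the previous one. The port follows the
 * new scheme only if it was the default port of the old scheme. */
enum sketch_scheme sketch_set_scheme(struct sketch_iri *iri, enum sketch_scheme scheme)
{
    enum sketch_scheme old = iri->scheme;

    if (iri->port == default_port(old))
        iri->port = default_port(scheme);
    iri->scheme = scheme;

    return old;
}
```

Switching an IRI on port 80 from HTTP to HTTPS moves it to port 443, while an IRI on port 8080 keeps its port.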