API Reference¶

Return the form, data, and submit button to submit an HTML form.

html is the source HTML response, where the form to submit will be found.

form_type is one of the supported form types. The returned form is the one with the highest probability of being of that type, provided that this probability is at least min_proba. If no form matches, ValueError is raised.

Note

A probability is always a float in the [0, 1] range.

fields is a dictionary of key-value pairs of data to submit with the form, where keys are supported field types instead of actual form field names.

The resulting tuple contains:

The matching form.
A dictionary of data to submit with the form. It is the content of fields, with keys replaced by their corresponding form field names. Missing fields are silently dropped. When multiple field names matching a given field type are found, the field name with the highest probability is used.
The submit button of the form, or None if no submit button was found. If multiple submit buttons are found, the one with the highest probability is returned.

You can use the form2request library to turn the result into an HTTP request:

>>> form, data, submit_button = build_submission(html, "search", {"search query": "foo"})  
>>> request_data = form2request(form, data, click=submit_button)  
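The selection rules described above (highest-probability form at or above min_proba, highest-probability field name per field type, unmatched fields dropped) can be sketched on plain dictionaries. This is not the library's implementation: the pick_form and map_fields helpers and the sample candidates data are hypothetical, the input only imitates the shape of classify_proba() results, and submit button selection is left out.

```python
def pick_form(candidates, form_type, min_proba):
    """Return the candidate with the highest probability for form_type."""
    best = max(candidates, key=lambda c: c["form"].get(form_type, 0.0), default=None)
    if best is None or best["form"].get(form_type, 0.0) < min_proba:
        raise ValueError(f"no {form_type!r} form found")
    return best


def map_fields(candidate, fields):
    """Replace field-type keys with the best-matching form field names."""
    data = {}
    for field_type, value in fields.items():
        # Probability of this field type for every field name in the form.
        scores = {
            name: probs[field_type]
            for name, probs in candidate["fields"].items()
            if field_type in probs
        }
        if scores:  # field types without a match are silently dropped
            data[max(scores, key=scores.get)] = value
    return data


# Hypothetical data shaped like classify_proba() output, one dict per form.
candidates = [
    {"form": {"search": 0.9, "login": 0.1},
     "fields": {"q": {"search query": 0.95}, "q2": {"search query": 0.4}}},
    {"form": {"login": 0.8, "search": 0.2},
     "fields": {"user": {"username": 0.9}}},
]

form = pick_form(candidates, "search", min_proba=0.5)
data = map_fields(form, {"search query": "foo", "password": "secret"})
print(data)  # {'q': 'foo'}
```

The "password" entry disappears because the chosen form has no field of that type, which mirrors the "silently dropped" behavior described for missing fields.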

Classifiers¶

formasaurus.classifiers.extract_forms(tree_or_html: HtmlElement | str | bytes, proba: bool = False, threshold: float = 0.05, fields: bool = True)[source]¶

Return a list of (form_elem, form_info) tuples, one tuple for each form found on tree_or_html.

Each form_info is a dict with the result of a classify() or classify_proba() call, depending on proba.

tree_or_html is the HTML document from which form data should be extracted, either an lxml tree or HTML source code as a string or bytes.

proba determines whether form_info values in the result include probability data (True) or not (False, default).

threshold is the minimum probability, in the [0, 1] range, for data to be included in the result.

fields determines whether field type data is computed and included into the form_info values in the result (True, default) or not (False).

formasaurus.classifiers.classify(form, fields=True)[source]¶

Return {'form': 'type', 'fields': {'name': 'type', ...}} dict with form type and types of its visible submittable fields.

If fields argument is False, only information about form type is returned: {'form': 'type'}.

formasaurus.classifiers.classify_proba(form, threshold=0.0, fields=True)[source]¶

Return dict with probabilities of form and its fields belonging to various form and field classes:

{
    'form': {'type1': prob1, 'type2': prob2, ...},
    'fields': {
        'name': {'type1': prob1, 'type2': prob2, ...},
        ...
    }
}

form should be an lxml HTML <form> element. Only classes with probability >= threshold are preserved.

If fields is False, only information about the form is returned:

{
    'form': {'type1': prob1, 'type2': prob2, ...}
}
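To relate the two result shapes, here is a minimal sketch that collapses a classify_proba()-style dict into a classify()-style dict by keeping the most probable class. The most_likely helper and the sample probabilities are hypothetical, and whether classify() is computed this way internally is an assumption.

```python
def most_likely(proba_result):
    """Collapse {'type': probability} dicts into the single most likely type."""
    form_probs = proba_result["form"]
    result = {"form": max(form_probs, key=form_probs.get)}
    if "fields" in proba_result:
        result["fields"] = {
            name: max(probs, key=probs.get)
            for name, probs in proba_result["fields"].items()
            if probs  # a high threshold may leave no classes for a field
        }
    return result


# Hypothetical probabilities, for illustration only.
proba_result = {
    "form": {"login": 0.85, "registration": 0.15},
    "fields": {
        "user": {"username": 0.7, "email": 0.3},
        "pass": {"password": 0.99, "other": 0.01},
    },
}
summary = most_likely(proba_result)
print(summary)
# {'form': 'login', 'fields': {'user': 'username', 'pass': 'password'}}
```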

class formasaurus.classifiers.FormFieldClassifier(form_classifier=None, field_model=None)[source]¶

FormFieldClassifier detects HTML form and field types.

classmethod load(filename=None, autocreate=True, rebuild=False)[source]¶

Load extractor from file filename.

If the file is missing and the autocreate option is True (default), the model is created using default parameters and training data. If filename is None, the default model file name is used.

Example - load the default extractor:

ffc = FormFieldClassifier.load()

classmethod trained_on(data_folder)[source]¶: Return a FormFieldClassifier trained on data from data_folder.

train(annotations)[source]¶: Train the FormFieldClassifier on a list of FormAnnotation objects.

classify(form, fields=True)[source]¶

Return {'form': 'type', 'fields': {'name': 'type', ...}} dict with form type and types of its visible submittable fields.

If fields argument is False, only information about form type is returned: {'form': 'type'}.

classify_proba(form, threshold=0.0, fields=True)[source]¶

Return dict with probabilities of form and its fields belonging to various form and field classes:

{
    'form': {'type1': prob1, 'type2': prob2, ...},
    'fields': {
        'name': {'type1': prob1, 'type2': prob2, ...},
        ...
    }
}

form should be an lxml HTML <form> element. Only classes with probability >= threshold are preserved.

If fields is False, only information about the form is returned:

{
    'form': {'type1': prob1, 'type2': prob2, ...}
}

extract_forms(tree_or_html, proba=False, threshold=0.05, fields=True)[source]¶

Given an lxml tree or HTML source code, return a list of (form_elem, form_info) tuples.

form_info dicts contain the results of classify() or classify_proba() calls, depending on the proba parameter.

When fields is False, field type information is not computed.

property form_classes¶: Possible form classes

property field_classes¶: Possible field classes

class formasaurus.classifiers.FormClassifier(form_model=None, full_type_names=True)[source]¶

Convenience wrapper for scikit-learn based form type detection model.

classify(form)[source]¶: Return form class. form should be an lxml HTML <form> element.

classify_proba(form, threshold=0.0)[source]¶: Return form class probabilities; only classes with probability >= threshold are kept. form should be an lxml HTML <form> element.

train(annotations)[source]¶: Train formtype_model on a list of FormAnnotation objects.

extract_forms(tree_or_html, proba=False, threshold=0.05)[source]¶: Given an lxml tree or HTML source code, return a list of (form_elem, form_info) tuples. form_info dicts contain the results of classify() or classify_proba() calls, depending on the proba parameter.

formasaurus.classifiers.get_instance()[source]¶: Return a shared FormFieldClassifier instance

Field Type Detection¶

The field type detection model has two stages:

First, we train the Formasaurus form type detector.
Second, we use the form type detector's results to improve the quality of field type detection.

Correct form types are available directly in the training data, but in practice the form type detector will make mistakes at prediction time. So we have two options:

(1) Use correct form labels in training, rely on noisy form labels at test/prediction time.
(2) Use noisy (predicted) labels both at train and test time.

Strategy (2) leads to more regularized models which account for form type detector mistakes; strategy (1) uses more information.

On a held-out dataset, strategy (1) appears to produce better results.

We need noisy form type labels anyway, to check prediction quality. To get these ‘realistic’ noisy form type labels, we split the data into 10 folds, and then for each fold we predict its labels using a form type detector trained on the remaining 9 folds.
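The fold-based labelling can be sketched with the standard library alone. Everything here is illustrative: fit_majority is a toy stand-in for the form type detector, realistic_labels is a hypothetical helper, and the interleaved fold assignment is an assumption rather than the split the library uses.

```python
from collections import Counter


def fit_majority(train_items, train_labels):
    """Toy stand-in for a form type detector: always predicts the majority label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda item: majority


def realistic_labels(items, labels, fit, n_splits=10):
    """Predict every item's label with a model that never saw that item."""
    predicted = [None] * len(items)
    for fold in range(n_splits):
        # Interleaved folds: item i belongs to fold i % n_splits.
        train = [i for i in range(len(items)) if i % n_splits != fold]
        model = fit([items[i] for i in train], [labels[i] for i in train])
        for i in range(fold, len(items), n_splits):
            predicted[i] = model(items[i])
    return predicted


# Hypothetical data: 15 login forms followed by 5 search forms.
items = [f"form-{i}" for i in range(20)]
labels = ["login"] * 15 + ["search"] * 5
noisy = realistic_labels(items, labels, fit_majority)
mistakes = sum(a != b for a, b in zip(noisy, labels))
print(mistakes)  # 5: the toy model mislabels every search form
```

Because each label comes from a model trained without that item, the mistakes resemble those the detector would make on unseen pages.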

formasaurus.fieldtype_model.tokenize(string, pos=0, endpos=9223372036854775807)¶: Return a list of all non-overlapping matches of pattern in string.

formasaurus.fieldtype_model.scorer = make_scorer(flat_f1_score, response_method='predict', average=micro)¶: Default scorer for grid search. We’re optimizing for micro-averaged F1.

formasaurus.fieldtype_model.get_Xy(annotations, form_types, full_type_names=False)[source]¶: Return training data for field type detection.

formasaurus.fieldtype_model.get_form_features(form, form_type, field_elems=None)[source]¶: Return a list of feature dicts, a dict per visible submittable field in a <form> element.

formasaurus.fieldtype_model.get_model(use_precise_form_types=True)[source]¶: Return default CRF model

formasaurus.fieldtype_model.print_classification_report(annotations, n_splits=10, model=None)[source]¶: Evaluate model, print classification report

Form Type Detection¶

This module defines which features and which classifier the default form type detection model uses.

formasaurus.formtype_model.get_model(prob=True)[source]¶: Return a default model.

formasaurus.formtype_model.train(annotations, model=None, full_type_names=False)[source]¶: Train form type detection model on annotation data

formasaurus.formtype_model.get_realistic_form_labels(annotations, n_splits=10, model=None, full_type_names=True)[source]¶: Return form type labels which form type detection model is likely to produce.

formasaurus.formtype_model.print_classification_report(annotations, n_splits=10, model=None)[source]¶: Evaluate model, print classification report

This module provides scikit-learn transformers for extracting features from HTML forms.

For all features X is a list of lxml <form> elements.
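As a rough illustration of the transformer protocol, the sketch below implements fit and transform for a feature similar to the input-name feature listed further down. It is an assumption-heavy stand-in: InputNamesSketch is hypothetical, it does not subclass any scikit-learn or Formasaurus class, and xml.etree.ElementTree elements replace lxml <form> elements so that the example runs with the standard library only.

```python
import xml.etree.ElementTree as ET


class InputNamesSketch:
    """Join the names of non-hidden <input> elements into one string per form."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # X is a list of form elements; the result has one string per form.
        return [
            " ".join(
                el.get("name")
                for el in form.iter("input")
                if el.get("name") and el.get("type", "text").lower() != "hidden"
            )
            for form in X
        ]


forms = [
    ET.fromstring('<form><input name="q"/><input type="hidden" name="t"/></form>'),
    ET.fromstring('<form><input name="user"/><input type="password" name="pass"/></form>'),
]
features = InputNamesSketch().fit(forms).transform(forms)
print(features)  # ['q', 'user pass']
```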

class formasaurus.formtype_features.BaseFormFeatureExtractor[source]¶

class formasaurus.formtype_features.FormElements[source]¶: Features based on form HTML elements: counts of elements of different types, GET/POST form method.

class formasaurus.formtype_features.Bias[source]¶: The same as clf.intercept_, but with regularization applied. Used mostly for debugging.

class formasaurus.formtype_features.FormText[source]¶: Text contents inside the form.

class formasaurus.formtype_features.FormInputNames[source]¶: Names of all non-hidden <input> elements, joined to a single string.

class formasaurus.formtype_features.FormInputHiddenNames[source]¶: Names of all <input type=hidden> elements, joined to a single string.

class formasaurus.formtype_features.FormLinksText[source]¶: Text of all links inside the form. It is helpful because e.g. registration links inside login forms are common.

class formasaurus.formtype_features.SubmitText[source]¶: Text of all <submit> buttons, joined to a single string.

class formasaurus.formtype_features.FormUrl[source]¶: <form action> value

class formasaurus.formtype_features.FormCss[source]¶: Form CSS classes and ID

class formasaurus.formtype_features.FormInputTitle[source]¶: <input title=…> values

class formasaurus.formtype_features.FormLabelText[source]¶: <label> values

class formasaurus.formtype_features.FormInputCss[source]¶: CSS classes and IDs of <input> elements

class formasaurus.formtype_features.OldLoginformFeatures[source]¶: Features that loginform library used.

formasaurus.formtype_features.loginform_features(form)[source]¶: A dict with features from loginform library

Working with Training Data¶

A module for working with annotation data storage.

class formasaurus.storage.Storage(folder)[source]¶

A wrapper class for HTML forms annotation data storage. The goal is to store the type of each <form> element from a web page. The data is stored in a folder with the following structure:

config.json
index.json
html/
    example.org-0.html
    example.org-1.html
    foo.example.org-0.html
    ...

The html folder contains the raw contents of the webpages. The index.json file contains a JSON dict with the following records:

"RELATIVE-PATH-TO-HTML-FILE": {
    "url": "URL",
    "forms": ["type1", "type2", ...],
    "visible_html_fields": [
        {"name1": "type1", "name2": "type2", ...},
        ...
    ]
}

The key is the relative path to a file with page contents (e.g. “html/example.org-1.html”). Values:

“url” is the URL the webpage was downloaded from.
“forms” contains an array of form type identifiers. There must be one identifier for each <form> element on the web page.
“visible_html_fields” contains an array of objects, one object per <form> element; each object is a mapping from field name to field type identifier.

Possible form and field types are stored in the config.json file; you can read them using get_form_schema() and get_field_schema().
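The index layout implies one consistency rule: each page needs as many field mappings as it has form types. A minimal sketch of such a check follows; check_entry is a hypothetical helper (not the Storage.check() implementation), and the short type identifiers in the sample index are made up.

```python
import json

# Hypothetical index.json contents following the structure described above.
index_json = """
{
  "html/example.org-0.html": {
    "url": "http://example.org/",
    "forms": ["s", "l"],
    "visible_html_fields": [
      {"q": "qq"},
      {"user": "us", "pass": "p1"}
    ]
  }
}
"""


def check_entry(path, info):
    """Return a list of problems found in one index entry."""
    problems = []
    for key in ("url", "forms", "visible_html_fields"):
        if key not in info:
            problems.append(f"{path}: missing {key!r}")
    forms = info.get("forms", [])
    fields = info.get("visible_html_fields", [])
    if len(forms) != len(fields):
        problems.append(f"{path}: {len(forms)} forms but {len(fields)} field mappings")
    return problems


index = json.loads(index_json)
problems = [p for path, info in index.items() for p in check_entry(path, info)]
print(problems)  # []
```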

initialize(config, index=None)[source]¶: Create folders and files for a new storage

get_index()[source]¶: Read an index

write_index(index)[source]¶: Save an index

get_config()[source]¶: Read meta information, including form and field types

get_field_schema()[source]¶: Return AnnotationSchema instance. r.types is an OrderedDict with field type names {full_name: short_name}; r.types_inv is a {short_name: full_name} dict; r.na_value is a short name of type name used for unannotated fields.

get_form_schema()[source]¶: Return AnnotationSchema instance. r.types is an OrderedDict with form type names {full_name: short_name}; r.types_inv is a {short_name: full_name} dict; r.na_value is a short name of type name used for unannotated forms; r.skip_value is a short name of a type name which should be skipped.

add_result(html, url, form_answers=None, visible_html_fields=None, index=None, add_empty=True)[source]¶: Save HTML source and its <form> and form field types.

iter_annotations(index=None, drop_duplicates=True, drop_na=True, drop_skipped=True, simplify_form_types=False, simplify_field_types=False, verbose=False, leave=False)[source]¶: Return an iterator over FormAnnotation objects.

iter_trees(index=None)[source]¶: Return an iterator over (filename, tree, info) tuples where filename is a relative file name, tree is an lxml tree and info is a dictionary with annotation data.

get_tree(path, info=None)[source]¶: Load a single tree. path is a relative path to a file (key in index.json file), info is annotation data (value in index.json file).

check(verbose=True)[source]¶: Check that items in storage are correct; print the problems found. Return the number of errors found.

get_fingerprint(form)[source]¶: Return form fingerprint (a string that can be used for deduplication).

get_form_type_counts(drop_duplicates=True, drop_na=True, simplify=False, verbose=True)[source]¶: Return a {formtype: count} collections.Counter

print_form_type_counts(simplify=False, verbose=True)[source]¶: Print the number of annotations for each form type in this storage

generate_filename(url)[source]¶: Return a name for a new file

class formasaurus.annotation.AnnotationSchema(types, types_inv, na_value, skip_value, simplify_map)¶

na_value¶: Alias for field number 2

simplify_map¶: Alias for field number 4

skip_value¶: Alias for field number 3

types¶: Alias for field number 0

types_inv¶: Alias for field number 1
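The "Alias for field number N" descriptions are the standard wording for named tuple fields, so each attribute is also reachable by position. A small sketch with a stand-in type that uses the same field order (SchemaSketch and all values below are hypothetical):

```python
from collections import OrderedDict, namedtuple

# Stand-in with the same field order as the documented signature.
SchemaSketch = namedtuple(
    "SchemaSketch", ["types", "types_inv", "na_value", "skip_value", "simplify_map"]
)

types = OrderedDict([("search", "s"), ("login", "l")])
schema = SchemaSketch(
    types=types,
    types_inv={short: full for full, short in types.items()},
    na_value="X",
    skip_value="-",
    simplify_map={},
)

print(schema[2] == schema.na_value)  # True: na_value is field number 2
print(schema.types_inv["l"])         # login
```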

class formasaurus.annotation.FormAnnotation(form, type, index, info, key, form_schema, field_schema)[source]¶

Annotated HTML form

property fields¶: {“field name”: “field type”} dict.

property fields_annotated¶: True if form has fields and all fields are annotated.

property fields_partially_annotated¶: True when some fields are annotated and some are not annotated.

property field_elems¶: Return a list of lxml Elements for fields which are annotated. Fields are returned in the order they appear in the form; only visible submittable fields are considered.

property field_types¶: A list of field types, in the order they appear in the form. Only visible submittable fields are considered.

property field_types_full¶: A list of long field type names, in the order they appear in the form. Only visible submittable fields are considered.

property type_full¶: Full form type name

HTML Processing Utilities¶

HTML processing utilities

formasaurus.html.remove_by_xpath(tree, xpath)[source]¶: Remove all HTML elements which match a given XPath expression.

formasaurus.html.load_html(tree_or_html, base_url=None)[source]¶

Parse HTML data to an lxml tree. tree_or_html must be either unicode or UTF-8 encoded (even if the original page declares a different encoding).

If tree_or_html is not a string then it is returned as-is.

formasaurus.html.get_cleaned_form_html(form, human_readable=True)[source]¶: Return a cleaned up version of <form> HTML contents. If human_readable is True, HTML is cleaned to make the source code more readable for humans; otherwise it is cleaned to make the form safer to render.

formasaurus.html.get_field_names(elems)[source]¶: Return unique name attributes

formasaurus.html.get_visible_fields(form)[source]¶: Return visible form fields (the ones users should fill).

formasaurus.html.get_fields_to_annotate(form)[source]¶

Return fields which should be annotated:

they should be visible to user, and
they should have non-empty name (i.e. affect form submission result).
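The two conditions can be approximated with the standard library. This is a simplified sketch, not the library's logic: FieldCollector is hypothetical, "visible" is reduced to "not an <input type=hidden>", and the set of tags treated as fields is an assumption.

```python
from html.parser import HTMLParser


class FieldCollector(HTMLParser):
    """Collect unique names of non-hidden fields that have a non-empty name."""

    FIELD_TAGS = {"input", "select", "textarea", "button"}  # assumed field tags

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.FIELD_TAGS:
            return
        attrs = dict(attrs)
        input_type = (attrs.get("type") or "text").lower()
        if tag == "input" and input_type == "hidden":
            return  # not visible to the user
        name = attrs.get("name")
        if name and name not in self.names:
            self.names.append(name)  # unnamed fields do not affect submission


html = """
<form action="/search">
  <input type="hidden" name="csrf" value="x">
  <input type="text" name="q">
  <input type="text">
  <input type="submit" name="go" value="Search">
</form>
"""

collector = FieldCollector()
collector.feed(html)
print(collector.names)  # ['q', 'go']
```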

formasaurus.html.escaped_with_field_highlighted(form_html, field_name)[source]¶: Return escaped HTML source code suitable for displaying; fields with name==field_name are highlighted.

formasaurus.html.highlight_fields(html, field_name)[source]¶: Return HTML source code with all fields with name==field_name highlighted by adding formasaurus-field-highlighted CSS class.

formasaurus.html.add_text_after(elem, text)[source]¶: Add text after elem

formasaurus.html.add_text_before(elem, text)[source]¶: Add text before elem

formasaurus.html.get_text_around_elems(tree, elems)[source]¶: Return (before, after) tuple with {elem: text} dicts containing text before a specified lxml DOM Element and after it.

formasaurus.formhash.get_form_hash(form, only_visible=True)[source]¶

Return a string which is the same for duplicate forms, but different for forms which are not the same.

If only_visible is True, hidden fields are not taken into account.

Other Utilities¶

formasaurus.utils.dependencies_string()[source]¶

Return a string with versions of formasaurus, numpy, scipy and scikit-learn.

Saved scikit-learn models may not be compatible between different numpy/scipy/scikit-learn versions; a string returned by this function can be used as a part of a file name.

formasaurus.utils.add_scheme_if_missing(url)[source]¶

>>> add_scheme_if_missing("example.org")
'http://example.org'
>>> add_scheme_if_missing("https://example.org")
'https://example.org'

formasaurus.utils.get_domain(url)[source]¶

>>> get_domain('example.org')
'example'
>>> get_domain('foo.example.co.uk')
'example'

formasaurus.utils.inverse_mapping(dct)[source]¶

Return reverse mapping:

>>> inverse_mapping({'x': 5})
{5: 'x'}

formasaurus.utils.at_root(*args)[source]¶: Return path relative to formasaurus source code

formasaurus.utils.thresholded(dct, threshold)[source]¶

Return dict dct without all values less than threshold.

>>> thresholded({'foo': 0.5, 'bar': 0.1}, 0.5)
{'foo': 0.5}

>>> thresholded({'foo': 0.5, 'bar': 0.1, 'baz': 1.0}, 0.6)
{'baz': 1.0}

>>> dct = {'foo': 0.5, 'bar': 0.1, 'baz': 1.0, 'spam': 0.0}
>>> thresholded(dct, 0.0) == dct
True
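The doctests above imply that the comparison is inclusive: a value equal to threshold is kept. A one-line sketch consistent with them (thresholded_sketch is a hypothetical re-implementation, not the library source):

```python
def thresholded_sketch(dct, threshold):
    """Keep only the items whose value is at least threshold."""
    return {key: value for key, value in dct.items() if value >= threshold}


probs = {"login": 0.5, "search": 0.1, "other": 0.0}
print(thresholded_sketch(probs, 0.5))  # {'login': 0.5}
```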

formasaurus.utils.download(url)[source]¶: Download a web page from url, return its content as unicode.

formasaurus.utils.response2unicode(resp)[source]¶: Convert requests.Response body to unicode. Unlike response.text it handles <meta> tags in response content.

formasaurus.text.tokenize(string, pos=0, endpos=9223372036854775807)¶: Tokenize text

formasaurus.text.ngrams(seq, min_n, max_n)[source]¶: Return min_n to max_n n-grams of elements from a given sequence.

formasaurus.text.token_ngrams(tokens, min_n, max_n)[source]¶: Return n-grams given a list of tokens.
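For readers unfamiliar with n-gram ranges, here is a minimal sketch of producing all n-grams with lengths from min_n to max_n. The ngrams_sketch helper is hypothetical, and the order and container type of the results returned by the library functions may differ.

```python
def ngrams_sketch(seq, min_n, max_n):
    """Return all contiguous slices of seq with length min_n..max_n."""
    result = []
    for n in range(min_n, max_n + 1):
        # No slices are produced when n exceeds len(seq).
        for start in range(len(seq) - n + 1):
            result.append(seq[start:start + n])
    return result


tokens = ["log", "in", "now"]
grams = ngrams_sketch(tokens, 1, 2)
print(grams)
# [['log'], ['in'], ['now'], ['log', 'in'], ['in', 'now']]
```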

formasaurus.text.normalize_whitespaces(text)[source]¶: Replace newlines and runs of whitespace with a single space

formasaurus.text.normalize(text)[source]¶: Default text normalization function

formasaurus.text.number_pattern(text, ratio=0.3)[source]¶: Replace digits with X and letters with C if text contains > ratio of digits; return empty string otherwise.
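The description of number_pattern can be made concrete with a small sketch. It follows the wording above (strictly more than ratio of the characters must be digits); number_pattern_sketch is a hypothetical re-implementation, and its handling of empty input and of non-alphanumeric characters is an assumption.

```python
def number_pattern_sketch(text, ratio=0.3):
    """Mask digits as X and letters as C when text is digit-heavy enough."""
    if not text:
        return ""  # assumption: empty input yields an empty pattern
    digits = sum(ch.isdigit() for ch in text)
    if digits / len(text) <= ratio:
        return ""
    # Punctuation and spaces are kept as-is (assumption).
    return "".join(
        "X" if ch.isdigit() else "C" if ch.isalpha() else ch
        for ch in text
    )


print(number_pattern_sketch("2024-05-17"))  # XXXX-XX-XX
print(repr(number_pattern_sketch("room 5")))  # ''
```

Patterns of this kind let a model see the shape of a value, such as a date or a phone number, without memorizing the digits themselves.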
