API Reference
Classifiers
-
formasaurus.classifiers.extract_forms(tree_or_html, proba=False, threshold=0.05, fields=True)
Given an lxml tree or HTML source code, return a list of (form_elem, form_info) tuples.
form_info dicts contain the results of classify() or classify_proba() calls, depending on the proba parameter.
When fields is False, field type information is not computed.
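A minimal sketch of how the return value described above might be consumed. To stay dependency-free it does not import Formasaurus: `sample_result` is a hand-written stand-in for what `extract_forms(html)` would return with `proba=False`, plain strings replace the lxml elements, and the type names used here are illustrative assumptions. `forms_of_type` is a hypothetical helper, not part of the library.

```python
# Hypothetical stand-in for the list returned by extract_forms(html):
# (form_elem, form_info) tuples. Real form_elem values are lxml elements;
# plain strings are used here so the sketch runs without dependencies.
sample_result = [
    ("<form #0>", {"form": "search", "fields": {"q": "search query"}}),
    ("<form #1>", {"form": "login",
                   "fields": {"user": "username", "pwd": "password"}}),
]


def forms_of_type(result, form_type):
    """Return the (form_elem, form_info) tuples whose form type matches."""
    return [(elem, info) for elem, info in result if info["form"] == form_type]


login_forms = forms_of_type(sample_result, "login")
for elem, info in login_forms:
    # Print each matching form with the names of its classified fields.
    print(elem, sorted(info["fields"]))
```

With `fields=False` the `'fields'` key would be absent, so code like the loop above should only read it when field classification was requested.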
-
formasaurus.classifiers.classify(form, fields=True)
Return a {'form': 'type', 'fields': {'name': 'type', ...}} dict with the form type and the types of its visible submittable fields.
If the fields argument is False, only information about the form type is returned: {'form': 'type'}.
-
formasaurus.classifiers.classify_proba(form, threshold=0.0, fields=True)
Return a dict with the probabilities of form and its fields belonging to various form and field classes:
    {
        'form': {'type1': prob1, 'type2': prob2, ...},
        'fields': {
            'name': {'type1': prob1, 'type2': prob2, ...},
            ...
        }
    }
form should be an lxml HTML <form> element. Only classes with probability >= threshold are preserved.
If fields is False, only information about the form is returned:
    {
        'form': {'type1': prob1, 'type2': prob2, ...}
    }
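The thresholding rule above (keep only classes with probability >= threshold) can be sketched on a dict of the documented shape. This is an illustration, not the library's code: `apply_threshold` is a hypothetical helper and the probabilities are made up.

```python
def apply_threshold(proba_info, threshold):
    """Keep only classes whose probability is >= threshold."""
    def keep(probs):
        return {cls: p for cls, p in probs.items() if p >= threshold}

    result = {"form": keep(proba_info["form"])}
    # 'fields' is only present when field classification was requested.
    if "fields" in proba_info:
        result["fields"] = {
            name: keep(probs) for name, probs in proba_info["fields"].items()
        }
    return result


# Hand-written probabilities in the documented shape (illustrative values).
sample = {
    "form": {"login": 0.9, "registration": 0.08, "search": 0.02},
    "fields": {"user": {"username": 0.7, "email": 0.3}},
}
filtered = apply_threshold(sample, 0.05)
print(filtered)
```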
-
class formasaurus.classifiers.FormFieldClassifier(form_classifier=None, field_model=None)
FormFieldClassifier detects HTML form and field types.
-
classmethod load(filename=None, autocreate=True, rebuild=False)
Load extractor from file filename.
If the file is missing and the autocreate option is True (default), the model is created using default parameters and training data. If filename is None, the default model file name is used.
Example - load the default extractor:
    ffc = FormFieldClassifier.load()
-
classmethod trained_on(data_folder)
Return a Formasaurus object trained on data from data_folder.
-
classify(form, fields=True)
Return a {'form': 'type', 'fields': {'name': 'type', ...}} dict with the form type and the types of its visible submittable fields.
If the fields argument is False, only information about the form type is returned: {'form': 'type'}.
-
classify_proba(form, threshold=0.0, fields=True)
Return a dict with the probabilities of form and its fields belonging to various form and field classes:
    {
        'form': {'type1': prob1, 'type2': prob2, ...},
        'fields': {
            'name': {'type1': prob1, 'type2': prob2, ...},
            ...
        }
    }
form should be an lxml HTML <form> element. Only classes with probability >= threshold are preserved.
If fields is False, only information about the form is returned:
    {
        'form': {'type1': prob1, 'type2': prob2, ...}
    }
-
extract_forms(tree_or_html, proba=False, threshold=0.05, fields=True)
Given an lxml tree or HTML source code, return a list of (form_elem, form_info) tuples.
form_info dicts contain the results of classify() or classify_proba() calls, depending on the proba parameter.
When fields is False, field type information is not computed.
-
form_classes
Possible form classes.
-
field_classes
Possible field classes.
-
class formasaurus.classifiers.FormClassifier(form_model=None, full_type_names=True)
Convenience wrapper for a scikit-learn based form type detection model.
-
classify_proba(form, threshold=0.0)
Return form class. form should be an lxml HTML <form> element.
-
extract_forms(tree_or_html, proba=False, threshold=0.05)
Given an lxml tree or HTML source code, return a list of (form_elem, form_info) tuples.
form_info dicts contain the results of classify() or classify_proba() calls, depending on the proba parameter.
-
Field Type Detection
The field type detection model is two-stage:
- First, we train the Formasaurus form type detector.
- Second, we use the form type detector's results to improve the quality of field type detection.
We have correct form types available directly in the training data, but in reality the form type detector will make mistakes at prediction time. So we have two options:
1. Use correct form labels in training, and rely on noisy form labels at test/prediction time.
2. Use noisy (predicted) labels at both train and test time.
Strategy (2) leads to more regularized models which account for form type detector mistakes; strategy (1) uses more information.
Based on a held-out dataset, it looks like (1) produces better results.
We need noisy form type labels anyway, to check prediction quality. To get these ‘realistic’ noisy form type labels we split the data into 10 folds, and then for each fold we predict its labels using a form type detector trained on the remaining 9 folds.
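The out-of-fold procedure described above can be sketched as follows. This is an illustration under loud assumptions: the "model" is a toy majority-class predictor standing in for the real form type detector, and folds are assigned by a simple modulo split rather than by whatever fold logic the library uses. `majority_label` and `out_of_fold_predictions` are hypothetical names.

```python
from collections import Counter


def majority_label(labels):
    """Toy 'model': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def out_of_fold_predictions(labels, n_folds=10):
    """Predict each item's label with a model that never saw that item."""
    predictions = [None] * len(labels)
    for fold in range(n_folds):
        test_idx = [i for i in range(len(labels)) if i % n_folds == fold]
        train = [labels[i] for i in range(len(labels)) if i % n_folds != fold]
        if not test_idx or not train:
            continue
        # "Train" on the other folds, then label the held-out fold.
        predicted = majority_label(train)
        for i in test_idx:
            predictions[i] = predicted
    return predictions


true_labels = ["login"] * 12 + ["search"] * 8
noisy_labels = out_of_fold_predictions(true_labels, n_folds=10)
mistakes = sum(t != p for t, p in zip(true_labels, noisy_labels))
print(mistakes, "of", len(true_labels), "labels differ from the annotations")
```

Every label is produced by a model that did not see the corresponding item, which is what makes the resulting labels resemble prediction-time noise.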
-
formasaurus.fieldtype_model.scorer
Default scorer for grid search. We’re optimizing for micro-averaged F1.
-
formasaurus.fieldtype_model.get_Xy(annotations, form_types, full_type_names=False)
Return training data for field type detection.
-
formasaurus.fieldtype_model.get_form_features(form, form_type, field_elems=None)
Return a list of feature dicts, a dict per visible submittable field in a <form> element.
-
formasaurus.fieldtype_model.get_model(use_precise_form_types=True)
Return the default CRF model.
-
formasaurus.fieldtype_model.print_classification_report(annotations, n_folds=10, model=None)
Evaluate the model and print a classification report.
-
formasaurus.fieldtype_model.tokenize()
findall(string[, pos[, endpos]]) -> list. Return a list of all non-overlapping matches of pattern in string.
Form Type Detection
This module defines which features and which classifier the default form type detection model uses.
-
formasaurus.formtype_model.train(annotations, model=None, full_type_names=False)
Train a form type detection model on annotation data.
-
formasaurus.formtype_model.get_realistic_form_labels(annotations, n_folds=10, model=None, full_type_names=True)
Return form type labels which the form type detection model is likely to produce.
-
formasaurus.formtype_model.print_classification_report(annotations, n_folds=10, model=None)
Evaluate the model and print a classification report.
This module provides scikit-learn transformers for extracting features from HTML forms.
For all features X is a list of lxml <form> elements.
-
class formasaurus.formtype_features.FormElements
Features based on form HTML elements: counts of elements of different types, GET/POST form method.
-
class formasaurus.formtype_features.Bias
The same as clf.intercept_, but with regularization applied. Used mostly for debugging.
-
class formasaurus.formtype_features.FormInputNames
Names of all non-hidden <input> elements, joined to a single string.
-
class formasaurus.formtype_features.FormInputHiddenNames
Names of all <input type=hidden> elements, joined to a single string.
-
class formasaurus.formtype_features.FormLinksText
Text of all links inside the form. It is helpful because e.g. registration links inside login forms are common.
-
class formasaurus.formtype_features.SubmitText
Text of all submit buttons, joined to a single string.
Working with Training Data
A module for working with annotation data storage.
-
class formasaurus.storage.Storage(folder)
A wrapper class for HTML forms annotation data storage. The goal is to store the type of each <form> element from a web page. The data is stored in a folder with the following structure:
    config.json
    index.json
    html/
        example.org-0.html
        example.org-1.html
        foo.example.org-0.html
        ...
The html folder contains the raw contents of the webpages. The index.json file contains a JSON dict with the following records:
    "RELATIVE-PATH-TO-HTML-FILE": {
        "url": "URL",
        "forms": ["type1", "type2", ...],
        "visible_html_fields": [
            {"name1": "type1", "name2": "type2", ...},
            ...
        ]
    }
The key is the relative path to a file with page contents (e.g. “html/example.org-1.html”). Values:
- “url” is the URL the webpage was downloaded from.
- “forms” contains an array of form type identifiers. There must be one identifier for each <form> element on a web page.
- “visible_html_fields” contains an array of objects, one object per <form> element; each object is a mapping from field name to field type identifier.
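A minimal sketch of reading one record in the layout described above, using only the standard library. The URL and the type identifiers in `index_json` are made up for illustration, and `check_record` is a hypothetical helper (unrelated to the library's own check() method) that verifies only the one-entry-per-form rule.

```python
import json

# Hypothetical index.json content following the documented record layout;
# the URL and the type identifiers are made up for illustration.
index_json = """
{
  "html/example.org-0.html": {
    "url": "http://example.org/",
    "forms": ["s", "l"],
    "visible_html_fields": [
      {"q": "qq"},
      {"user": "us", "pwd": "p1"}
    ]
  }
}
"""

index = json.loads(index_json)


def check_record(path, info):
    """Return a list of problems found in one index record."""
    problems = []
    # Both arrays hold one entry per <form> element, so lengths must match.
    if len(info["forms"]) != len(info["visible_html_fields"]):
        problems.append(path + ": forms and visible_html_fields differ in length")
    return problems


for path, info in index.items():
    # Pair each form type with the field annotations of the same form.
    for form_type, fields in zip(info["forms"], info["visible_html_fields"]):
        print(path, form_type, fields)
```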
Possible form and field types are stored in the config.json file; you can read them using get_form_types() and get_field_types().
-
get_field_schema()
Return an AnnotationSchema instance (r below). r.types is an OrderedDict with field type names {full_name: short_name}; r.types_inv is a {short_name: full_name} dict; r.na_value is the short name of the type used for unannotated fields.
-
get_form_schema()
Return an AnnotationSchema instance (r below). r.types is an OrderedDict with form type names {full_name: short_name}; r.types_inv is a {short_name: full_name} dict; r.na_value is the short name of the type used for unannotated forms; r.skip_value is the short name of the type which should be skipped.
-
add_result(html, url, form_answers=None, visible_html_fields=None, index=None, add_empty=True)
Save HTML source and its <form> and form field types.
-
iter_annotations(index=None, drop_duplicates=True, drop_na=True, drop_skipped=True, simplify_form_types=False, simplify_field_types=False, verbose=False, leave=False)
Return an iterator over FormAnnotation objects.
-
iter_trees(index=None)
Return an iterator over (filename, tree, info) tuples where filename is a relative file name, tree is an lxml tree and info is a dictionary with annotation data.
-
get_tree(path, info=None)
Load a single tree. path is a relative path to a file (key in the index.json file), info is annotation data (value in the index.json file).
-
check()
Check that items in storage are correct; print the problems found. Return the number of errors found.
-
get_fingerprint(form)
Return form fingerprint (a string that can be used for deduplication).
-
get_form_type_counts(drop_duplicates=True, drop_na=True, simplify=False, verbose=True)
Return a {formtype: count} collections.Counter.
-
class formasaurus.annotation.AnnotationSchema(types, types_inv, na_value, skip_value, simplify_map)
-
na_value
Alias for field number 2
-
simplify_map
Alias for field number 4
-
skip_value
Alias for field number 3
-
types
Alias for field number 0
-
types_inv
Alias for field number 1
-
-
class formasaurus.annotation.FormAnnotation
Annotated HTML form.
-
fields
{“field name”: “field type”} dict.
-
fields_annotated
True if the form has fields and all fields are annotated.
-
fields_partially_annotated
True when some fields are annotated and some are not.
-
field_elems
Return a list of lxml Elements for fields which are annotated. Fields are returned in the order they appear in the form; only visible submittable fields are considered.
-
field_types
A list of field types, in the order they appear in the form. Only visible submittable fields are considered.
-
field_types_full
A list of long field type names, in the order they appear in the form. Only visible submittable fields are considered.
-
type_full
Full form type name.
-
-
formasaurus.annotation.get_annotation_folds(annotations, n_folds)
Return an iterator over (train_indices, test_indices) folds. It is guaranteed that forms from the same website can’t be in both the train and test parts.
We must be careful when splitting the dataset into training and evaluation parts: forms from the same domain should be in the same “bin”. There could be several pages from the same domain, and these pages may have duplicate or similar forms (e.g. a search form on each page). If we put one such form in the training dataset and another in the evaluation dataset, the metrics will be too optimistic, and they can lead us to choose the wrong features/models. For example, train_test_split from scikit-learn shouldn’t be used here. To avoid this, LabelKFold from scikit-learn is used.
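The grouping constraint above can be illustrated with a small standard-library sketch. It is not the library's implementation (which, per the text, relies on scikit-learn's LabelKFold; newer scikit-learn releases name the equivalent class GroupKFold). `grouped_folds` and its greedy balancing strategy are assumptions made for illustration.

```python
from collections import defaultdict


def grouped_folds(domains, n_folds):
    """Yield (train_indices, test_indices); each domain lands in one fold."""
    by_domain = defaultdict(list)
    for idx, domain in enumerate(domains):
        by_domain[domain].append(idx)

    # Greedy balancing: put the largest domains first into the emptiest fold.
    folds = [[] for _ in range(n_folds)]
    for domain in sorted(by_domain, key=lambda d: (-len(by_domain[d]), d)):
        min(folds, key=len).extend(by_domain[domain])

    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(domains)) if i not in held_out]
        yield train, sorted(test)


domains = ["a.com", "a.com", "b.org", "c.net", "b.org", "a.com"]
splits = list(grouped_folds(domains, n_folds=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Because whole domains are assigned to folds, a near-duplicate search form from one site can never appear on both sides of a split.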
HTML Processing Utilities
HTML processing utilities
-
formasaurus.html.remove_by_xpath(tree, xpath)
Remove all HTML elements which match a given XPath expression.
-
formasaurus.html.load_html(tree_or_html, base_url=None)
Parse HTML data to an lxml tree. tree_or_html must be either unicode or utf8-encoded (even if the original page declares a different encoding).
If tree_or_html is not a string then it is returned as-is.
-
formasaurus.html.get_cleaned_form_html(form, human_readable=True)
Return a cleaned up version of <form> HTML contents. If human_readable is True, HTML is cleaned to make the source code more readable for humans; otherwise it is cleaned to make the form safer to render.
-
formasaurus.html.get_visible_fields(form)
Return visible form fields (the ones users should fill).
-
formasaurus.html.get_fields_to_annotate(form)
Return fields which should be annotated:
- they should be visible to the user, and
- they should have a non-empty name (i.e. affect the form submission result).
-
formasaurus.html.escaped_with_field_highlighted(form_html, field_name)
Return escaped HTML source code suitable for displaying; fields with name==field_name are highlighted.
-
formasaurus.html.highlight_fields(html, field_name)
Return HTML source code with all fields with name==field_name highlighted by adding the formasaurus-field-highlighted CSS class.
Other Utilities
-
formasaurus.utils.dependencies_string()
Return a string with versions of formasaurus, numpy, scipy and scikit-learn.
Saved scikit-learn models may not be compatible between different numpy/scipy/scikit-learn versions; a string returned by this function can be used as a part of a file name.
-
formasaurus.utils.add_scheme_if_missing(url)
    >>> add_scheme_if_missing("example.org")
    'http://example.org'
    >>> add_scheme_if_missing("https://example.org")
    'https://example.org'
-
formasaurus.utils.get_domain(url)
    >>> get_domain('example.org')
    'example'
    >>> get_domain('foo.example.co.uk')
    'example'
-
formasaurus.utils.inverse_mapping(dct)
Return reverse mapping:
    >>> inverse_mapping({'x': 5})
    {5: 'x'}
-
formasaurus.utils.thresholded(dct, threshold)
Return dict dct without all values less than threshold.
    >>> thresholded({'foo': 0.5, 'bar': 0.1}, 0.5)
    {'foo': 0.5}
    >>> thresholded({'foo': 0.5, 'bar': 0.1, 'baz': 1.0}, 0.6)
    {'baz': 1.0}
    >>> dct = {'foo': 0.5, 'bar': 0.1, 'baz': 1.0, 'spam': 0.0}
    >>> thresholded(dct, 0.0) == dct
    True
-
formasaurus.utils.download(url)
Download a web page from url, return its content as unicode.
-
formasaurus.utils.response2unicode(resp)
Convert requests.Response body to unicode. Unlike response.text it handles <meta> tags in response content.
-
formasaurus.text.tokenize()
Tokenize text.
-
formasaurus.text.ngrams(seq, min_n, max_n)
Return min_n to max_n n-grams of elements from a given sequence.
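One way to read the description of ngrams above, as a hedged sketch: collect every contiguous run of n elements for each n from min_n to max_n. `ngrams_sketch` is a hypothetical function; the ordering of the results and the use of lists for each n-gram are assumptions, and the library's actual return format may differ.

```python
def ngrams_sketch(seq, min_n, max_n):
    """Return all n-grams of seq for n in [min_n, max_n], shortest first."""
    seq = list(seq)
    result = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the sequence.
        for start in range(len(seq) - n + 1):
            result.append(seq[start:start + n])
    return result


tokens = ["sign", "in", "now"]
grams = ngrams_sketch(tokens, 1, 2)
print(grams)
```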
IPython Annotation Widgets
IPython widgets for data annotation.
-
formasaurus.widgets.MultiFormAnnotator(annotations, annotate_fields=True, annotate_types=True, save_func=None)
A widget with a paginator for annotating multiple forms.
-
formasaurus.widgets.FormAnnotator(ann, annotate_fields=True, annotate_types=True, max_fields=80)
Widget for annotating a single HTML form.
-
formasaurus.widgets.RawHtml(html, field_name=None, max_height=500, **kwargs)
Widget for displaying an HTML form, optionally with a field highlighted.
-
formasaurus.widgets.HtmlCode(form_html, field_name=None, max_height=None, **kwargs)
Show HTML source code, optionally with a field highlighted.