.. _how-it-works:

How It Works
============

Formasaurus uses two separate ML models: one for form type detection and
one for field type detection. The field type detector uses the form type
detection results to improve its quality.

The models are trained on 1000+ annotated web forms; see the `data`_ folder
in the Formasaurus repository. Most pages to annotate were selected randomly
from `Alexa Top 1M `_ websites.

Form Type Detection
-------------------

To detect HTML form types, Formasaurus takes a ``<form>`` element
and uses a linear classifier (`Logistic Regression`_) to choose its type
from a predefined set of types. Features include:

* counts of form elements of different types,
* whether a form is POST or GET,
* text on submit buttons,
* names and char ngrams of CSS classes and IDs,
* input labels,
* presence of certain substrings in URLs,
* etc.

See the `Form Type Detection.ipynb`_ IPython notebook for a more detailed
description.

.. _Logistic Regression: https://en.wikipedia.org/wiki/Logistic_regression
.. _Form Type Detection.ipynb: https://github.com/TeamHG-Memex/Formasaurus/blob/master/notebooks/Form%20Type%20Detection.ipynb
.. _data: https://github.com/TeamHG-Memex/Formasaurus/tree/master/formasaurus/data

Field Type Detection
--------------------

To detect form field types, Formasaurus uses a `Conditional Random Field`_
(CRF) model. The fields of an HTML form make up a sequence in which order
matters; a CRF can take that order into account.

Features include:

* form type predicted by a form type detector,
* field tag name,
* field value,
* text before and after field,
* field CSS class and ID,
* text of field