pattern
Pattern is a web mining module for the Python programming language.
It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, k-NN, SVM), and data visualization (graph networks).
The module is bundled with 30+ example scripts and 350+ unit tests.
Download
Pattern 2.5 | download (19MB)
Reference: De Smedt, T. & Daelemans, W. (2012).
Installation
Pattern is written for Python 2.5+ (no support for Python 3 yet). The module has no external dependencies except when using LSA in the vector module, which requires NumPy (installed by default on Mac OS X).
To install it so that the module is available in all your scripts, open a terminal and do:
> cd pattern-2.5
> python setup.py install
If you have pip, you can automatically download and install from the PyPI repository:
> pip install pattern
If none of the above works, you can make Python aware of the module in three ways:
- Put the pattern subfolder in the same folder as your script.
- Put the pattern subfolder in the standard location for modules so it is available to all scripts:
c:\python25\Lib\site-packages\ (Windows),
/Library/Python/2.5/site-packages/ (Mac OS X),
/usr/lib/python2.5/site-packages/ (Unix).
- Add the location of the module to sys.path in your script, before importing it:
>>> MODULE = '/users/tom/desktop/pattern'
>>> import sys
>>> if MODULE not in sys.path: sys.path.append(MODULE)
>>> from pattern.en import parse, Sentence
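The third option can be wrapped in a small helper so that repeated calls do not add the same folder twice. This is a minimal sketch, not part of Pattern: the function name add_module_path is hypothetical, and the folder is the example location used above.

```python
import os
import sys

def add_module_path(folder):
    """Append folder to sys.path once; return True if it was added."""
    folder = os.path.abspath(folder)
    if folder in sys.path:
        return False
    sys.path.append(folder)
    return True

# Example location from the text; replace it with the folder on your machine.
MODULE = '/users/tom/desktop/pattern'
add_module_path(MODULE)
```

Call the helper before the first `from pattern... import` statement in your script.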
Quick overview
pattern.web
Module pattern.web is a web toolkit that bundles various APIs (Google, Gmail, Bing, Twitter, Facebook, Wikipedia, Flickr) with a robust HTML parser and a crawler. The module's purpose is to retrieve online content in an easy-to-use, uniform way.
>>> from pattern.web import Twitter, plaintext
>>> for tweet in Twitter().search('"more important than"', cached=False):
>>>     print plaintext(tweet.description)

'The mobile web is more important than mobile apps.'
'Start slowly, direction is more important than speed.'
'Imagination is more important than knowledge. - Albert Einstein'
...
pattern.en
Module pattern.en is a natural language processing (NLP) toolkit for English. It is based on regular expressions, meaning that it is fast but on occasion also prone to incorrect results (see MBSP for a robust approach compatible with Pattern). It has functionality for word inflection (for example: verb conjugation and noun pluralization), a Python interface to the WordNet database, and a Brill-based shallow parser. A shallow parser analyzes a sentence and identifies the constituents (nouns, verbs, etc.).
>>> from pattern.en import parse, pprint
>>> s = 'The mobile web is more important than mobile apps.'
>>> s = parse(s, relations=True, lemmata=True)
>>> pprint(s)

WORD        TAG   CHUNK    ROLE   ID   PNP   LEMMA
The         DT    NP       SBJ    1    -     the
mobile      JJ    NP ^     SBJ    1    -     mobile
web         NN    NP ^     SBJ    1    -     web
is          VBZ   VP       -      1    -     be
more        RBR   ADJP     -      -    -     more
important   JJ    ADJP ^   -      -    -     important
than        IN    PP       -      -    PNP   than
mobile      JJ    NP       -      -    PNP   mobile
apps        NNS   NP ^     -      -    PNP   app
Note how the sentence has been annotated with various tags, discerning for example nouns (NN), adjectives (JJ), determiners (DT), verbs (VB), noun phrases (NP), sentence subject (SBJ), and a prepositional noun phrase (PNP). A parse tree is a Python structure of related objects of the parsed text:
>>> from pattern.en import parsetree
>>> s = 'The mobile web is more important than mobile apps.'
>>> t = parsetree(s)
>>> for chunk in t.sentences[0].chunks:
>>>     for word in chunk.words:
>>>         print word,
>>>     print

Word(u'The/DT') Word(u'mobile/JJ') Word(u'web/NN')
Word(u'is/VBZ')
Word(u'more/RBR') Word(u'important/JJ')
Word(u'than/IN')
Word(u'mobile/JJ') Word(u'apps/NNS')
Parsers for Spanish, German and Dutch are also available: pattern.es | pattern.de | pattern.nl.
pattern.search
Module pattern.search contains an elegant search algorithm to retrieve sequences of words (called n-grams) from a parsed sentence.
>>> from pattern.en import parsetree
>>> from pattern.search import search
>>> s = 'The mobile web is more important than mobile apps.'
>>> t = parsetree(s, relations=True, lemmata=True)
>>>
>>> for match in search('NP be (RB)+ important than NP', t):
>>>     print match.constituents()[-1], "=>", \
>>>           match.constituents()[0]

Chunk('mobile apps/NP') => Chunk('The mobile web/NP-SBJ-1')
Observe the given search pattern: "NP be (RB)+ important than NP". It means: any noun phrase followed by the verb to be (is, was, ...), followed by zero or more adverbs (e.g. much, more), followed by the words important than, followed by any noun phrase. It will match any of the following variations:
- "the mobile web will be much more important than mobile apps"
- "mobile apps are less important than the mobile web"
- "a good blog is more important than a fancy facebook page", etc.
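To make the matching behaviour concrete, here is a minimal sketch in plain Python. It is not how pattern.search is implemented. It assumes the sentence has already been reduced to a list of (text, labels) units, where the labels are simplified chunk tags, part-of-speech tags or lemmata chosen by hand for this illustration. The names match_at and find_match and the unit format are hypothetical.

```python
def match_at(units, start, pattern):
    """Return the end index if pattern matches units from start, else None."""
    if not pattern:
        return start
    label, zero_or_more = pattern[0]
    rest = pattern[1:]
    if zero_or_more:
        # Find the longest run of units carrying the label, then back off
        # one unit at a time until the rest of the pattern matches.
        end = start
        while end < len(units) and label in units[end][1]:
            end += 1
        for stop in range(end, start - 1, -1):
            result = match_at(units, stop, rest)
            if result is not None:
                return result
        return None
    if start < len(units) and label in units[start][1]:
        return match_at(units, start + 1, rest)
    return None

def find_match(units, pattern):
    """Return the first matching slice of units, or None."""
    for start in range(len(units)):
        end = match_at(units, start, pattern)
        if end is not None:
            return units[start:end]
    return None

# "NP be (RB)+ important than NP" as (label, zero_or_more) pairs.
PATTERN = [('NP', False), ('be', False), ('RB', True),
           ('important', False), ('than', False), ('NP', False)]

# Hand-labeled units; comparative adverb tags are simplified to 'RB'.
units = [
    ('a good blog', {'NP'}),
    ('is', {'VBZ', 'be'}),
    ('more', {'RB'}),
    ('important', {'JJ', 'important'}),
    ('than', {'IN', 'than'}),
    ('a fancy facebook page', {'NP'}),
]
match = find_match(units, PATTERN)
left, right = match[0][0], match[-1][0]
```

Here left and right hold the two noun phrases, which corresponds to taking the first and last constituents of a match in the example above.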
pattern.vector
>>> from pattern.web import Twitter
>>> from pattern.en import tag
>>> from pattern.vector import kNN, count
>>>
>>> knn = kNN()
>>>
>>> for i in range(1, 10):
>>>     for tweet in Twitter().search('#win OR #fail', start=i, count=100):
>>>         s = tweet.description.lower()
>>>         p = '#win' in s and 'WIN' or 'FAIL'
>>>         v = tag(s)
>>>         v = [word for word, pos in v if pos == 'JJ'] # JJ = adjective
>>>         v = count(v)
>>>         if len(v) > 0:
>>>             knn.train(v, type=p)
>>>
>>> print knn.classify('sweet')
>>> print knn.classify('stupid')

'WIN'
'FAIL'
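The idea behind the example can be sketched without Pattern: each training text becomes a word-count vector, and a new text gets the label of its nearest neighbours. The sketch below assumes cosine similarity as the measure of nearness, which is one of the metrics listed for the module. It is not Pattern's kNN class. The class TinyKNN and the two training examples are made up for illustration.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse {feature: count} vectors."""
    dot = sum(v1[f] * v2[f] for f in v1 if f in v2)
    norm1 = sqrt(sum(x * x for x in v1.values()))
    norm2 = sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

class TinyKNN(object):
    """Toy k-nearest-neighbours classifier over word-count vectors."""

    def __init__(self, k=1):
        self.k = k
        self.examples = []  # list of (vector, label) pairs

    def train(self, words, label):
        self.examples.append((Counter(words), label))

    def classify(self, words):
        v = Counter(words)
        # Rank training examples from most to least similar.
        ranked = sorted(self.examples,
                        key=lambda ex: cosine_similarity(v, ex[0]),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0] if votes else None

# Made-up training data standing in for adjectives mined from tweets.
knn = TinyKNN(k=1)
knn.train(['sweet', 'awesome'], 'WIN')
knn.train(['stupid', 'broken'], 'FAIL')
win = knn.classify(['sweet'])
fail = knn.classify(['stupid'])
```

With k=1 the label of the single most similar training example wins. A larger k takes a majority vote among the k most similar examples.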
pattern.graph
Module pattern.graph provides a data structure to represent relationships between nodes (e.g. words, concepts, entities, ...). The relative importance (or centrality) of each node can then be calculated. Graphs can be exported as an interactive web page using the HTML <canvas> element (demo).
The screenshot shows an exported graph with nodes pointing to more important nodes (data mined from Bing). Nodes with a lot of "traffic" are marked with a shadow (money, football, life); important nodes are marked in blue (experience, I, nothing, money).
Note: The "nothing" result could use some extra post-processing. For example, in "nothing is more important than life", the word "life" is important, not the word "nothing".
Source code:
>>> from pattern.web import Bing, plaintext
>>> from pattern.en import parsetree
>>> from pattern.search import search
>>> from pattern.graph import Graph, Node, Edge, export
>>>
>>> g = Graph()
>>> for i in range(10):
>>>     for r in Bing().search('"more important than"', start=i+1, count=50):
>>>         s = r.description.lower()
>>>         s = plaintext(s)
>>>         t = parsetree(s)
>>>         p = '{NP} (VP) more important than {NP}'
>>>         for m in search(p, t):
>>>             a = m.group(1).string # Left NP.
>>>             b = m.group(2).string # Right NP.
>>>             if a not in g:
>>>                 g.add_node(a, radius=5, stroke=(0,0,0,0.8))
>>>             if b not in g:
>>>                 g.add_node(b, radius=5, stroke=(0,0,0,0.8))
>>>             g.add_edge(g[b], g[a], stroke=(0,0,0,0.6))
>>>
>>> g = g.split()[0] # Largest subgraph.
>>>
>>> for n in g.sorted()[:40]: # Sorted by Node.weight.
>>>     n.fill = (0.0, 0.5, 1.0, 0.7 * n.weight)
>>>
>>> export(g, 'test', directed=True, weighted=0.6, distance=6)
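Ranking nodes by importance, as g.sorted() does above, can be illustrated with a much simpler measure than the one Pattern computes. The sketch below scores each node by its number of incoming edges, scaled so that the busiest node gets 1.0. This is a stand-in for illustration only and makes no claim about how Node.weight is calculated. The function name and the toy edges (built from words mentioned above) are made up.

```python
from collections import defaultdict

def indegree_centrality(edges):
    """Map each node to its in-degree, scaled so the top node scores 1.0."""
    incoming = defaultdict(int)
    nodes = set()
    for source, target in edges:
        nodes.update((source, target))
        incoming[target] += 1
    top = max(incoming.values()) if incoming else 0
    return dict((n, incoming[n] / float(top) if top else 0.0) for n in nodes)

# Toy directed edges: each edge points at the "more important" node.
edges = [('football', 'money'), ('experience', 'money'), ('money', 'life')]
weights = indegree_centrality(edges)
ranked = sorted(weights, key=weights.get, reverse=True)
```

In this toy graph, ranked starts with the node that has the most incoming edges, much like taking the first entries of a sorted node list.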
Examples & experiments
Belgian elections, June 13, 2010 – Twitter opinion mining
After the fall of the previous government, the New Flemish Alliance emerged as the plurality party with 27 seats. In the week before the elections we analyzed 7,600 tweets that mentioned the name of a Belgian politician.
November 2010 – March 2011, 100 days of web mining
During a 100-day period, we collected 6,400 Google News items and 70,000 tweets with the goal of finding a correlation between important news items and personal opinions on Twitter. What we got was profanity, mostly.