Wikifier documentation
To call the JSI Wikifier, send an HTTP GET request to a URL of the following form:
http://www.wikifier.org/annotate-article?text=...&lang=...&...
The server is currently still in development, so it may occasionally be down.
The following parameters are supported:
- text: the text of the document that you want to annotate. Use UTF-8 and %-encoding for non-ASCII characters (e.g. text=Beyonc%C3%A9).
- lang: the ISO-639 code of the language of the document. Both 2- and 3-letter codes are supported (e.g. en or eng for English, sl or slv for Slovenian, etc.). See also: list of all the languages currently supported by the JSI Wikifier.
- wikiDataClasses: should be true or false; determines whether to include, for each annotation, a list of WikiData (concept ID, concept name) pairs for all classes to which this concept belongs (directly or indirectly).
- wikiDataClassIds: like wikiDataClasses, but generates a list of concept IDs only (which makes the resulting JSON output shorter).
- support: should be true or false; determines whether to include, for each annotation, a list of subranges in the input document that support this particular annotation.
- ranges: should be true or false; determines whether to include, for each subrange in the document that looks like a possible mention of a concept, a list of all candidate annotations for that subrange. This significantly increases the size of the resulting JSON output, so it should only be used if there is a strong need for this data.
- includeCosines: should be true or false; determines whether to include, for each annotation, the cosine similarity between the input document and the Wikipedia page corresponding to that annotation. Currently the cosine similarities are provided for informational purposes only and are not used in choosing the annotations, so you should set this to false to conserve some CPU time if your application does not need the cosines.
- maxMentionEntropy: set this to a real number x to cause all highly ambiguous mentions to be ignored (i.e. they will contribute no candidate annotations to the process). The heuristic used is to ignore mentions where H(link target | anchor text = this mention) > x. (Default value: −1, which disables this heuristic.)
- maxTargetsPerMention: set this to an integer x to use only the x most frequent candidate annotations for each mention (default value: 20). Note that some mentions appear as the anchor text of links to many different Wikipedia pages, so disabling this heuristic (by setting x = −1) can increase the number of candidate annotations significantly and make the annotation process slower.
- pageRankSqThreshold: set this to a real number x to prune the annotations on the basis of their pagerank scores. The Wikifier computes the sum of the squares of the pagerank scores of all the annotations (call it S), sorts the annotations by decreasing pagerank, and keeps as many top-ranking annotations as are needed to bring the sum of their squared pageranks to S · x. Thus a lower x means fewer annotations. (Default value: −1, which disables this pruning mechanism.)
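For example, a minimal GET call can be assembled in Python as follows. This is only a sketch: the text and parameter values below are illustrative choices, not server defaults, and for longer documents you should use a POST request as in the sample code at the end of this page.

import urllib.parse, urllib.request, json

# A minimal sketch of a GET call to the Wikifier; the text and
# parameter values here are illustrative, not server defaults.
params = urllib.parse.urlencode([
    ("text", "New York City is the most populous city in the United States."),
    ("lang", "en"),
    ("support", "true"),
    ("includeCosines", "false")])
url = "http://www.wikifier.org/annotate-article?" + params
with urllib.request.urlopen(url, timeout=60) as f:
    response = json.loads(f.read().decode("utf8"))
print([annotation["title"] for annotation in response["annotations"]])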
Output format
The Wikifier returns a JSON response of the following form:

{
  "annotations": [ ... ],
  "spaces": ["", " ", " ", "."],
  "words": ["New", "York", "City"],
  "ranges": [ ... ]
}
The spaces and words arrays show how the input document has been split into words. It is always the case that spaces has exactly one more element than words, and that concatenating spaces[0] + words[0] + spaces[1] + words[1] + ... + spaces[N-1] + words[N-1] + spaces[N] (where N is the length of words) is exactly equal to the input document (the one that was passed as the &text=... parameter).
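This invariant is easy to check (and to use) in code; the following is a small sketch, assuming response holds the parsed JSON reply:

# Rebuild the input document from the "spaces" and "words" arrays;
# "response" is assumed to hold the parsed JSON reply.
def Reassemble(response):
    words, spaces = response["words"], response["spaces"]
    assert len(spaces) == len(words) + 1
    parts = [spaces[0]]
    for i, word in enumerate(words):
        parts.append(word)
        parts.append(spaces[i + 1])
    return "".join(parts)

# With the example above, Reassemble(response) == "New York City.".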
annotations is an array of objects of the following form:

{
  "title":"New York City",
  "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_City",
  "lang":"en",
  "pageRank":0.102831,
  "cosine":0.662925,
  "enTitle":"New York City",
  "enUrl":"http:\/\/en.wikipedia.org\/wiki\/New_York_City",
  "wikiDataClasses": [
    {"itemId":"Q515", "enLabel":"city"},
    {"itemId":"Q1549591", "enLabel":"big city"},
    ... ],
  "wikiDataClassIds": ["Q515", "Q1549591", ...],
  "dbPediaTypes": ["City", "Settlement", "PopulatedPlace", ...],
  "dbPediaIri":"http:\/\/dbpedia.org\/resource\/New_York_City",
  "supportLen":2.000000,
  "support": [
    {"wFrom":0.000000, "wTo":1.000000, "pMentionGivenSurface":0.122591, "pageRank":0.018634},
    {"wFrom":0.000000, "wTo":2.000000, "pMentionGivenSurface":0.483354, "pageRank":0.073469} ]
}
- url is the URL of the Wikipedia page corresponding to this annotation, and title is its title;
- lang is the language code of the Wikipedia from which this annotation is taken (currently this is always the language of the input document);
- enUrl and enTitle refer to the equivalent page of the English-language Wikipedia, if available (this is more useful when lang != "en");
- wikiDataClasses and wikiDataClassIds are lists of the classes to which this concept belongs according to WikiData (using the instanceOf property, and then all their ancestors that can be reached with the subclassOf property);
- dbPediaIri is (one of) the DBPedia IRIs corresponding to this annotation, and dbPediaTypes are the types to which this DBPedia IRI is connected via the http://www.w3.org/1999/02/22-rdf-syntax-ns#type property;
- support is an array of all the subranges in the document that support this particular annotation; for each such subrange, wFrom and wTo are the indices (into words) of the first and last word of the subrange; pageRank is the pagerank of this subrange (not necessarily a very useful value for the user), and pMentionGivenSurface is the probability that, when a link with this particular subrange as its anchor text appears in Wikipedia, it points to the Wikipedia page corresponding to the current annotation (see the sketch after this list for how these fields fit together).
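As a sketch of how these fields fit together, the following helpers (MentionPhrase and PrintSupport are hypothetical names, and response is assumed to be a parsed reply obtained with support=true) print each annotation with its supporting phrases:

# Reconstruct the exact surface string of words[wFrom..wTo],
# using the separators recorded in "spaces".
def MentionPhrase(response, wFrom, wTo):
    words, spaces = response["words"], response["spaces"]
    parts = [words[wFrom]]
    for i in range(wFrom + 1, wTo + 1):
        parts.append(spaces[i])
        parts.append(words[i])
    return "".join(parts)

# Print each annotation together with its supporting phrases;
# requires a call made with support=true.
def PrintSupport(response):
    for annotation in response["annotations"]:
        print(annotation["title"])
        for sup in annotation["support"]:
            wFrom, wTo = int(sup["wFrom"]), int(sup["wTo"])
            print("  %r (pMentionGivenSurface = %.4f)" % (
                MentionPhrase(response, wFrom, wTo), sup["pMentionGivenSurface"]))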
ranges is an array of objects of the following form:

{
  "wFrom": 0,
  "wTo": 1,
  "pageRank":0.018634,
  "pMentionGivenSurface":0.122591,
  "candidates": [
    {"title":"New York", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York", "cosine":0.578839, "linkCount":63626, "pageRank":0.049533},
    {"title":"New York City", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_City", "cosine":0.662925, "linkCount":11589, "pageRank":0.102831},
    {"title":"New York (magazine)", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_(magazine)", "cosine":0.431092, "linkCount":2159, "pageRank":0.030795},
    ... ]
}
The first four members are the same as in support; in this particular example,
we have wFrom = 0 and wTo = 1, so this object refers to the phrase "New York".
The candidates array lists all the Wikipedia pages that are pointed to by links
(from other Wikipedia pages) whose anchor text is the same as this phrase; for each
such page, there is an object giving its title, Wikipedia URL, cosine similarity with the
input document, the number of links with this anchor text pointing to this particular page
(linkCount), and the pagerank score of this candidate annotation. For phrases that generate
too many candidates, some of the candidates might not participate in the pagerank
computation; in that case their pageRank is shown as −1 instead.
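If you requested ranges=true, the candidates can be inspected directly; a brief sketch (PrintCandidates is a hypothetical helper, and response is assumed to be the parsed reply):

# For each possible mention in the document, print its candidate pages
# in decreasing order of pagerank; requires a call made with ranges=true.
def PrintCandidates(response):
    words = response["words"]
    for rng in response["ranges"]:
        # Approximate surface string; the exact separators are in "spaces".
        phrase = " ".join(words[int(rng["wFrom"]) : int(rng["wTo"]) + 1])
        print("Mention: %r" % phrase)
        for cand in sorted(rng["candidates"], key=lambda c: -c["pageRank"]):
            print("  %s (pageRank = %g, linkCount = %d)" % (
                cand["title"], cand["pageRank"], cand["linkCount"]))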
Sample code in Python 3
Note: the following sample uses POST; if your input document is short, you can also use GET instead.
import urllib.parse, urllib.request, json

def CallWikifier(text, lang="en", threshold=0.8):
    # Prepare the URL.
    data = urllib.parse.urlencode([
        ("text", text), ("lang", lang),
        ("pageRankSqThreshold", "%g" % threshold),
        ("wikiDataClasses", "true"), ("wikiDataClassIds", "false"),
        ("support", "true"), ("ranges", "false"),
        ("includeCosines", "false"), ("maxMentionEntropy", "3")])
    url = "http://www.wikifier.org/annotate-article"
    # Call the Wikifier and read the response.
    req = urllib.request.Request(url, data=data.encode("utf8"), method="POST")
    with urllib.request.urlopen(req, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Output the annotations.
    for annotation in response["annotations"]:
        print("%s (%s)" % (annotation["title"], annotation["url"]))

CallWikifier("Syria's foreign minister has said Damascus is ready " +
    "to offer a prisoner exchange with rebels.")