I want to extract data from the wikilinks returned by the mwparserfromhell lib.
For instance, I want to parse the following string:
[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]
If I split the string using the character |, it doesn't work, as there is a link inside the description of the image that uses the | as well: [[Maria Skłodowska-Curie Museum|Birthplace]].
I'm using a regex to first replace all links in the string before splitting it. It works (in this case), but it doesn't feel clean (see the code below). Is there a better way to extract information from such a string?
import re
wiki_code = "[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]"
# Remove "[[File:" at the beginning of the string
prefix = "[[File:"
if wiki_code.startswith(prefix):
    wiki_code = wiki_code[len(prefix):]
# Remove "]]" at the end of the string
suffix = "]]"
if wiki_code.endswith(suffix):
    wiki_code = wiki_code[:-len(suffix)]
# Replace each inner link with its label (the part after the last pipe)
link_pattern = re.compile(r'\[\[.*?\]\]')
matches = link_pattern.findall(wiki_code)
for match in matches:
    content = match[2:-2]
    arr = content.split("|")
    label = arr[-1]
    wiki_code = wiki_code.replace(match, label)
# Only the top-level pipes are left, so a plain split is now safe
print(wiki_code.split("|"))
The links returned by .filter_wikilinks() are instances of the Wikilink class, which has title and text properties.
title returns the title of the link: File:Warszawa, ul. Freta 16 20170516 002.jpg
text returns the rest of the link: thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].
These are returned as Wikicode objects.
Since the actual text is always the last fragment, you first need to find the other fragments with the following regex:
([^\[\]|]*\|)+
( ): a group of
[^\[\]|]*: 0 or more characters that are not square brackets or pipes
\|: a literal pipe
+: 1 or more repetitions of the group
Everything else from the ending index of the last match until the end of the string is the last fragment.
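The two steps above can be wrapped in a small helper that needs only the standard re module. This is a minimal sketch: the names OPTIONS_RE and split_options_and_caption are made up for illustration, and it assumes the input is the string form of the link's text (everything after the title), as in the example string from the question.

```python
import re

# Leading run of "option|" fragments that contain no brackets or pipes.
OPTIONS_RE = re.compile(r'([^\[\]|]*\|)+')

def split_options_and_caption(text):
    """Return (options, caption) for the part of a file link after the title.

    Hypothetical helper, not part of mwparserfromhell.
    """
    match = OPTIONS_RE.match(text)
    if match is None:
        # No leading "option|" fragment: treat the whole text as the caption.
        return [], text
    # Drop the trailing pipe before splitting so no empty option is produced.
    options = match.group(0)[:-1].split('|')
    # Everything after the last match is the final fragment.
    caption = text[match.end():]
    return options, caption

text = ('thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] '
        'of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].')
options, caption = split_options_and_caption(text)
print(options)
print(caption)
```

Because the regex stops at the first square bracket, this only works when every option comes before the caption and no option contains a link.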
>>> import mwparserfromhell
>>> import re
>>> wikitext = mwparserfromhell.parse('[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]')
>>> image_link = wikitext.filter_wikilinks()[0]
>>> image_link
'[[File:Warszawa, ul. Freta 16 20170516 002.jpg|thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].]]'
>>> image_link.title
'File:Warszawa, ul. Freta 16 20170516 002.jpg'
>>> text = str(image_link.text)
>>> text
'thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].'
>>> other_fragments = re.match(r'([^\[\]|]*\|)+', text)
>>> other_fragments
<re.Match object; span=(0, 19), match='thumb|upright=1.18|'>
>>> other_fragments.span(0)[1]
19
>>> text[19:]
'[[Maria Skłodowska-Curie Museum|Birthplace]] of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].'
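If you would rather not rely on the caption being the last fragment, an alternative to the regex is to split only on pipes that are not nested inside [[...]] or {{...}}. The sketch below is a different technique from the one shown above, not something mwparserfromhell provides; split_top_level is a hypothetical name, and it assumes the brackets in the input are balanced.

```python
def split_top_level(text, sep='|'):
    """Split text on sep, ignoring separators nested in [[...]] or {{...}}.

    Hypothetical helper; assumes opening and closing pairs are balanced.
    """
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ('[[', '{{'):
            # Entering a nested link or template.
            depth += 1
            current.append(pair)
            i += 2
        elif pair in (']]', '}}') and depth > 0:
            # Leaving a nested link or template.
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            # Top-level separator: close the current fragment.
            parts.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append(''.join(current))
    return parts

text = ('thumb|upright=1.18|[[Maria Skłodowska-Curie Museum|Birthplace]] '
        'of Marie Curie, at 16 Freta Street, in [[Warsaw]], [[Poland]].')
fragments = split_top_level(text)
for fragment in fragments:
    print(fragment)
```

Here the pipe inside [[Maria Skłodowska-Curie Museum|Birthplace]] is skipped because it sits at depth 1, so the caption comes back as a single fragment with its links intact.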