Wikipedia datamining

Tuesday 2 July 2019

There are several online sites that can be data-mined to reveal really nice trends, top-10s and top-down summaries. Twitter is the archetype site for this, thanks to hashtags making an easy job for anyone wanting to investigate trends. I prefer Reddit for datamining specific trends, as it is powered by folk having arguments on topics they are passionate about, as opposed to the ideas of celebrities, corporate spokespeople and ФСБ agents. eBay is also fun as it reveals what people are willing to pay for things. But the best source of data, even for other datasets, is Wikipedia. Not only to read up on things, but also to get data for things within a given "category".

Getting lists of things

Often I find myself needing a list of things of a given type for some hobby project or other. Say all airplanes, Greek heroes, Roman emperors or stars.
All Wikipedia pages fall under a number of categories, listed at the bottom of the page. These are a great way to get all the articles of that grouping. For some things there are probably good JSON or CSV files, but for others it's a lot harder. Stars belong to the former group, as there is a great dataset (the HYG database), but for illustrative purposes I'll use them anyway as it's a simple category.

Exploring a category

This is not what we were looking for...
For example, let's take as a random starting point Tabby's star (the star with strange properties that was hypothesised to have a Dyson sphere, before astrophysicists found out it was something "extremely interesting", but far short of an alien megastructure), which is not obscure, but not mainstream either. It is tagged with several categories, including Category:F-type main-sequence stars, which is a child category of Category:Main-sequence stars, which in turn is a child of Category:Stars by luminosity class. The further up one goes, the broader it gets, except that child categories you probably don't want creep in. For instance, Category:Stars contains Category:Coats of arms with stars, so if I scraped from there I'd get crap like the Seal of Oklahoma. So it is best to explore the category architecture a bit first, to avoid this rather than fighting it afterwards.

Here is a class I wrote to get all pages recursively down categories unless a child category has an optional forbidden keyword. It also parses the content if a method is supplied and gets pageviews.
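The recursive walk itself is simple enough to sketch. This is a minimal stand-in and not the original class: the function names (fetch_members, crawl), the forbidden argument and the User-Agent string are my own assumptions, and content parsing and pageviews are left out. It relies on the standard MediaWiki query list=categorymembers, where namespace 14 marks a subcategory and namespace 0 an article.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder value: replace with something that identifies your own script.
USER_AGENT = "category-crawler-sketch/0.1 (hobby script)"


def fetch_members(category):
    """Return the direct members of a category as dicts with 'ns' and 'title'."""
    members, continuation = [], {}
    while True:
        params = {"action": "query", "list": "categorymembers",
                  "cmtitle": category, "cmlimit": "500", "format": "json",
                  **continuation}
        request = urllib.request.Request(
            API_URL + "?" + urllib.parse.urlencode(params),
            headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            return members
        # Long categories are paged: pass the continuation values back as-is.
        continuation = data["continue"]


def crawl(category, fetch=fetch_members, forbidden=(), seen=None):
    """Collect article titles under a category, skipping unwanted subcategories."""
    seen = set() if seen is None else seen
    if category in seen:  # categories can loop back on themselves
        return set()
    seen.add(category)
    pages = set()
    for member in fetch(category):
        title = member["title"]
        if member["ns"] == 14:  # a subcategory
            if any(word.lower() in title.lower() for word in forbidden):
                continue  # e.g. "Coats of arms with stars"
            pages |= crawl(title, fetch, forbidden, seen)
        elif member["ns"] == 0:  # an ordinary article
            pages.add(title)
    return pages
```

Something like crawl("Category:Stars by luminosity class", forbidden=["coats of arms"]) would then return a set of article titles. The fetch argument is there so that the walk can be tried on a small fake tree before hitting the real API.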

Exploring the infobox

Articles on a given subject will all have a common infobox on the right-hand side. In the case of a star, this is a starbox. Here is the wikimarkup code (click edit, but change the query string value from edit to raw, e.g. &action=raw) for a different star, δ Cephei, which is also a corner case:

{{about||the variable star type|Delta Cephei variable|the general class of variable stars|Cepheid variable}}
{{Starbox begin
 | name = Delta Cephei
}}
{{Starbox image
 | image=
{{Location mark
|image=Cepheus constellation map.svg
|float=center
|alt=
|label=
|position=right
|width=280
|mark=Red circle.svg
|mark_width=10
|mark_link=δ Cephei
|x=613|y=1062
}}
|caption=Location of δ Cep (circled)
}}
{{Starbox observe 2s
 | component1 = δ Cep A
 | epoch      = [[J2000.0]]
 | constell   = [[Cepheus (constellation)|Cepheus]]
 | ra1        = {{RA|22|29|10.26502}}<ref name=aaa474_2_653/>
 | dec1       = {{DEC|+58|24|54.7139}}<ref name=aaa474_2_653/>
 | appmag_v1  = 4.07 (3.48–4.37) / 7.5
 | component2 = δ Cep C
 | ra2 = {{RA|22|29|09.248}}<ref name=aaa474_2_653/> <!--Right Ascension of the third component-->
 | dec2 = {{DEC|+58|24|14.76}}<ref name=aaa474_2_653/> <!--Declination of the third component-->
 | appmag_v2 = 6.3 <!--Apparent magnitude of the third component (Johnson-Cousins V system)-->}}
{{Starbox character
 | class = F5Ib-G1Ib<ref name=engle>{{Cite journal | doi = 10.1088/0004-637X/794/1/80| title = THE SECRET LIVES OF CEPHEIDS: EVOLUTIONARY CHANGES AND PULSATION-INDUCED SHOCK HEATING IN THE PROTOTYPE CLASSICAL CEPHEID δ Cep| journal = The Astrophysical Journal| volume = 794| issue = 1| pages = 80| year = 2014| last1 = Engle | first1 = S. G. | last2 = Guinan | first2 = E. F. | last3 = Harper | first3 = G. M. | last4 = Neilson | first4 = H. R. | last5 = Evans | first5 = N. R. |arxiv = 1409.8628 |bibcode = 2014ApJ...794...80E }}</ref> + B7-8<ref name=evans>{{cite journal | doi = 10.1088/0004-6256/146/4/93 | title= BINARY CEPHEIDS: SEPARATIONS AND MASS RATIOS IN 5 M ☉ BINARIES | journal=The Astronomical Journal | date=2013 | volume=146 | issue=4 | pages=93 | first=Nancy Remage | last=Evans|arxiv = 1307.7123 |bibcode = 2013AJ....146...93E }}</ref>
 | r-i         = <!--R-I color-->
 | v-r         = <!--V-R color-->
 | b-v         = 0.60
 | u-b         = 0.36
 | variable    = [[Cepheid variable|Cepheid]]
}}
{{Starbox astrometry
 | radial_v    = -16.8<ref name=anderson15/>
 | prop_mo_ra  = +15.35<ref name=aaa474_2_653/>
 | prop_mo_dec = +3.52<ref name=aaa474_2_653/>
 | parallax    = 3.77
 | p_error     = 0.16
 | parallax_footnote = <ref name=aaa474_2_653/>
 | dist_ly     = {{nowrap|887 ± 26}} 
 | dist_pc     = {{nowrap|272 ± 8}}<ref name=benedict02/><ref name=majaess2012/>
 | absmag_v    = {{nowrap|–3.47 ± 0.10}} {{nowrap|(–3.94 - –3.05)}}<ref name=benedict02/>
}}
{{Starbox detail
 | component1  = δ Cep A
 | mass        = {{nowrap|4.5 ± 0.3}}<ref name=apj744_1_53/>
 | radius      = 44.5<ref name=apj744_1_53/>
 | gravity     = 
 | luminosity  = ∼2000<ref name=apj744_1_53/>
 | temperature = 5,500–6,800<ref name=moore/>
 | metal_fe    = +0.08<ref name=aaa488_1_25/>
 | rotational_velocity = 9<ref name=ciako1970/>
 | age_myr     = ~100
 | component2  = δ Cep B<ref name=anderson15/>
 | mass2       = 0.2 - 1.2
<!-- Unfortunately, the Starbox template will only show two components. Since the physical association between δ Cep B and A is much clearer than between C and A, it makes sense to keep B visible for now. Ideally, one should show C as well.-->
 | component3  = δ Cep C
 | luminosity3 = 500
 | temperature3 = 8,800<ref name=apj744_1_53/>
}}
{{Starbox orbit
 | reference   = <ref name=anderson15/>
 | name        = δ Cep B
 | primary     = δ Cep A
 | period      = 6.03
 | eccentricity= 0.647
 | k1          = 1.509 ± 0.2
 | name3       = Delta Cephei C
 | period3     = <!-- Previously listed value of 500yrs is much too small, although no good estimate available -->
 | axis3       = <!--Semimajor axis (in arcseconds)-->
 | axis_unitless3 = 12,000 [[Astronomical unit|AU]]
 | eccentricity3 = <!--Eccentricity-->
 | inclination3 = <!--Inclination (in degrees)-->
 | node3        = <!--Longitude of node (in degrees)-->
 | periastron3  = <!--Periastron epoch-->
 | periarg3     = <!--Argument of periastron (in degrees)-->
 | mass3       = <!-- Listed figure of 54 solar masses highly dubious, so I'm hiding it. -Gnomon -->
}}
{{Starbox catalog
 | names = 27 Cephei, [[Bonner Durchmusterung|BD]]+57 2548, [[Fifth Fundamental Catalogue|FK5]] 847, [[Henry Draper catalogue|HD]] 213306, [[Hipparcos catalogue|HIP]] 110991, [[Harvard Revised catalogue|HR]] 8571, [[Smithsonian Astrophysical Observatory Star Catalog|SAO]] 34508.
}}
{{Starbox reference
 | Simbad = delta+Cep
}}
{{Starbox end}}
As we can see there is a lot of stuff that makes data mining hard, e.g. <ref>. But there is also a large amount of data for us to plunder. In wikimarkup, a double curly bracket calls a template with arguments separated by pipes, e.g. {{RA|22|29|10.26502}}. Starbox is therefore a series of templates that are formatted nicely by bots for readability. Template documentation can be found by searching Wikipedia with the "Template:" prefix. Let's take Template:Starbox_astrometry: there we find its documentation, along with all the possible options. Do note that a small number of pages have misspelt options that will not render, and many options will have values that are not regular numbers but come in a weird array of formats:

  • a number
  • a number with thousands comma
  • a number with European thousands point (very rare)
  • the val template, e.g. gravity
  • nowrap template used with a ± symbol (incorrect, but very common)
  • a number with one of several different units (ouch)
  • a val/nowrap template with one of several different units outside the template, e.g. rotation
  • a val template with one of several different units within the template, given with the argument ul= (full list of what units mean)
  • a range with a hyphen-minus or en-dash (-, –)
  • Some human readable value like A: 2.2 B: 202
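To give a feel for what cleaning these up involves, here is a rough sketch that pulls the first number out of a raw value. The function name first_number and the regular expressions are my own assumptions, not part of any library. It copes with thousands commas, references, comments, val/nowrap wrappers and the lower end of a range, but deliberately ignores units and does not handle the European thousands point or multi-component values like A: 2.2 B: 202.

```python
import re

# Optional sign (ASCII +/-, minus sign or en-dash), digits with optional
# thousands commas, then an optional decimal part.
NUMBER = re.compile(r"[-+−–]?\d[\d,]*(?:\.\d+)?")


def first_number(raw):
    """Return the first number found in a raw infobox value, or None."""
    # Drop HTML comments such as <!--R-I color-->.
    text = re.sub(r"<!--.*?-->", "", raw, flags=re.S)
    # Drop references, both self-closing and paired.
    text = re.sub(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", "", text, flags=re.S)
    match = NUMBER.search(text)
    if match is None:
        return None
    # Normalise typographic minus signs and strip thousands commas.
    token = match.group().replace(",", "").replace("−", "-").replace("–", "-")
    return float(token)
```

For a range such as 5,500–6,800 this returns only the lower bound, which may or may not be what you want, so it is worth keeping the raw string alongside whatever number is extracted.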
There is a Python package called wikitextparser that makes parsing wikimarkup easier. Here is a piece of code to convert a given infobox to a dictionary. Note that I have not extracted the numbers from the values, but left them as they are: in data munging it is often best to save everything, as certain weird formats may be pervasive.

Pageviews

Lastly, a really nice source of data is not Wikipedia's content, but its metadata, namely its pageviews, that is, how popular an article is. This can tell you a lot about the articles in question and gives the most interesting results. In fact, if you consider that proximity and apparent magnitude are strong drivers of popularity, the first star used as an example (the not-so-full-of-aliens one) is a total outlier! Do note that the API is different: the Wikipedia API used above is https://en.wikipedia.org/w/api.php, while for pageviews it's https://wikimedia.org/api/.
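As a rough sketch, the per-article endpoint of the Wikimedia REST API can be queried as below. The function names, the default of monthly granularity and the User-Agent string are my own assumptions; dates are given as YYYYMMDD.

```python
import json
import urllib.parse
import urllib.request

PAGEVIEWS_URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
                 "{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}")
# Placeholder value: replace with something that identifies your own script.
USER_AGENT = "pageviews-sketch/0.1 (hobby script)"


def pageviews_url(title, start, end, project="en.wikipedia", granularity="monthly"):
    """Build the request URL for one article between two YYYYMMDD dates."""
    # Titles use underscores for spaces and must be percent-encoded.
    article = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return PAGEVIEWS_URL.format(project=project, access="all-access", agent="user",
                                article=article, granularity=granularity,
                                start=start, end=end)


def total_views(payload):
    """Sum the 'views' field over all items of a decoded response."""
    return sum(item["views"] for item in payload.get("items", []))


def fetch_total_views(title, start, end):
    """Download the pageviews of an article and return their total."""
    request = urllib.request.Request(pageviews_url(title, start, end),
                                     headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return total_views(json.load(response))
```

Calling fetch_total_views("Tabby's Star", "20190101", "20190630") would then give the views by human users over the first half of 2019. The agent is set to user so that bots and spiders are excluded.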

Example

Here is an example of a figure made with data from Wikipedia: the number of planes produced, the page views and the year. (GitHub: https://github.com/matteoferla/Wikipedia_planes, static HTML version of the Jupyter notebook: http://users.ox.ac.uk/~bioc1451/planewiki_notebook.html)
