InPho Data Blog

Topic Modeling Tutorial at JCDL 2015
Posted on
May 14, 2015
by
Jaimie Murdock
No Comments
Join the HathiTrust Research Center (HTRC) and InPhO Project for a half-day tutorial on HathiTrust data access and topic modeling at JCDL 2015 in Knoxville, TN on Sunday, June 21, 2015, 9am-12pm!
Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research
Organizers: Jaimie Murdock, Jiaan Zeng and Robert McDonald
Abstract:
In this half-day tutorial, we will show 1) how the HathiTrust Research Center (HTRC) Data Capsule can be used for non-consumptive research over a collection of texts and 2) how integrated tools for LDA topic modeling and visualization can be used to drive the formulation of new research questions. Participants will be given an account in the HTRC Data Capsule and taught how to use the workset manager to create a corpus, and then how to use the VM’s secure mode to download texts and analyze their contents. [tutorial paper]
We draw your attention to the astonishingly low half-day tutorial fees:
Half-Day Tutorial/Workshop Early Registration (by May 22!)
ACM/IEEE/SIG/ASIS&T Members – $70
Non-ACM/IEEE/SIG/ASIS&T Members – $95
ACM/IEEE/SIG/ASIS&T Student – $20
Non-member Student – $40
Half-Day Tutorial/Workshop Late/Onsite Registration
ACM/IEEE/SIG/ASIS&T Members – $95
Non-ACM/IEEE/SIG/ASIS&T Members – $120
ACM/IEEE/SIG/ASIS&T Student – $40
Non-member Student – $60
Hope to see you there!
Barry Smith APA talk on Formal Ontology for Philosophy
Posted on
April 4, 2015
by
Colin Allen
No Comments
At the 2015 APA Pacific meeting, Barry Smith presented a very thorough overview of ontologies, and some equally useful critiques of our InPhO approach. I’ll follow up on this post with responses to Barry’s comments once the conference is over.
Towards an Ontology of Philosophy
The InPhO Topic Explorer
Posted on
August 10, 2014
by
Jaimie Murdock
No Comments
This summer, the InPhO Project launched the topic explorer. This visualization shows the similarity of articles in the Stanford Encyclopedia of Philosophy, as determined by LDA topic models.
InPhO Topic Explorer for the SEP entry on Animal Consciousness. Click to go to the interactive visualization
The color bands within each article’s row show the topic distribution within that article, and the relative size of each band indicates the weight of that topic in the article. The full width of each row indicates the similarity to the focus article. Each topic’s label and color are arbitrarily assigned, but are consistent across articles in the browser.
Display options include topic normalization, alphabetical sort and topic sort. By normalizing topics, each bar expands to the full width so that topic weights per document can be compared. By clicking a topic, the documents will reorder according to that topic’s weight, and the topic bars will reorder according to the topic weights in the highest weighted document.
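As a rough illustration of how rows like these could be computed, here is a minimal Python sketch. It is not the Topic Explorer's actual code: the function names, the toy topic weights, and the choice of cosine similarity are all assumptions made for the example.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic-weight vectors (an assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def explorer_rows(doc_topics, focus, normalize=False, sort_topic=None):
    """Return (document, band_widths) rows for a focus document.

    Without normalization each row's total width is its similarity to the
    focus document; with normalization every row spans the full width, so
    topic weights can be compared across documents.
    """
    focus_vec = doc_topics[focus]
    rows = []
    for doc, weights in doc_topics.items():
        sim = cosine(focus_vec, weights)
        scale = 1.0 if normalize else sim
        rows.append((doc, sim, [w * scale for w in weights]))
    if sort_topic is None:
        # Default order: most similar to the focus document first.
        rows.sort(key=lambda r: r[1], reverse=True)
    else:
        # "Clicking a topic": order documents by that topic's weight.
        rows.sort(key=lambda r: doc_topics[r[0]][sort_topic], reverse=True)
    return [(doc, bands) for doc, _, bands in rows]

# Hypothetical 3-topic distributions, invented for illustration only.
doc_topics = {
    "kant": [0.7, 0.2, 0.1],
    "hume": [0.6, 0.3, 0.1],
    "creationism": [0.1, 0.1, 0.8],
}

for doc, bands in explorer_rows(doc_topics, "kant"):
    print(doc, [round(b, 3) for b in bands])
```

Passing `normalize=True` stretches every row to the same total width, and `sort_topic=2` reorders the rows by the third topic's weight.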
By varying the number of topics, one can get a finer or coarser-grained analysis of the areas discussed in the articles. The visualization currently has 20, 40, 60, 80, 100, and 120 topic models for the Stanford Encyclopedia of Philosophy.
In early explorations, the visualization already highlights some interesting phenomena:
For central articles, such as kant (40 topics), one finds that a single topic (topic 30) comprises much of the article. By increasing the number of topics, such as to kant (120 topics), topic 77 now captures the “kant”-ness of the article, but several other components can now be explored. This shows the value of having multiple topic models.
For creationism (120 topics), one can see that the particular blend of topics generating that article is truly an outlier, with the probability only just over .5 of generating the next closest document; compare this to the distribution of top articles related to animal-consciousness (120 topics) or kant (120 topics). Can you find other outliers in the SEP?
The underlying dataset was generated using the InPhO VSM module’s LDA implementation. See Wikipedia: Latent Dirichlet Allocation for more on the LDA topic modeling approach or “Probabilistic Topic Models” (Blei, 2012) for a recent review.
Source code and issue tracking are available at GitHub.
Please share any notes in the comments below!
Network visualization of LDA models through topic similarity
Posted on
December 15, 2013
by
Doori Lee
No Comments
In machine learning, a topic model is a type of statistical model for discovering the “topics” that occur in a corpus composed of documents. The Latent Dirichlet Allocation (LDA) model is one of the most commonly used topic models that represents the corpus as a network of topics.
I have been using the LDA model to see how specific philosophical topics relate to each other in a selection of 1315 volumes in the Hathi Trust library. LDA assumes that a given textual corpus has K topics and that each document in the corpus is a mixture of those topics. A “topic” is defined as a probability distribution over words and is often represented as a list of the most probable words in the topic. The number of topics is selected by the user when the model is trained. Thus the LDA model can be trained over the same textual corpus with different numbers of topics.
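To make these definitions concrete, here is a minimal Python sketch with K=2. The words echo the clusters discussed later in this post, but the probabilities and the document mixture are invented for illustration, and the helper names are hypothetical rather than part of any toolkit.

```python
def top_words(topic, n=5):
    """Represent a topic (word -> probability) by its n most probable words."""
    ranked = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

def word_probability(word, doc_mixture, topics):
    """P(word | doc) = sum over topics k of P(k | doc) * P(word | k)."""
    return sum(weight * topics[k].get(word, 0.0)
               for k, weight in doc_mixture.items())

# Two toy topics: each is a probability distribution over words (sums to 1).
topics = {
    0: {"god": 0.5, "church": 0.3, "people": 0.2},
    1: {"social": 0.4, "individual": 0.35, "life": 0.25},
}

# A document is a mixture of topics (weights sum to 1).
doc_mixture = {0: 0.25, 1: 0.75}

print(top_words(topics[0], n=2))
print(word_probability("life", doc_mixture, topics))
```

Training a model with a different K changes how many such distributions there are, which is why the same corpus can yield broader or more specific topics.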
The number of topics is important in topic modeling because it determines how broad each topic is. Previous research states that there is a natural number of topics for a given corpus (“On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations”, R. Arun et al., 2010). However, depending on the task, a small or large number of topics (in other words, broad or more specific topics) may be suitable.
By visualizing topic networks, we investigate the connections between LDA models trained over the same corpus with different numbers of topics.
In this experiment, we train the LDA model with different numbers of topics (K=20, 40, 160) and find similar topics between models using similarity functions from the Indiana Philosophy Ontology project's vector space model toolkit. For example, for every topic in the 20- and 40-topic models, we find similar topics in the 160-topic model. Each pair of models (e.g. the 20- and 160-topic LDA models) is combined in a graph using Gephi.
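The matching step can be sketched in a few lines of Python. This is a stand-in, not the InPhO toolkit's API: cosine similarity is an assumed choice of similarity function, the function names are hypothetical, and the toy topics are invented. The output is a weighted edge list of the kind that can be written to CSV and loaded into a graph tool.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topics given as word -> probability dicts."""
    dot = sum(prob * q.get(word, 0.0) for word, prob in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def similar_topic_edges(small_model, large_model, n):
    """Link each topic of the small model to its n most similar topics
    in the large model. Small-model topics are labeled T#, large-model
    topics are plain numbers."""
    edges = []
    for i, topic in small_model.items():
        scored = sorted(((cosine(topic, other), j)
                         for j, other in large_model.items()), reverse=True)
        for score, j in scored[:n]:
            edges.append((f"T{i}", str(j), score))
    return edges

# Toy models standing in for, say, the K=20 and K=160 models.
small_model = {
    0: {"god": 0.6, "church": 0.4},
    1: {"social": 0.5, "life": 0.5},
}
large_model = {
    0: {"god": 1.0},
    1: {"church": 0.7, "god": 0.3},
    2: {"social": 1.0},
    3: {"life": 0.8, "social": 0.2},
}

edges = similar_topic_edges(small_model, large_model, n=2)
for source, target, weight in edges:
    print(source, target, round(weight, 3))
```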
Below, the graphs show the network of topics by color-coded clusters based on modularity.
The graphs show topics from the K=20 and K=40 models as T# and topics from the K=160 model as plain numbers. Each color marks a module (cluster) of similar topics, where modules are identified by how much internal structure they have.
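For readers unfamiliar with modularity, the following Python sketch scores a given partition of a small unweighted, undirected graph. It only evaluates a partition; it does not search for the best one the way a graph tool's clustering does. The node labels and edges are hypothetical, chosen to mimic the T#/plain-number labeling above.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph.

    Q = sum over communities c of  L_c / m - (d_c / (2m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal = {}
    degree = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Two tightly linked groups of topics joined by a single bridging edge.
edges = [
    ("T1", "12"), ("T1", "47"), ("12", "47"),
    ("T2", "8"), ("T2", "103"), ("8", "103"),
    ("47", "8"),
]

two_modules = {"T1": "a", "12": "a", "47": "a", "T2": "b", "8": "b", "103": "b"}
one_module = {node: "a" for node in two_modules}

print(modularity(edges, two_modules))  # higher: the split matches the structure
print(modularity(edges, one_module))   # lumping everything together scores zero
```

A partition with more links inside modules than expected by chance gets a higher score, which is the sense in which a module has "internal structure".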
In Graph 1, each topic in the K=20 LDA model is mapped to 8 similar topics in the K=160 model. The 20 topics are grouped into 9 clusters. In Graph 2, each topic in the K=40 LDA model is mapped to 4 similar topics in K=160 model and the 40 topics are grouped into 15 clusters.
The tables below show a sample of topic clusters from each network graph. The first two rows are topics from the K=20 and K=40 models (labeled T#) and the following rows are from the K=160 model. In these tables, a topic is labeled with an arbitrary number that is assigned to identify the topic and is represented by the 5 words that most commonly occur in the topic. The blue bold topics are the topics from the K=160 model that are similar to all topics in the topic cluster.
In Table 1, the related topics concern church, gods, and people. In Table 2, Topics 17 and 35 from the K=40 model share 3 common topics from the K=160 model, which relate to ‘social’, ‘individual’, and ‘life’.
By visualizing topic networks, we observe that topics from models with different numbers of topics can be grouped into clusters by modularity or semantic similarity. Further research could compare clustering algorithms and the LDA models. For example, comparing the semantic closeness of N topic clusters from a (K > N) LDA model with the topics in a (K = N) LDA model could help us obtain high quality topics.
STAT S-675 and S-475: follow up
Posted on
April 12, 2013
by
junotk
18 Comments
For S-675 & 475 students,
As there seems to be some confusion about our LDA model and data, I will try to explain them briefly here.
More….
InPhO and Open Access
Posted on
March 11, 2013
by
Jaimie Murdock
5 Comments
Recent blog posts by Lisa Spiro at
Digital Scholarship in the Humanities
and by Stefan Heßbrüggen-Walter at
Early Modern Thought Online
have raised several interesting questions about the position of philosophy in general and the InPhO project in particular with respect to Digital Humanities.
However, we need to address some serious misrepresentations in the latter, especially the claims that we are committed to a “closed development model” and that our “reuse of the Stanford Encyclopedia of Philosophy is not based on liberal licensing, but apparently on special arrangements.” Also, it is incorrect of Heßbrüggen-Walter to say that, “Web scraping and data mining outside the project is prevented by Copyright. So it may not come as a surprise that the ontology developed by InPho is not licensed for reuse (though you can use an API to search it programmatically).”
More….
STAT S-675 and S-475
Posted on
March 8, 2013
by
Colin Allen
74 Comments
Students in Prof. Michael Trosset’s Spring 2013 STAT S-675/475 class will be using InPhO data sets for their course projects and posting their analyses using the STAT-S-675 tag. Welcome!
Using this tag, we will be posting some “open questions” with corresponding datasets to work on. Watch this space, and come back often.
Open questions and relevant data sets will be added as comments to this post…
The Shape of Philosophy (pt. 2)
Posted on
December 11, 2012
by
junotk
2 Comments
As I suggested in the previous post, one interesting use of Isomap is its iterative application: as we focus on a particular region of the map, we obtain finer details and new coordinates emerge.
In the overall map reproduced at right, let us first focus on the area in the red rectangle. This area roughly contains analytic philosophers in the 20th century, and is replotted below (a zoomable version of the image can be reached by clicking on it):
More….
The Shape of Philosophy (pt.1)
Posted on
December 3, 2012
by
junotk
9 Comments
In my previous posts (pt.
) I gave some graphical representations of Beagle models on the Stanford Encyclopedia of Philosophy (SEP) and the Internet Encyclopedia of Philosophy (IEP). The shapes of those two graphs were basically determined by the force-atlas algorithm in Gephi. Although they look cool, the coordinates produced by the algorithm have no intrinsic meaning, so we cannot interpret the relative locations of philosophers or the x/y-axes.
More….
French and English Philosophy (Part 3)
Posted on
November 18, 2012
by
bkievitk
2 Comments
In the previous two sections, we have looked at different visualizations of the fr.wiki (100 articles about philosophers), en.wiki (100 articles about the same philosophers), Stanford Encyclopedia of Philosophy (SEP) and Internet Encyclopedia of Philosophy (IEP) corpora.
Visualization is a powerful tool for understanding large data sets and helps to direct continued studies, but it is also important to validate our intuitive understanding of the visualizations with quantitative data.
More….