Text mining

Text mining the Brede Database

This example text mines the abstracts in the Brede Database using a bag-of-words representation, hierarchical non-negative matrix factorization (NMF) and a ``cluster bush'' visualization. The analysis is similar to the one performed for the posterior cingulate in [Nielsen et al., 2005, Nielsen et al., 2004].

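% Load the Brede Database bibliography (provides B, used below)
load wobibs

% Convert the abstracts to a bag-of-words matrix
M = brede_bib_bib2mat(B, 'type', 'abstract');

% Remove stop words and single-instance words
M = brede_mat_elimstop(M);
M = brede_mat_elimstop(M, 'filename', 'stop_meshcommon.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_pubmed_neg1.txt');
M = brede_mat_elimsingle(M);

% Hierarchical NMF and cluster bush plot
[WW,HH] = brede_mat_hnmf(M, 'info', 2);
figure
brede_mat_plot_clusterbush(WW, HH, 'nodetextType', 'right')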
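
To make the bag-of-words step concrete, the following is a minimal sketch in plain Matlab that builds a document-by-word count matrix from three made-up abstracts. It does not use the toolbox and is not how brede_bib_bib2mat is implemented; the variable names (docs, vocab, X) and the tokenization rule (runs of the letters a-z) are assumptions chosen for illustration.

```matlab
% Toy abstracts (made up for illustration, not from the Brede Database)
docs = { 'memory retrieval activates the posterior cingulate', ...
         'pain activates the anterior cingulate', ...
         'memory encoding and memory retrieval' };

% Tokenize each document into lower-case words
tokens = cellfun(@(s) regexp(lower(s), '[a-z]+', 'match'), docs, ...
                 'UniformOutput', false);

% Vocabulary: sorted unique words across all documents,
% wordIdx gives the vocabulary index of every token
allWords = [tokens{:}];
[vocab, ~, wordIdx] = unique(allWords);

% Document index of every token, in the same order as allWords
docIdx = [];
for d = 1:numel(tokens)
    docIdx = [docIdx; repmat(d, numel(tokens{d}), 1)];
end

% Count matrix X: rows are documents, columns are words
X = accumarray([docIdx wordIdx(:)], 1, [numel(docs) numel(vocab)]);
```

Each row of X is the bag-of-words vector of one abstract. Word order is discarded and only the counts are kept.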

One of the levels of the hierarchical NMF can be written to an HTML file and viewed in a web browser.

brede_mat_2mat2html(WW{end}, HH{end})
web(['file://' pwd filesep 'Mat.html'], '-browser')

The resulting page shows the clusters of articles and words.
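
One simple way to read clusters off a factorization is sketched below in plain Matlab. The matrices W (articles by components) and H (components by words) and the vocabulary are made-up toy values, and the layout is an assumption; it need not match the layout of the WW and HH returned by the toolbox, nor the way the HTML page is generated.

```matlab
% Toy factorization (made up): 3 articles, 2 components, 4 words
vocab = {'memory', 'retrieval', 'pain', 'noxious'};
W = [0.9 0.1; 0.8 0.0; 0.1 0.7];
H = [1.2 0.9 0.0 0.1; 0.0 0.1 1.1 0.8];

% Assign each article to the component with the largest loading
[~, articleCluster] = max(W, [], 2);

% Collect the two highest-weighted words of each component
topWords = cell(size(H, 1), 1);
for k = 1:size(H, 1)
    [~, order] = sort(H(k, :), 'descend');
    topWords{k} = vocab(order(1:2));
    fprintf('Component %d: %s\n', k, sprintf('%s ', topWords{k}{:}));
end
```

Here the first two articles fall in the first component and the third article in the second.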

Text mining neuroimaging perirhinal studies

In a previous study we text mined articles on ``posterior cingulate'' [Nielsen et al., 2005, Nielsen et al., 2004]. Here we perform the same analysis for ``perirhinal cortex''.

% A query string on human neuroimaging studies with perirhinal
q = [ '"perirhinal" AND ("positron emission tomography" OR ' ...
      '"magnetic resonance imaging") AND "human"[MH]' ];

% Query PubMed
pmids = brede_web_pubmed(q);

% Get full entries
B = brede_web_pmid(pmids, 'output', 'bib');

% Convert to matrix
M = brede_bib_bib2mat(B, 'type', 'abstract');

% Remove stop words and single-instance words
M = brede_mat_elimstop(M, 'filename', 'stop_english1.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_medline.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_pubmed_neg1.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_lobaranatomy.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_meshcommon.txt');
M = brede_mat_elimstop(M, 'stopwords', { 'perirhinal' });
M = brede_mat_elimsingle(M);

% Find 'topics' with NMF
[WW,HH] = brede_mat_hnmf(M, 'info', 2);

% Plot overview
figure,
subplot(1,2,1)
brede_mat_plot_clusterbush(WW, HH, 'nodetexttype', 'left')
subplot(1,2,2)
brede_mat_plot_clusterbush(WW, HH, 'nodetexttype', 'right')
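
The stop word and single-instance word removal used above can be pictured as deleting columns from the document-by-word matrix. The sketch below shows this on a made-up toy matrix in plain Matlab. It is not the toolbox implementation, and the criterion for a ``single-instance'' word (a word found in only one document) is an assumption; brede_mat_elimsingle may use a different rule.

```matlab
% Toy document-by-word count matrix and vocabulary (made up)
vocab = {'the', 'memory', 'cingulate', 'pain', 'encoding'};
X = [2 1 1 0 0
     1 0 1 1 0
     3 2 0 0 1];

% Stop word removal: drop the columns whose word is on the stop list
stopwords = {'the', 'and', 'of'};
keep = ~ismember(vocab, stopwords);
X = X(:, keep);
vocab = vocab(keep);

% Single-instance removal (assumed criterion): keep only the words
% that occur in at least two documents
keep = sum(X > 0, 1) >= 2;
X = X(:, keep);
vocab = vocab(keep);
```

After both steps only the columns for 'memory' and 'cingulate' remain. Removing the query term itself, as done with 'perirhinal' above, follows the same pattern: a word present in nearly every abstract carries little information for separating topics.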

Text mining posterior cingulate abstracts in the Brede Database

The following code extract text mines the abstracts in the Brede Database that are associated with the posterior cingulate area. It is assumed that the posterior cingulate locations in the Brede Database have already been identified, see section 2.3. First, all the ``bib'' structures that contain posterior cingulate coordinates are found. Then the abstracts are converted to a bag-of-words vector representation. Single-instance words and stop words are removed, and the element-wise square root of the (abstract times word)-matrix is taken. The matrix is then factorized with the non-negative matrix factorization (NMF) algorithm. This is done hierarchically, with an increasing number of components (subspace dimension). Finally, the result of the NMF is plotted in a ``cluster bush'' visualization. See also [Nielsen et al., 2005, Nielsen et al., 2004].

% Lq - The coordinates of interest, i.e., the posterior cingulate
Lq = Lpccr;

% Get articles that contain the coordinates of interest
wobib = unique(brede_struct_select(Lq, 'select', 'wobib'));
Bq = brede_struct_select(B, 'where', {'wobib', 'any(==)' wobib});

% Convert the abstracts of the articles to a matrix
M = brede_bib_bib2mat(Bq, 'type', 'abstract');
M = brede_mat_elimsingle(M);
M = brede_mat_elimstop(M, 'filename', 'stop_english1.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_lobaranatomy.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_pubmed_neg1.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_meshcommon.txt');
M = brede_mat_elimstop(M, 'filename', 'stop_medline.txt');
Msqrt = brede_mat_scale(M, 'type', 'sqrt');

% Hierarchical non-negative matrix factorization of the matrix
[WW,HH] = brede_mat_hnmf(Msqrt, 'info', 3, 'runs', 3);

% Make plots of the factorization
figure
brede_mat_plot_clusterbush(WW, HH, 'nodetexttype', 'leftwta')
figure
brede_mat_plot_clusterbush(WW, HH, 'nodetexttype', 'rightwta')
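
The square-root scaling and the repeated factorization with a growing number of components can be sketched in plain Matlab as follows. This is a stand-in, not the algorithm inside brede_mat_hnmf: it assumes the standard multiplicative updates for the squared Frobenius error, a fixed number of iterations and random initialization, and the toy matrix, Kmax and nIter are made-up values. The names WW and HH only mirror the listing above; the layout of the toolbox's outputs is not assumed.

```matlab
% Toy non-negative document-by-word counts (made up), square-root scaled
X = sqrt([4 1 0 0; 9 4 0 1; 0 0 4 9; 0 1 1 4]);

Kmax  = 3;      % largest number of components (illustrative choice)
nIter = 200;    % fixed number of update iterations (illustrative choice)
WW  = cell(1, Kmax);
HH  = cell(1, Kmax);
err = zeros(1, Kmax);

for K = 1:Kmax
    % Random non-negative initialization
    W = rand(size(X, 1), K);
    H = rand(K, size(X, 2));
    for it = 1:nIter
        % Multiplicative updates; eps guards against division by zero
        H = H .* (W' * X) ./ (W' * W * H + eps);
        W = W .* (X * H') ./ (W * (H * H') + eps);
    end
    % Store the factorization for this number of components
    WW{K} = W;
    HH{K} = H;
    err(K) = norm(X - W * H, 'fro');
end
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout. The result depends on the random initialization, which is one reason to repeat the factorization several times, as the 'runs' option in the listing above suggests.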

Finn Årup Nielsen 2012-09-27