42.22. Analyzing COVID-19 papers#

In this challenge, we will continue with the topic of the COVID pandemic and focus on processing scientific papers on the subject. The CORD-19 Dataset contains more than 7000 (at the time of writing) papers on COVID, available with metadata and abstracts (and for about half of them the full text is also provided).

A full example of analyzing this dataset using the Text Analytics for Health cognitive service is described in this blog post. Here we will discuss a simplified version of this analysis.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

42.22.1. Getting the data#

First, we need to get the metadata for the CORD papers that we will be working with.

NOTE: We do not provide a copy of the dataset as part of this repository. You may first need to download the metadata.csv file from this dataset on Kaggle. Registration with Kaggle may be required. You may also download the dataset without registration from here, but it will include all full texts in addition to the metadata file.

We will try to get the data directly from an online source; however, if this fails, you need to download the data as described above. It also makes sense to download the data if you plan to experiment with it further, to save on waiting time.

NOTE that the dataset is quite large, around 1 GB in size, and the following line of code can take a long time to complete (~5 minutes)!

df = pd.read_csv(
    "https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/data-science/metadata.csv"
)
df.head()
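
If the download is too slow or fails, you can load a local copy instead. A minimal sketch, assuming you downloaded metadata.csv from Kaggle into the current working directory (adjust the path to wherever you saved it):

# Fallback: read a locally downloaded copy of the metadata file
# (the path below is an assumption - point it to your own copy)
df = pd.read_csv("metadata.csv")
df.head()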

We will now convert the publication date column to datetime and plot a histogram to see the range of publication dates.

df['publish_time'] = pd.to_datetime(df['publish_time'])
df['publish_time'].hist()
plt.show()

Interestingly, there are coronavirus-related papers that date back to 1880!
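
If you want to zoom in on the pandemic period itself, you can restrict the histogram to recent papers. A minimal sketch; the cut-off date of 2020-01-01 is just an illustrative choice:

# Keep only papers published from 2020 onwards and re-plot the histogram
df[df['publish_time'] >= '2020-01-01']['publish_time'].hist()
plt.show()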

42.22.2. Structured data extraction#

Let’s see what kind of information we can easily extract from the abstracts. One thing we might be interested in is which treatment strategies exist and how they evolved over time. To begin with, we can manually compile a list of possible medications used to treat COVID, as well as a list of diagnoses. We then go over them and search for the corresponding terms in the abstracts of the papers.

medications = [
    'hydroxychloroquine', 'chloroquine', 'tocilizumab', 'remdesivir', 'azithromycin', 
    'lopinavir', 'ritonavir', 'dexamethasone', 'heparin', 'favipiravir', 'methylprednisolone']
diagnosis = [
    'covid','sars','pneumonia','infection','diabetes','coronavirus','death'
]

for m in medications:
    print(f" + Processing medication: {m}")
    df[m] = df['abstract'].apply(lambda x: str(x).lower().count(' '+m))
    
for m in diagnosis:
    print(f" + Processing diagnosis: {m}")
    df[m] = df['abstract'].apply(lambda x: str(x).lower().count(' '+m))

We have added a bunch of columns to our dataframe that contain the number of times a given medication/diagnosis is mentioned in the abstract.

Note that we add a space to the beginning of the word when looking for a substring. If we did not do that, we might get wrong results, because chloroquine would also be found inside the substring hydroxychloroquine. Also, we force conversion of the abstract column to str to get rid of an error caused by missing (NaN) values - try removing str and see what happens.
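
If you want a match that is more robust than the leading-space trick (which, for example, misses a term at the very start of an abstract or one preceded by a bracket), you could use regular-expression word boundaries instead. A minimal sketch; the column name chloroquine_wb is just an illustrative placeholder, and the rest of the chapter keeps using the counts computed above:

import re

# Count whole-word occurrences of 'chloroquine' using word boundaries (\b),
# so that 'hydroxychloroquine' is not counted as a match
df['chloroquine_wb'] = df['abstract'].apply(
    lambda x: len(re.findall(r'\bchloroquine\b', str(x).lower())))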

To make working with the data easier, we can extract a sub-frame with only the medication counts and compute the accumulated number of occurrences. This shows us the most popular medications:

dfm = df[medications]
dfm = dfm.sum().reset_index().rename(columns={ 'index' : 'Name', 0 : 'Count'})
dfm = dfm.sort_values('Count', ascending=False)
dfm.set_index('Name').plot(kind='bar')
plt.show()
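
The same accumulation can be done for the diagnosis columns, if you want to see which conditions are mentioned most often. A sketch that simply mirrors the code above:

# Accumulate and plot the diagnosis counts in the same way
dfd = df[diagnosis]
dfd = dfd.sum().reset_index().rename(columns={'index': 'Name', 0: 'Count'})
dfd = dfd.sort_values('Count', ascending=False)
dfd.set_index('Name').plot(kind='bar')
plt.show()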

42.22.4. Computing medicine-diagnosis correspondence#

One of the most interesting relationships we can look for is how different diagnoses are treated with different medicines. In order to visualize it, we need to compute a co-occurrence frequency map, which shows how many times two terms are mentioned in the same paper.

Such a map is essentially a 2D matrix, which is best represented by a numpy array. We will compute this map by walking through all abstracts and marking the entities that occur there:

m = np.zeros((len(medications),len(diagnosis)))
for a in df['abstract']:
    x = str(a).lower()
    for i,d in enumerate(diagnosis):
        if ' '+d in x:
            for j,me in enumerate(medications):
                if ' '+me in x:
                    m[j,i] += 1
m

One of the ways to visualize this matrix is to draw a heatmap:

plt.imshow(m,interpolation='nearest',cmap='hot')
ax = plt.gca()
ax.set_yticks(range(len(medications))) 
ax.set_yticklabels(medications)
ax.set_xticks(range(len(diagnosis)))
ax.set_xticklabels(diagnosis,rotation=90)
plt.show()

However, an even better visualization can be created using a so-called Sankey diagram! matplotlib does not have built-in support for this diagram type, so we have to use Plotly, as described in this tutorial.

To make a Plotly Sankey diagram, we need to build the following lists (a tiny toy example is shown right after the list):

  • A list all_nodes of all nodes in the graph, which will include both medications and diagnoses

  • Lists of source and target indices - these show which nodes go to the left part of the diagram and which to the right

  • A list of all links, each link consisting of:

    • Source index in the all_nodes array

    • Target index

    • A value indicating the strength of the link. This is exactly the value from our co-occurrence matrix.

    • Optionally, the color of the link. We will add an option to highlight some of the terms for clarity.
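
To make the structure of these lists concrete, here is a tiny hypothetical example with two medications and two diagnoses (all names and numbers are made up purely for illustration):

# Hypothetical illustration of the lists Plotly expects
all_nodes = ['drug A', 'drug B', 'covid', 'pneumonia']   # medications first, then diagnoses
source = [0, 0, 1]        # link starts: drug A, drug A, drug B
target = [2, 3, 2]        # link ends:   covid,  pneumonia, covid
value  = [120, 30, 75]    # co-occurrence counts, i.e. strength of each link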

Generic code to draw the Sankey diagram is structured as a separate sankey function, which takes two lists (source and target categories) and a co-occurrence matrix. It also allows us to specify a threshold and omit all links that are weaker than that threshold - this makes the diagram a little less complex.

import plotly.graph_objects as go

def sankey(cat1, cat2, m, threshold=0, h1=[], h2=[]):
    all_nodes = cat1 + cat2
    source_indices = list(range(len(cat1)))                          # nodes on the left
    target_indices = list(range(len(cat1), len(cat1) + len(cat2)))   # nodes on the right

    # Build the link lists: source index, target index, value and color
    s, t, v, c = [], [], [], []
    for i in range(len(cat1)):
        for j in range(len(cat2)):
            if m[i, j] > threshold:
                s.append(i)
                t.append(len(cat1) + j)
                v.append(m[i, j])
                c.append('pink' if i in h1 or j in h2 else 'lightgray')

    fig = go.Figure(data=[go.Sankey(
        # Define nodes
        node=dict(
            pad=40,
            thickness=40,
            line=dict(color="black", width=1.0),
            label=all_nodes),
        # Add links
        link=dict(
            source=s,
            target=t,
            value=v,
            color=c)
    )])
    fig.show()

sankey(medications,diagnosis,m,500,h2=[0])
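
Here we omit all links with 500 or fewer co-occurrences and, via h2=[0], highlight in pink the links that go to the first diagnosis term, covid.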

42.22.5. Conclusion#

You have seen that we can use quite simple methods to extract information from unstructured data sources, such as text. In this example, we started from an existing list of medications, but it would be much more powerful to use natural language processing (NLP) techniques to perform entity extraction from the text. In this blog post we describe how to use cloud services for entity extraction. Another option would be to use Python NLP libraries such as NLTK - an approach for extracting information from text using NLTK is described here.
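
As a small taste of the token-based approach, here is a minimal sketch using NLTK's simple regex tokenizer; it still only matches our hand-made list of terms, so it is not real entity extraction, just a starting point:

from nltk.tokenize import wordpunct_tokenize

# Tokenize one abstract and keep only tokens from our medication list;
# token-level matching avoids the substring problem without the leading-space trick
tokens = wordpunct_tokenize(str(df['abstract'][0]).lower())
print([t for t in tokens if t in medications])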

42.22.6. Challenge#

Continue to research the COVID paper data along the following lines:

  1. Build a co-occurrence matrix of different medications, and see which medications often occur together (i.e. are mentioned in the same abstract). You can modify the code above that builds the co-occurrence matrix for medications and diagnoses.

  2. Visualize this matrix using a heatmap.

  3. As a stretch goal, you may want to visualize the co-occurrence of medications using a chord diagram. This library may help you draw a chord diagram.

  4. As another stretch goal, try to extract dosages of different medications (such as 400mg in take 400mg of chloroquine daily) using regular expressions, and build a dataframe that shows different dosages for different medications. Note: consider numeric values that are in close textual vicinity of the medicine name. A minimal regex sketch is given after this list.
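
A minimal regex sketch for the dosage part of the last item, assuming dosages look like a number followed by a unit such as mg; the pattern and the example sentence are only illustrative, and you would still need to check that the match occurs close to a medication name:

import re

# Very rough pattern: a number, optional space, then a dose unit
dose_re = re.compile(r'(\d+(?:\.\d+)?)\s*(mg|g|ml|mcg)\b')

example = "patients received 400mg of hydroxychloroquine daily"
print(dose_re.findall(example))   # -> [('400', 'mg')]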

42.22.7. Acknowledgments#

Thanks to Microsoft for creating the open-source course Data Science for Beginners. It inspires the majority of the content in this chapter.