{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyzing text about Data Science\n", "\n", "In this example, let's do a simple exercise that covers all steps of a traditional data science process. You do not have to write any code, you can just click on the cells below to execute them and observe the result. As a challenge, you are encouraged to try this code out with different data. \n", "\n", "## Goal\n", "\n", "In this section, we have been discussing different concepts related to Data Science. Let's try to discover more related concepts by doing some **text mining**. We will start with a text about Data Science, extract keywords from it, and then try to visualize the result.\n", "\n", "As a text, I will use the page on Data Science from Wikipedia:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ipytest\n", "import unittest\n", "import pytest\n", "\n", "ipytest.autoconfig()\n", "\n", "url = 'https://en.wikipedia.org/wiki/Data_science'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Getting the data\n", "\n", "First step in every data science process is getting the data. We will use `requests` library to do that as below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "def http_get(url):\n", " return requests.____.____\n", "\n", "text = http_get(url)\n", "print(text[:1000])" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "
Check the result by executing the cell below... 📝</summary>
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%%ipytest -qq\n", "\n", "test_url = \"https://bing.com\"\n", "\n", "def test_http_get_happy_case():\n", " # act\n", " actual_result = http_get(test_url)\n", "\n", " # assert\n", " assert actual_result.startswith(\"\")\n", "\n", "def test_http_get_with_none_url():\n", " # act & assert\n", " with pytest.raises(Exception):\n", " http_get(None)\n", "\n", "def test_http_get_with_invalid_url():\n", " # act & assert\n", " with pytest.raises(Exception):\n", " actual_result = http_get(\"https://notexisting\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint</summary>\n", "\n", "Consider using the <code>get</code> method of the <code>requests</code> library and returning the text of the response.\n", "\n", "</details>
\n", "\n", "
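\n", "\n", "<details>\n", "  <summary>💡 One possible completion (sketch)</summary>\n", "\n", "This is only a sketch of one way to fill in the blanks, assuming the page is reachable; other variants (for example, checking the status code first) are just as valid.\n", "\n", "```python\n", "import requests\n", "\n", "def http_get(url):\n", "    # requests.get() returns a Response object; its .text attribute\n", "    # holds the decoded body of the page as a string.\n", "    return requests.get(url).text\n", "```\n", "\n", "</details>\n", "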
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Transforming the data\n", "\n", "The next step is to convert the data into the form suitable for processing. In our case, we have downloaded HTML source code from the page, and we need to convert it into plain text.\n", "\n", "There are many ways this can be done. We will use the simplest built-in [HTMLParser](https://docs.python.org/3/library/html.parser.html) object from Python. We need to subclass the `HTMLParser` class and define the code that will collect all text inside HTML tags, except `\n", "
\n", " {test_content}\n", "
\n", " \n", " \"\"\")\n", " actual_result = parser.res\n", " \n", " # assert\n", " assert actual_result.strip() == test_content\n", " \n", " def test_feed_with_empty_content(self):\n", " # assign\n", " parser = MyHTMLParser()\n", " \n", " # act \n", " parser.feed(\"\")\n", " actual_result = parser.res\n", " \n", " # assert\n", " assert actual_result == \"\"\n", " \n", " def test_feed_with_none(self):\n", " # assign\n", " parser = MyHTMLParser()\n", " \n", " # act & assert\n", " with pytest.raises(Exception):\n", " parser.feed(None)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint\n", "\n", "You can refer to the HTMLParser documentation to learn more. \n", "\n", "
\n", "\n", "
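\n", "\n", "<details>\n", "  <summary>💡 One possible completion (sketch)</summary>\n", "\n", "A minimal sketch of one way to fill in the blanks: collect text in `handle_data` and skip everything inside script and style elements. It is not the only valid solution.\n", "\n", "```python\n", "from html.parser import HTMLParser\n", "\n", "class MyHTMLParser(HTMLParser):\n", "    script = False\n", "    res = ''\n", "\n", "    def handle_starttag(self, tag, attrs):\n", "        # Ignore everything inside script and style elements.\n", "        if tag.lower() in ['script', 'style']:\n", "            self.script = True\n", "\n", "    def handle_endtag(self, tag):\n", "        if tag.lower() in ['script', 'style']:\n", "            self.script = False\n", "\n", "    def handle_data(self, data):\n", "        # Keep only non-empty text found outside script/style blocks.\n", "        if data.strip() == '' or self.script:\n", "            return\n", "        self.res += ' ' + data\n", "```\n", "\n", "</details>\n", "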
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Getting insights\n", "\n", "The most important step is to turn our data into some form from which we can draw insights. In our case, we want to extract keywords from the text, and see which keywords are more meaningful.\n", "\n", "We will use Python library called [RAKE](https://github.com/aneesha/RAKE) for keyword extraction. First, let's install this library in case it is not present: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install nlp_rake" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main functionality is available from `Rake` object, which we can customize using some parameters. In our case, we will set the minimum length of a keyword to 5 characters, minimum frequency of a keyword in the document to 3, and maximum number of words in a keyword - to 2. Feel free to play around with other values and observe the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nlp_rake\n", "\n", "# Extract the keywords with these conditions: max words as 2, min frequence as 3, and min characters as 5.\n", "extractor = nlp_rake.Rake(____)\n", "\n", "extracted_text = extractor.apply(text)\n", "extracted_text" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "
\n", " \n", "
👩‍💻 Hint</summary>\n", "\n", "Consider passing the three conditions from the comment (maximum words, minimum frequency, minimum characters) as keyword arguments to <code>nlp_rake.Rake</code>.\n", "\n", "</details>
\n", "\n", "
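\n", "\n", "<details>\n", "  <summary>💡 One possible completion (sketch)</summary>\n", "\n", "A sketch of one way to fill in the blank. The keyword arguments `max_words`, `min_freq`, and `min_chars` are assumed to match the installed version of `nlp_rake`; check `help(nlp_rake.Rake)` if your version names them differently.\n", "\n", "```python\n", "import nlp_rake\n", "\n", "# At most 2 words per keyword, appearing at least 3 times, each at least 5 characters long.\n", "extractor = nlp_rake.Rake(max_words=2, min_freq=3, min_chars=5)\n", "\n", "extracted_text = extractor.apply(text)\n", "extracted_text\n", "```\n", "\n", "</details>\n", "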
" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We obtained a list terms together with associated degree of importance. As you can see, the most relevant disciplines, such as machine learning and big data, are present in the list at top positions." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Visualizing the result\n", "\n", "People can interpret the data best in the visual form. Thus it often makes sense to visualize the data in order to draw some insights. We can use `matplotlib` library in Python to plot simple distribution of the keywords with their relevance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "def plot(pair_list):\n", " k, v = ____ # get the keys and values from pair_list\n", " plt.bar(____, ____) # initialize a bar graph with the k and v\n", " plt.xticks(____, ____, rotation='vertical') # set the ticks of the axies\n", " plt.show() # show the graph\n", "\n", "plot(extracted_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is, however, even better way to visualize word frequencies - using **Word Cloud**. We will need to install another library to plot the word cloud from our keyword list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!{sys.executable} -m pip install --quiet wordcloud" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`WordCloud` object is responsible for taking in either original text, or pre-computed list of words with their frequencies, and returns and image, which can then be displayed using `matplotlib`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from wordcloud import WordCloud\n", "import matplotlib.pyplot as plt\n", "\n", "wc = WordCloud(background_color='white', width=800, height=600)\n", "plt.figure(figsize=(15, 7))\n", "plt.imshow(wc.generate_from_frequencies(____)) # unpack the extracted_text as a dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also pass in the original text to `WordCloud` - let's see if we are able to get similar result:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 7))\n", "plt.imshow(wc.generate(____))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wc.generate(text).to_file('images/ds_wordcloud.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that word cloud now looks more impressive, but it also contains a lot of noise (eg. unrelated words such as `Retrieved on`). Also, we get fewer keywords that consist of two words, such as *data scientist*, or *computer science*. This is because RAKE algorithm does much better job at selecting good keywords from text. This example illustrates the importance of data pre-processing and cleaning, because clear picture at the end will allow us to make better decisions.\n", "\n", "In this exercise we have gone through a simple process of extracting some meaning from Wikipedia text, in the form of keywords and word cloud. This example is quite simple, but it demonstrates well all typical steps a data scientist will take when working with data, starting from data acquisition, up to visualization." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Acknowledgments\n", "\n", "Thanks to Microsoft for creating the open-source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It inspired much of the content in this chapter." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 2 }