<h1>Relation Extraction</h1>
<p>Gavin Junjie Xing, 2017-10-13</p>
<p>Table of contents</p>
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#ml">Machine Learning</a>
<ul>
<li><a href="#rule">Rule based</a></li>
<li><a href="#bs">Bootstrapping</a></li>
<li><a href="#super">Supervised method</a></li>
<li><a href="#distant-super">Distant supervised method</a></li>
<li><a href="#mi">Multi-instance learning</a></li>
<li><a href="#miml">Multi-instance Multi-labeling</a></li>
</ul>
</li>
<li><a href="#nn">Neural Network</a>
<ul>
<li><a href="#snn">Simple NN model</a></li>
<li><a href="#cnn-max">CNN with max-pooling</a></li>
<li><a href="#cnn-multi-kernel">CNN with multi-sized kernels</a></li>
<li><a href="#pcnn">Piecewise CNN with multi-instance learning</a></li>
<li><a href="#att">Attention over instances</a></li>
<li><a href="#mimlcnn">Multi-instance Multi-labeling CNN</a></li>
<li><a href="#seq-tag">Sequence Tagging Approach</a></li>
</ul>
</li>
<li><a href="#reading">Recommended Reading</a></li>
<li><a href="#reference">Reference</a></li>
</ul>
<p>During the summer vacation, I worked on the <strong>extraction of diseases and their symptoms</strong> at Synyi (a medical AI startup). Since I had not worked on this problem before, I did a <strong>literature review</strong> to learn the different methods used to solve the Relation Extraction problem.</p>
<p>On 10/11/2017, I attended a seminar hosted at <a href="https://adapt.seiee.sjtu.edu.cn/">ADAPT</a>, where one of the master’s students (Yangyang) gave a talk about <strong>Relation Extraction (RE)</strong>. It reminded me of the literature review I had done a month earlier, so I decided to write this post, which I hope <strong>gives an overview of the development of RE methods</strong>.</p>
<h2 id="intro">Chapter 0: Introduction</h2>
<p><strong>Relation Extraction (RE)</strong> is a sub-task of <strong>Information Extraction (IE)</strong>, alongside <strong>Named Entity Recognition (NER)</strong> and <strong>Event Extraction</strong>.</p>
<p>The purpose of RE is to address the problem of <strong>machine reading</strong>. Once structured data that a machine can utilize has been constructed from unstructured text, the machine can <em>“understand” the text</em>.</p>
<div class="imgcap">
<img src="/blog/assets/re/machine-reading.png" style="border:none;" />
<div class="thecap">
Figure 1: Machine Reading
</div>
</div>
<p>Here is an example of RE:</p>
<blockquote>
<p>CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. <strong>American Airlines</strong>, <strong>a unit of AMR</strong>, immediately matched the move, <strong>spokesman Tim Wagner</strong> said. <strong>United</strong>, <strong>a unit of UAL</strong>, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.</p>
</blockquote>
<p>There are many relations that could be extracted from the text:</p>
<table class="display cell-border" cellspacing="0" width="100%" style="text-align: center;">
<thead>
<tr>
<th>Subject</th>
<th>Relation</th>
<th>Object</th>
</tr>
</thead>
<tbody>
<tr>
<td>American Airlines</td>
<td>subsidiary</td>
<td>AMR</td>
</tr>
<tr>
<td>Tim Wagner</td>
<td>employee</td>
<td>American Airlines</td>
</tr>
<tr>
<td>United Airlines</td>
<td>subsidiary</td>
<td>UAL</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>
<p><strong>In this blog</strong>, I’ll introduce the different methods researchers have proposed to solve the problem. The methods are divided into 2 categories: the <strong>traditional machine learning</strong> approach and the recently risen <strong>neural network</strong> approach.</p>
<p>I make the <strong>assumption</strong> that readers are familiar with basic machine learning and natural language processing (NLP) concepts, such as <strong>classifier, POS tag, corpus</strong>, to name just a few. I’ll focus on the <strong>ideas</strong> used to solve the problem and will not go deep into details such as what a model is or how to train one.</p>
<h2 id="ml">Chapter 1: Machine Learning</h2>
<p>Researchers started to explore the field in the 1990s. They came up with different ideas to achieve better performance and to handle increasingly complicated circumstances.</p>
<h3 id="rule">Rule based</h3>
<p>The rule-based approach is straightforward (it may not be considered a machine learning method; I put it here because it is the most different from the neural network approaches). It assumes that a <strong>pattern</strong> (“pattern” and “rule” mean the same in this discussion) that appears in one instance will appear again and again for the same relation type. Thus we can extract hundreds or thousands of relation pairs with a single pattern from a huge corpus, sometimes from the nearly unlimited data on the Internet.</p>
<p>The rule-based approach utilizes traditional NLP tools, such as word segmentation (in Chinese), NER, POS tagging, dependency parsing, etc.</p>
<p>In 1992, for instance, <a href="http://www.aclweb.org/anthology/C92-2082.pdf">Hearst et al.</a>[1] proposed a rule-based way to extract hyponymy. Some of the rules they used are listed below.</p>
<table class="display cell-border" cellspacing="0" width="100%" style="text-align: center;">
<thead>
<tr>
<th>Pattern</th>
<th>Example</th>
<th>Hyponym</th>
</tr>
</thead>
<tbody>
<tr>
<td>Y, such as {X,}* {or | and} X</td>
<td>... works by such authors as Herrick, Goldsmith, and Shakespeare.</td>
<td>("author", "Herrick") ...</td>
</tr>
<tr>
<td>X {,X}* {,} or other Y</td>
<td>Bruises, wounds, broken bones or other injuries ... </td>
<td>("injury", "bruise") ...</td>
</tr>
<tr>
<td>X {, X} * {,} and other Y</td>
<td>... temples, treasuries, and other important civic buildings.</td>
<td>("civic building", "temple") ...</td>
</tr>
<tr>
<td>Y {,} including {X,}* {or | and} X</td>
<td>All common-law countries, including Canada and England ...</td>
<td>("common-law country", "Canada")</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>
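<p>As an illustration, the first pattern in the table above can be approximated with a plain regular expression. This is only a minimal sketch, not Hearst’s actual implementation, which matched over POS-tagged noun phrases rather than raw words:</p>

```python
import re

# One Hearst pattern, "Y such as X, X (and|or) X", as a bare regex.
# A real system would match over noun phrases, not single word tokens.
PATTERN = re.compile(
    r"(?P<hypernym>\w+) such as "
    r"(?P<hyponyms>\w+(?:, \w+)*(?:,? (?:and|or) \w+)?)"
)

def extract_hyponyms(sentence):
    """Return (hypernym, hyponym) pairs matched by the pattern."""
    pairs = []
    for m in PATTERN.finditer(sentence):
        for hyponym in re.split(r",? (?:and|or) |, ", m.group("hyponyms")):
            pairs.append((m.group("hypernym"), hyponym))
    return pairs

print(extract_hyponyms("He cited authors such as Herrick, Goldsmith and Shakespeare."))
```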
<p>There are several shortcomings of the rule-based method.</p>
<ul>
<li>Requires <strong>hand-built</strong> patterns for <strong>each relation</strong>
<ul>
<li>hard to write and maintain</li>
<li>almost unlimited patterns</li>
<li>domain-dependent</li>
</ul>
</li>
<li>The accuracy is <strong>not satisfying</strong>:
<ul>
<li>Hearst (the system above): 66% accuracy.</li>
</ul>
</li>
</ul>
<h3 id="bs">Bootstrapping</h3>
<p>By now you may find it boring and time-consuming to look for these patterns in a corpus. Fortunately, a method called <em>“bootstrapping”</em> was proposed to relieve your hands.</p>
<p>When you have <strong>a lot of unlabeled data</strong> (e.g. from the Internet) and some <strong>seeds of relation pairs or patterns</strong> that work well, you can use the seeds to find more sentences that contain the relation pair and <strong>generate more patterns</strong> that are likely to express the same relation. The model thus learns patterns, uses them to collect more instances, and derives still more patterns from those. After a few iterations, the yield is tremendous.</p>
<p>Bootstrapping can be considered a semi-supervised approach. The image below is a clear representation.</p>
<div class="imgcap">
<img src="/blog/assets/re/bootstrapping.png" style="border:none;" width="75%" />
<div class="thecap">
Figure 2: Bootstrapping. Image credit: Jim Martin
</div>
</div>
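<p>The seed-to-pattern loop described above can be sketched in a few lines. All names here are illustrative; the whole-sentence regex templates are a deliberately naive stand-in for the pattern generalization and scoring a real system would need:</p>

```python
import re

# Sketch of the bootstrapping loop: seed pairs -> patterns -> more pairs.

def pattern_from(sentence, subj, obj):
    """Turn a sentence mentioning a known pair into a regex template."""
    p = re.escape(sentence)
    p = p.replace(re.escape(subj), r"(?P<subj>[\w ]+)", 1)
    return p.replace(re.escape(obj), r"(?P<obj>[\w ]+)", 1)

def bootstrap(corpus, seed_pairs, iterations=2):
    pairs, patterns = set(seed_pairs), set()
    for _ in range(iterations):
        # Step 1: generalize sentences containing a known pair into patterns.
        for s in corpus:
            for subj, obj in list(pairs):
                if subj in s and obj in s:
                    patterns.add(pattern_from(s, subj, obj))
        # Step 2: apply every pattern to the corpus to harvest new pairs.
        for s in corpus:
            for p in patterns:
                m = re.fullmatch(p, s)
                if m:
                    pairs.add((m.group("subj"), m.group("obj")))
    return pairs

corpus = ["Isaac Asimov wrote The Robots of Dawn.",
          "Charles Dickens wrote Great Expectations."]
found = bootstrap(corpus, {("Isaac Asimov", "The Robots of Dawn")})
print(found)
```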
<p>For example, <a href="http://ilpubs.stanford.edu:8090/421/1/1999-65.pdf">Brin</a>[2] used bootstrapping to extract <em>(author, book)</em> pairs from the Internet.</p>
<p>He started with only <strong>5</strong> relation pairs:</p>
<table class="display cell-border" cellspacing="0" width="100%" style="text-align: center;">
<thead>
<tr>
<th>Author</th>
<th>Book</th>
</tr>
</thead>
<tbody>
<tr>
<td>Isaac Asimov</td>
<td>The Robots of Dawn</td>
</tr>
<tr>
<td>David Brin</td>
<td>Startide Rising</td>
</tr>
<tr>
<td>James Gleick</td>
<td>Chaos: Making a New Science</td>
</tr>
<tr>
<td>Charles Dickens</td>
<td>Great Expectations</td>
</tr>
<tr>
<td>William Shakespeare</td>
<td>The Comedy of Errors</td>
</tr>
</tbody>
</table>
<p>After several iterations, over 15,000 relation-pairs were found with an accuracy of 95%.</p>
<p>The bootstrapping method may seem promising; however, this is not always the case. The high accuracy of the above experiment <strong>doesn’t</strong> carry over to other relation types. Take (President, Country) as an example: we may choose (Obama, America) as a seed, and it is obvious that this will generate many noisy patterns, for instance from sentences like:</p>
<blockquote>
<p>Obama returned to America after visiting Peking, China.</p>
</blockquote>
<p>Such noise would be amplified during iterations and reduce the accuracy.</p>
<p>So there are some <strong>problems</strong> with bootstrapping:</p>
<ul>
<li>Requires seeds for <strong>each relation type</strong>
<ul>
<li>result is sensitive to the seeds</li>
</ul>
</li>
<li><strong>Semantic drift</strong> at each iteration
<ul>
<li>precision not high</li>
</ul>
</li>
<li>No probabilistic interpretation</li>
</ul>
<h3 id="super">Supervised method</h3>
<p>Consider the situation in which we have gathered <strong>plenty of labeled data</strong>; the relation extraction task can then be treated as a <strong>classification</strong> task. The input is a sentence with two entities (a binary relation); the output can be either a boolean value (whether a certain relation holds between the two entities) or a relation type.</p>
<p>The things we need to do are:</p>
<ol>
<li>Collect labeled data</li>
<li>Define output label</li>
<li>Define features</li>
<li>Choose a classifier</li>
<li>Train the model</li>
</ol>
<p>We can use <strong>as many features</strong> as we want to improve the performance of the model. In practice, <strong>NLP tools</strong> are frequently used to extract features:</p>
<ul>
<li>bags of words; bi-gram before/after entities; distance between entities</li>
<li>phrase chunk path; bags of chunk heads</li>
<li>dependency-tree path; tree distance between entities</li>
<li>…</li>
</ul>
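<p>To make the feature list concrete, here is a minimal sketch of computing a few such features for one instance. It assumes the sentence is pre-tokenized and the entity token positions are known; the exact feature set is illustrative rather than taken from any specific paper:</p>

```python
# Sketch of hand-crafted features for one training instance.

def extract_features(tokens, e1_idx, e2_idx):
    left, right = sorted((e1_idx, e2_idx))
    return {
        "distance": right - left,                            # token distance between entities
        "bow_between": sorted(set(tokens[left + 1:right])),  # bag of words between entities
        "before_e1": tuple(tokens[max(0, left - 2):left]),   # up to 2 tokens before 1st entity
        "after_e2": tuple(tokens[right + 1:right + 3]),      # up to 2 tokens after 2nd entity
    }

tokens = "American Airlines , a unit of AMR , matched the move".split()
feats = extract_features(tokens, 1, 6)  # entities at indices 1 ("Airlines") and 6 ("AMR")
print(feats)
```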
<p>Also, classifiers are <strong>free to choose</strong>:</p>
<ul>
<li>Support Vector Machine</li>
<li>Logistic Regression</li>
<li>Naive Bayes</li>
<li>…</li>
</ul>
<p><strong>To summarize</strong>:</p>
<ul>
<li>Supervised approach can achieve <strong>high accuracy</strong>
<ul>
<li>if we have access to lots of hand-labeled training data</li>
</ul>
</li>
<li>Its <strong>limitations</strong> are equally significant
<ul>
<li>Hand labeling is <strong>expensive</strong></li>
<li>Feature engineering is <strong>domain-dependent</strong></li>
</ul>
</li>
</ul>
<h3 id="distant-super">Distant supervised method</h3>
<p>As discussed in the last section, the supervised method works well when lots of labeled data is available. But labeling so much data is a great burden for researchers. With the ambition to utilize the almost unlimited unlabeled data, researchers came up with a method that can generate <strong>vast (though noisy)</strong> training data, named <strong>distant supervision</strong> (<a href="http://aclweb.org/anthology/P09-1113">Mintz et al.</a>[3]).</p>
<p>The <strong>assumption</strong> is:</p>
<blockquote>
<p>If two entities participate in a relation, <strong>any</strong> sentence containing those two entities is likely to express that relation.</p>
</blockquote>
<p>With the help of <strong>high quality relation databases</strong> (such as Freebase), we can annotate tremendous training data with the unlabeled text.</p>
<p>You may notice that it is similar to <em>bootstrapping</em>; both attempt to take advantage of unlabeled data. However, <em>bootstrapping</em> simply uses patterns to match objects, while distant supervision utilizes rich feature engineering and a classifier to give a probabilistic interpretation of the RE task.</p>
<p>Distant supervised method has many <strong>advantages</strong>:</p>
<ul>
<li>Leverage rich, reliable hand-created knowledge (the databases)</li>
<li>Leverage unlimited unlabeled text data</li>
<li>Not sensitive to corpus (collecting training data step)</li>
</ul>
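<p>The labeling step can be sketched as follows. The two-entry “knowledge base” below is illustrative, standing in for a resource like Freebase, and plain substring matching stands in for proper entity linking:</p>

```python
# Sketch of distant-supervision labeling: any sentence containing both
# entities of a known triple gets that triple's relation label.

KB = {("American Airlines", "AMR"): "subsidiary",
      ("Tim Wagner", "American Airlines"): "employee"}

def label_corpus(sentences):
    labeled = []
    for s in sentences:
        for (e1, e2), rel in KB.items():
            if e1 in s and e2 in s:  # naive string match; real systems use NER
                labeled.append((s, e1, e2, rel))
    return labeled

corpus = ["American Airlines, a unit of AMR, matched the move.",
          "Tim Wagner works for American Airlines."]
labeled = label_corpus(corpus)
print(labeled)
```

The resulting noisy labels are exactly what the multi-instance methods below are designed to tolerate.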
<h3 id="mi">Multi-instance learning</h3>
<p>The assumption made by distant supervision is so strong that the training data is noisy. In 2010, <a href="https://link.springer.com/chapter/10.1007%2F978-3-642-15939-8_10?LI=true">Riedel et al.</a>[7] <strong>relaxed the distant supervision assumption</strong> to:</p>
<blockquote>
<p>If two entities participate in a relation, <strong>at least one</strong> sentence that mentions these two entities might express that relation.</p>
</blockquote>
<p>So the task could be modeled as a multi-instance learning problem, thus exploiting the large training data created by distant supervision while being robust to the noise.</p>
<p>A multi-instance learning problem is a form of supervised learning where a label is given to <strong>a bag of instances</strong> instead of a single instance. In the context of RE, every entity pair defines <strong>a bag consisting of all sentences that mention the entity pair</strong>. Rather than labeling every sentence, a label is instead given to each bag.</p>
<h3 id="miml">Multi-instance Multi-labeling (MIML)</h3>
<p>Multi-instance learning assumes that one bag of instances has only one relation type, which is not the case in reality. Clearly, one entity pair can have more than one relation.</p>
<p>In 2012, <a href="https://dl.acm.org/citation.cfm?id=2391003">Surdeanu et al.</a>[9] proposed a MIML method to address this shortcoming.</p>
<div class="imgcap">
<img src="/blog/assets/re/MIML.png" style="border:none;" width="50%" />
<div class="thecap">
Figure 3: MIML model plate diagram. Image credit: <a href="https://dl.acm.org/citation.cfm?id=2391003">Surdeanu et al.</a>[9]
</div>
</div>
<p>This method is more complicated than the previous ones: it uses a multi-class classifier together with several binary classifiers. I recommend reading the <a href="https://dl.acm.org/citation.cfm?id=2391003">paper</a> on your own.</p>
<h2 id="nn">Chapter 2: Neural Network Approach</h2>
<p>Since 2014, researchers have started to use neural network models to solve the task. The main idea is the same: treat the RE task as a classification problem. Experiments show that neural networks outperform many state-of-the-art machine learning methods. In this chapter, I’ll introduce <strong>several NN architectures</strong> proposed for the RE task, each contributing ideas behind the high performance.</p>
<p>Recently another approach has been proposed that treats RE as a <strong>sequence tagging</strong> problem (sometimes jointly extracting relations with entities); I’ll discuss this approach later in the post.</p>
<h3 id="snn">Simple NN model</h3>
<div class="imgcap">
<img src="/blog/assets/re/simple-nn.png" style="border:none;" width="60%" />
<div class="thecap">
Figure 4: Simple NN architecture. Image credit: <a href="http://www.aclweb.org/anthology/C14-1220">Zeng et al.</a>[4]
</div>
</div>
<p>The simplest NN model uses a <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">word embedding</a>[5] as input, extracts features to obtain a fixed-length representation of the sentence, and then applies a linear layer to do the classification.</p>
<h3 id="cnn-max">CNN with max-pooling</h3>
<p>You may have noticed that I did not explain the <em>feature extraction layer</em> in <em>Figure 4</em>. I skipped it deliberately so that I could introduce it in this section.</p>
<p>As you can see in <em>Figure 5</em>, <a href="http://www.aclweb.org/anthology/C14-1220">Zeng et al.</a>[4] use a <em>convolution layer</em> to get a fixed-length sentence-level feature.</p>
<p>Note that in the first layer, the <strong>WF</strong> stands for <em>Word Feature</em> (word embedding), and the <strong>PF</strong> stands for <em>Position Feature</em> which encodes the word’s distance to the entities.</p>
<div class="imgcap">
<img src="/blog/assets/re/cnn-with-maxpooling.png" style="border:none;" width="60%" />
<div class="thecap">
Figure 5: Sentence-level Feature Extraction. Image credit: <a href="http://www.aclweb.org/anthology/C14-1220">Zeng et al.</a>[4]
</div>
</div>
<h3 id="cnn-multi-kernel">CNN with multi-sized kernels</h3>
<p>In 2015, <a href="https://pdfs.semanticscholar.org/eb9f/b8385c5824b029633c0cb68a8fb8573380ad.pdf">Nguyen et al.</a>[6] proposed using multi-sized kernels to enrich the sentence-level representation with <em>n-gram</em> information. Intuitively, kernels of different sizes are able to encode different <strong>n-gram information</strong>.</p>
<div class="imgcap">
<img src="/blog/assets/re/multi-kernel.png" style="border:none;" width="80%" />
<div class="thecap">
Figure 6: CNN with multi-sized kernels. Image credit: <a href="https://pdfs.semanticscholar.org/eb9f/b8385c5824b029633c0cb68a8fb8573380ad.pdf">Nguyen et al.</a>[6]
</div>
</div>
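<p>The idea can be illustrated with scalar token scores: each kernel size pools over windows of a different width, and the pooled values are concatenated into the sentence representation. A minimal sketch, with learned kernel weights reduced to plain window sums for clarity:</p>

```python
# Sketch of multi-sized convolution kernels: a window of size n
# aggregates an n-gram, max-pooling keeps the strongest n-gram, and
# the results for all kernel sizes are concatenated.

def conv_max(values, n):
    """Slide a size-n window (here a plain sum), then max-pool."""
    return max(sum(values[i:i + n]) for i in range(len(values) - n + 1))

values = [0.2, 0.9, 0.1, 0.4, 0.6]  # one illustrative score per token
sentence_feature = [conv_max(values, n) for n in (2, 3, 4)]  # kernel sizes 2, 3, 4
print(sentence_feature)
```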
<h3 id="pcnn">Piecewise CNN with multi-instance learning</h3>
<p>Recall the previously discussed <strong>multi-instance learning method</strong>, in 2015, <a href="http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP203.pdf">Zeng et al.</a>[8] proposed <strong>a neural network approach to exploit the relaxed distant supervision assumption</strong>.</p>
<p>Given all (<script type="math/tex">T</script>) training bags <script type="math/tex">(M_{i}, y_i)</script>, with <script type="math/tex">q_i</script> denoting the number of sentences in the <script type="math/tex">i^{th}</script> bag, the objective function is defined using cross-entropy at the bag level:</p>
<script type="math/tex; mode=display">J(\theta) = \sum_{i=1}^{T} \log p(y_i\ |\ {M{_i}{^{j^\ast}}}, \theta)</script>
<p>where <script type="math/tex">j^\ast</script> is constrained as:</p>
<script type="math/tex; mode=display">j^\ast = \arg \max_{j} p(y_i\ |\ {M{_i}{^j}}, \theta), 1 \le j \le q_i</script>
<p>From the equation we can see that <a href="http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP203.pdf">Zeng et al.</a>[8] <strong>give a label to a bag according to its most confident instance</strong>.</p>
<p>Another contribution of <a href="http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP203.pdf">Zeng et al.</a>[8] is the <strong>piecewise CNN (PCNN)</strong>. As reported in the paper, max-pooling over the whole sentence drastically reduces the size of the hidden layer and is <strong>not sufficient to capture the structure</strong> between the entities in the sentence. This can be avoided by <strong>applying max-pooling to different segments of the sentence</strong> instead of the whole sentence. In the RE task, a sentence can be naturally divided into 3 segments: before the first entity, between the two entities, and after the second entity.</p>
<div class="imgcap">
<img src="/blog/assets/re/pcnn.png" style="border:none;" width="80%" />
<div class="thecap">
Figure 7: Piecewise CNN (PCNN). Image credit: <a href="http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP203.pdf">Zeng et al.</a>[8]
</div>
</div>
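<p>Piecewise max-pooling itself is easy to sketch: assuming the convolution layer has already produced one feature vector per token position, the sentence is cut into the 3 segments around the entity positions and each segment is max-pooled on its own:</p>

```python
# Sketch of piecewise max-pooling over per-position feature vectors
# (tuples below); feature values are illustrative.

def piecewise_max_pool(features, e1_idx, e2_idx):
    segments = [features[:e1_idx + 1],            # up to and including entity 1
                features[e1_idx + 1:e2_idx + 1],  # between the entities
                features[e2_idx + 1:]]            # after entity 2
    # One max per segment and per feature dimension.
    return [tuple(max(col) for col in zip(*seg)) for seg in segments if seg]

feats = [(0.1, 0.9), (0.4, 0.2), (0.8, 0.3), (0.5, 0.7), (0.2, 0.6)]
print(piecewise_max_pool(feats, e1_idx=1, e2_idx=3))
```

Concatenating the three pooled vectors gives a representation three times larger than plain max-pooling, while preserving where in the sentence each strong feature occurred.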
<h3 id="att">Attention over instances</h3>
<p>The shortcoming of <a href="http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP203.pdf">Zeng et al.</a>[8] is that it only uses the most confident instance in each bag. To overcome this, <a href="http://thunlp.org/~lyk/publications/acl2016_nre.pdf">Lin et al.</a>[10] apply an <strong>attention mechanism</strong> over all the instances in the bag for the multi-instance problem.</p>
<div class="imgcap">
<img src="/blog/assets/re/attention.png" style="border:none;" width="50%" />
<div class="thecap">
Figure 8: Sentence-level
attention-based CNN. Image credit: <a href="http://thunlp.org/~lyk/publications/acl2016_nre.pdf">Lin et al.</a>[10]
</div>
</div>
<p>As shown in <em>Figure 8</em>, each sentence <script type="math/tex">x_i</script> in a bag is encoded into a distributed representation through PCNN (<a href="http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP203.pdf">Zeng et al.</a>[8]) or CNN. Then the feature vector representing the <script type="math/tex">i^{th}</script> bag <script type="math/tex">s_i</script> is given as,</p>
<script type="math/tex; mode=display">s_i = \sum_{j=1}^{q_i} \alpha_j x_i^j</script>
<p>Note that the <strong>attention parameter</strong> <script type="math/tex">\alpha_j</script> is formally defined in <a href="http://thunlp.org/~lyk/publications/acl2016_nre.pdf">Lin et al.</a>[10]. The equation above is <strong>a simplified version which expresses the same idea</strong>.</p>
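<p>A minimal sketch of the simplified equation above, with a placeholder dot-product scoring function against a query vector (not the exact parameterization of Lin et al.):</p>

```python
import math

# Sketch of the attention-weighted bag representation: softmax over
# per-instance scores, then a weighted sum of instance vectors.

def attention_pool(instances, query):
    scores = [sum(q * x for q, x in zip(query, inst)) for inst in instances]
    exps = [math.exp(s) for s in scores]
    alphas = [e / sum(exps) for e in exps]  # softmax attention weights
    dim = len(instances[0])
    return [sum(a * inst[d] for a, inst in zip(alphas, instances))
            for d in range(dim)]

bag = [[1.0, 0.0], [0.0, 1.0]]  # two encoded sentences in one bag
pooled = attention_pool(bag, query=[1.0, 0.0])
print(pooled)
```

Unlike selecting only the most confident instance, every sentence in the bag contributes, weighted by how relevant it looks.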
<h3 id="mimlcnn">Multi-instance Multi-labeling CNN</h3>
<p>Like the MIML model proposed by <a href="https://dl.acm.org/citation.cfm?id=2391003">Surdeanu et al.</a>[9], <a href="http://www.aclweb.org/anthology/C/C16/C16-1139.pdf">Jiang et al.</a>[11] proposed a MIML approach with a CNN architecture, named MIMLCNN.</p>
<div class="imgcap">
<img src="/blog/assets/re/mimlcnn.png" style="border:none;" />
<div class="thecap">
Figure 9: Overall architecture of MIMLCNN. Image credit: <a href="http://www.aclweb.org/anthology/C/C16/C16-1139.pdf">Jiang et al.</a>[11]
</div>
</div>
<p>The model uses <strong>cross-sentence max-pooling</strong> to encode the <strong>bag information</strong>. In the last layer the authors apply a <em>sigmoid</em> so that each element of the output vector can be interpreted as <strong>the probability of the corresponding relation type given the instance bag</strong>.</p>
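<p>The choice of sigmoid rather than softmax is what makes the output multi-label: each relation gets an independent probability, so several relations can exceed the decision threshold for one bag. A tiny sketch with illustrative relation names and logits:</p>

```python
import math

# Sketch of a multi-label output layer: one independent sigmoid per relation.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = {"subsidiary": 2.0, "employee": -1.0, "founder": 0.0}  # illustrative
probs = {rel: sigmoid(z) for rel, z in logits.items()}
predicted = [rel for rel, p in probs.items() if p > 0.5]
print(predicted)
```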
<h3 id="seq-tag">Sequence Tagging Approach</h3>
<p>In many situations, the performance of a <strong>pre-trained NER</strong> system heavily influences the downstream relation extraction task. So researchers began to try <strong>joint extraction of entities and relations</strong>.</p>
<p>In <a href="http://www.aclweb.org/anthology/P17-1085">Katiyar et al.</a>[12], the joint extraction is accomplished <strong>in two steps</strong>: first NER, then RE, as shown in <em>Figure 10</em>.</p>
<div class="imgcap">
<img src="/blog/assets/re/joint-1.png" style="border:none;" />
<div class="thecap">
Figure 10: Model architecture of <a href="http://www.aclweb.org/anthology/P17-1085">Katiyar et al.</a>[12]
</div>
</div>
<p>In <a href="https://arxiv.org/pdf/1706.05075.pdf">Zheng et al.</a>[13], the author proposed a novel tagging scheme to extract named entities and relations <strong>in one step</strong>. As shown in <em>Figure 11</em>, with the new annotating scheme, the joint extraction problem can be treated as a simple sequence tagging problem.</p>
<div class="imgcap">
<img src="/blog/assets/re/joint-tag.png" style="border:none;" />
<div class="thecap">
Figure 11: Novel tagging scheme. Image credit <a href="https://arxiv.org/pdf/1706.05075.pdf">Zheng et al.</a>[13]
</div>
</div>
<div class="imgcap">
<img src="/blog/assets/re/joint-2.png" style="border:none;" />
<div class="thecap">
Figure 12: Bi-LSTM model for sequence tagging. Image credit <a href="https://arxiv.org/pdf/1706.05075.pdf">Zheng et al.</a>[13]
</div>
</div>
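<p>To see how a single tag sequence can encode a relation triple, here is a sketch of decoding tags of the general shape position-relation-role. The tag strings ("SUB" for the relation, roles 1/2 for the two entities) are simplified stand-ins, not Zheng et al.’s exact tag set:</p>

```python
# Sketch of decoding a one-step tag sequence into relation triples.
# "S-SUB-1" = single-token entity, relation "SUB", role 1 (subject).

def decode(tokens, tags):
    spans = []  # (relation, role, entity text)
    i = 0
    while i < len(tags):
        if tags[i] == "O":
            i += 1
            continue
        pos, rel, role = tags[i].split("-")
        j, words = i, [tokens[i]]
        # Extend the span until an E (end) or S (single) tag closes it.
        while pos not in ("E", "S") and j + 1 < len(tags) and tags[j + 1] != "O":
            j += 1
            pos = tags[j].split("-")[0]
            words.append(tokens[j])
        spans.append((rel, role, " ".join(words)))
        i = j + 1
    # Pair each role-1 span with role-2 spans of the same relation.
    return [(s1, r1, s2)
            for r1, role1, s1 in spans if role1 == "1"
            for r2, role2, s2 in spans if role2 == "2" and r2 == r1]

tokens = ["United", "is", "a", "unit", "of", "UAL", "."]
tags = ["S-SUB-1", "O", "O", "O", "O", "S-SUB-2", "O"]
print(decode(tokens, tags))
```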
<h2 id="reading">Recommended Reading</h2>
<ol>
<li><a href="http://www.cfilt.iitb.ac.in/resources/surveys/nandakumar-relation-extraction-2016.pdf">Nandakumar, Pushpak Bhattacharyya. “Relation Extraction.” (2016).</a></li>
<li><a href="https://arxiv.org/pdf/1705.03645.pdf">Kumar, Shantanu. “A Survey of Deep Learning Methods for Relation Extraction.” arXiv preprint arXiv:1705.03645 (2017).</a></li>
</ol>
<h2 id="reference">Reference</h2>
<ol>
<li>Hearst, Marti A. “Automatic acquisition of hyponyms from large text corpora.” Proceedings of the 14th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1992.</li>
<li>Brin, Sergey. “Extracting patterns and relations from the world wide web.” International Workshop on The World Wide Web and Databases. Springer, Berlin, Heidelberg, 1998.</li>
<li>Mintz, Mike, et al. “Distant supervision for relation extraction without labeled data.” Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 2009.</li>
<li>Zeng, Daojian, et al. “Relation Classification via Convolutional Deep Neural Network.” COLING. 2014.</li>
<li>Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).</li>
<li>Nguyen, T. H., & Grishman, R. (2015, June). Relation Extraction: Perspective from Convolutional Neural Networks. In VS@ HLT-NAACL (pp. 39-48).</li>
<li>Riedel, Sebastian, Limin Yao, and Andrew McCallum. “Modeling relations and their mentions without labeled text.” Machine learning and knowledge discovery in databases (2010): 148-163.</li>
<li>Zeng, D., Liu, K., Chen, Y., & Zhao, J. (2015, September). Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Emnlp (pp. 1753-1762).</li>
<li>Surdeanu, M., Tibshirani, J., Nallapati, R., & Manning, C. D. (2012, July). Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 455-465). Association for Computational Linguistics.</li>
<li>Lin, Yankai, et al. “Neural Relation Extraction with Selective Attention over Instances.” ACL (1). 2016.</li>
<li>Jiang, Xiaotian, et al. “Relation Extraction with Multi-instance Multi-label Convolutional Neural Networks.” COLING. 2016.</li>
<li>Katiyar, Arzoo, and Claire Cardie. “Going out on a limb: Joint Extraction of Entity Mentions and Relations without Dependency Trees.” Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2017.</li>
<li>Zheng, Suncong, et al. “Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme.” arXiv preprint arXiv:1706.05075 (2017).</li>
</ol>
<h1>Representation of Undirected Graphical Models</h1>
<p>Gavin Junjie Xing, 2017-08-15</p>
<p>This page’s Markdown is generated via pandoc from LaTeX.
If you feel more comfortable with a LaTeX layout, please check <a href="/blog/assets/pgm/lecture3/lecture3.pdf">here</a>.
The original TeX file is also available <a href="/blog/assets/pgm/lecture3/lecture3.tex">here</a>.</p>
<h2 id="1-review">1. Review</h2>
<p>There are several important concepts and theorems introduced in last
lecture about Directed Graphical Models.</p>
<ul>
<li>
<p>Local independence: for each variable <script type="math/tex">X_i</script>,
<script type="math/tex">(X_i \perp NonDescendants_{X_i}\ |\ Pa_{X_i}).</script>
This indicates that in a directed graph, each variable is independent of
its nondescendants given its parents.</p>
</li>
<li>
<p>Global independence:
<script type="math/tex">I(G) = \{(\mathbf{X}\perp{\mathbf{Y}}\ |\ \mathbf{Z})\ :\ d\textrm{-}sep_G(\mathbf{X} ; \mathbf{Y} \ |\ \mathbf{Z})\}.</script>
The <em>global</em> independences are given by d-separation. Note that there is
no need to worry too much about the <em>global</em> and <em>local</em> terminology; you
can call them whatever you want.</p>
</li>
<li>
<p>A fully connected DAG <script type="math/tex">\mathcal{G}</script> is an I-map of <em>any</em>
distribution, since <script type="math/tex">I_{l}(\mathcal{G}) = \emptyset \subset I(P)</script>
for any <script type="math/tex">P</script>.</p>
</li>
<li>
<p>Minimal I-map: A DAG <script type="math/tex">\mathcal{G}</script> is a minimal I-map of <script type="math/tex">P</script>, if the
removal of even a single edge from <script type="math/tex">\mathcal{G}</script> renders it not an
I-map.</p>
</li>
<li>
<p>A distribution may have several I-maps.</p>
</li>
<li>
<p>P-map: A DAG <script type="math/tex">\mathcal{G}</script> is a perfect map (p-map) of a
distribution <script type="math/tex">P</script> if <script type="math/tex">I(P)=I(\mathcal{G})</script></p>
</li>
</ul>
<p>Note that not every distribution has a perfect map as DAG. Here is an
example:</p>
<script type="math/tex; mode=display">A\perp C|\{B,D\}\quad B\perp D|\{A, C\}</script>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture3/assets/dgm_unable.png" style="border:none;width:80%" />
<div class="thecap">
Figure 1: Bayesian networks unable to capture both independences
</div>
</div>
<p>BN1 wrongly says <script type="math/tex">B\perp D|A</script>, BN2 wrongly says <script type="math/tex">B\perp{D}</script></p>
<p>It is impossible for a DAG to capture both of the two independences at
same time. The main reason is that the directed model (sometimes)
encodes more independences together with the one we want. Thus, there is
a portion of the space of distribution that we cannot encode with a DGM.
That motivates another type of graphical model: undirected graphical
models, aka Markov Random Fields.</p>
<h2 id="2-undirected-graphical-models">2. Undirected Graphical Models</h2>
<p>UGMs are very similar to DGMs in structure, but directed and
undirected edges encode information differently. The directed model encodes a <em>causal</em>
relationship between nodes, while a UGM captures a pairwise relationship
between nodes that represents <em>correlation</em>, a rough affinity.</p>
<p>Many things can be modeled as a UGM, such as a photo (each pixel can be a
node), a Go game (the grid board seems intuitive), or even social
networks, as shown in figure 2.</p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture3/assets/ugm_ex1.png" style="border:none;width:30%" />
<img src="/blog/assets/pgm/lecture3/assets/ugm_ex2.png" style="border:none;width:30%" />
<img src="/blog/assets/pgm/lecture3/assets/ugm_ex3.png" style="border:none;width:30%" />
<div class="thecap">
Figure 2: Example of Undirected model
</div>
</div>
<h2 id="3-representation">3. Representation</h2>
<p><strong>Definition</strong>: an undirected graphical model represents a distribution
<script type="math/tex">P(X_1,\ldots,X_n)</script> defined by an undirected graph <script type="math/tex">H</script> and a set of
positive potential functions <script type="math/tex">\psi_c</script> associated with the cliques of <script type="math/tex">H</script>,
s.t.</p>
<script type="math/tex; mode=display">P(X_1,\ldots,X_n) = \frac{1}{Z} \prod_{c\in C}{\psi_c(X_c)}
\label{equation:1}</script>
<p>where <script type="math/tex">Z</script> is known as a partition function:</p>
<script type="math/tex; mode=display">Z = \sum_{X_1, \ldots, X_n} \prod_{c\in C}(\psi_c(X_c))</script>
<p>The potential function can be understood as a contingency function of
its arguments, assigning a “pre-probabilistic” score to their joint
configuration. We call the distribution in the equation above
a <strong>Gibbs distribution</strong>, as in <em>Definition 4.3 of the Koller textbook</em>. The
potential function is defined as a <strong>factor</strong> in the Koller textbook.</p>
<p><strong>Definition</strong>: For <script type="math/tex">G=\{V, E\}</script>, a complete subgraph (clique) is a subgraph
<script type="math/tex">G'=\{V'\subseteq {V},E'\subseteq{E}\}</script> such that the nodes in <script type="math/tex">V'</script> are fully
interconnected. A (maximal) clique is a complete subgraph s.t. any
superset <script type="math/tex">V^{\prime\prime} \supset V'</script> is not complete.</p>
<h3 id="interpretation-of-clique-potentials">Interpretation of Clique Potentials</h3>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture3/assets/clique_potential.png" style="border:none;width:35%" />
</div>
<p>The model implies <script type="math/tex">X\perp Z|Y</script>. This independence statement implies (by
definition) that the joint must factorize
as:</p>
<script type="math/tex; mode=display">p(x,y,z)=p(y)p(x|y)p(z|y)</script>
<p>We can write this as</p>
<script type="math/tex; mode=display">p(x,y,z)=p(x,y)p(z|y)</script>
<p>or</p>
<script type="math/tex; mode=display">p(x,y,z)=p(x|y)p(z,y)</script>
<p>However, we <strong>cannot</strong> have all potentials be marginals and cannot have all
potentials be conditionals.</p>
<p>The positive clique potentials can only be thought of as general
“compatibility”, “goodness” or “happiness” functions over their
variables, but <strong>not as probability distributions</strong>.</p>
<h3 id="example-ugm--using-max-cliques">Example UGM — using max cliques</h3>
<p>Here we’ll use an example to show a UGM.</p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture3/assets/ugm_max_clique.png" style="border:none;width:80%" />
</div>
<p>We can factorize the graph into two max cliques:</p>
<script type="math/tex; mode=display">P(x_1,x_2,x_3,x_4)=\frac{1}{Z}\psi_c(X_{123})\times \psi_c(X_{234})</script>
<script type="math/tex; mode=display">Z=\sum_{x_1,x_2,x_3,x_4}\psi_c(X_{123})\times \psi_c(X_{234})</script>
<p>We can represent <script type="math/tex">P(X_{1:4})</script> as two 3D tables instead of one 4D table.</p>
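<p>A quick way to see the storage trade-off (my own back-of-the-envelope check, not from the lecture) is to count table entries when each variable takes <script type="math/tex">k</script> values:</p>

```python
def table_entries(k):
    """Entries needed to store P(X1..X4): one 4D table vs. two 3D clique tables."""
    full_4d = k ** 4                 # explicit joint over all four variables
    two_3d = 2 * k ** 3              # psi(X_123) and psi(X_234)
    return full_4d, two_3d

assert table_entries(10) == (10000, 2000)   # big savings for 10-valued variables
assert table_entries(2) == (16, 16)         # break-even in the binary case
```

<p>For binary variables this particular graph happens to break even; the savings grow with the variables’ arity and, more importantly, with the number of variables.</p>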
<h3 id="using-subcliques">Using subcliques</h3>
<p>In this example, the distribution factorizes over the subcliques.</p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture3/assets/ugm_sub_clique.png" style="border:none;width:40%" />
</div>
<script type="math/tex; mode=display">% <![CDATA[
\begin{split}
P(x_1,x_2,x_3,x_4) & = \frac{1}{Z}\prod_{ij}\psi_{ij}(X_{ij}) \\
& = \frac{1}{Z}\psi_{12}(X_{12})\psi_{14}(X_{14})\psi_{23}(X_{23})\psi_{24}(X_{24})\psi_{34}(X_{34}) \\
Z & = \sum_{x_1,x_2,x_3,x_4}\prod_{ij}\psi_{ij}(X_{ij})
\end{split} %]]></script>
<h3 id="example-ugm--canonical-representation">Example UGM — canonical representation</h3>
<p>A canonical representation of such a graph can be expressed as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{split}
P(x_1,x_2,x_3,x_4) & = \frac{1}{Z}\psi_c(X_{123})\times \psi_c(X_{234}) \\
& \times \psi_{12}(X_{12})\psi_{14}(X_{14})\psi_{23}(X_{23})\psi_{24}(X_{24})\psi_{34}(X_{34}) \\
& \times \psi_{x_1}(x_1)\psi_{x_2}(x_2)\psi_{x_3}(x_3)\psi_{x_4}(x_4) \\
Z & = \sum_{x_1,x_2,x_3,x_4} \ldots
\end{split} %]]></script>
<h2 id="4-independence-properties">4. Independence properties</h2>
<h3 id="global-independence">Global independence</h3>
<p><strong>Definition</strong> A set of nodes <script type="math/tex">Z</script> separates <script type="math/tex">X</script> and <script type="math/tex">Y</script> in <script type="math/tex">H</script>, denoted
<script type="math/tex">sep_H(X : Y |Z)</script>, if there is no active path between any node
<script type="math/tex">X \in \mathbf{X}</script> and <script type="math/tex">Y \in \mathbf{Y}</script> given <script type="math/tex">\mathbf{Z}</script>. Global
independences associated with <script type="math/tex">H</script> are defined as:</p>
<script type="math/tex; mode=display">I(H)=\{(X\perp Y|Z) : sep_H(X : Y|Z)\}</script>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture3/assets/ugm_separate.png" style="border:none;width:60%" />
<div class="thecap">
Figure 3: Illustrate separation.
</div>
</div>
<p>In Figure 3, B separates A and C if every path from a
node in A to a node in C passes through a node in B. This is written as
<script type="math/tex">sep_H(A : C\ |\ B)</script>. A probability distribution satisfies the global Markov
property if for any disjoint A, B, C such that B separates A and C, A is
independent of C given B.</p>
<h3 id="local-independence">Local independence</h3>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture3/assets/ugm_local.png" style="border:none;width:50%" />
<div class="thecap">
Figure 4: Illustration of Markov Blanket in undirected graph.
</div>
</div>
<p><strong>Definition</strong> For each node <script type="math/tex">X_i \in V</script>, there is a unique Markov blanket of <script type="math/tex">X_i</script>,
denoted <script type="math/tex">MB_{X_i}</script>, which is the set of neighbors of <script type="math/tex">X_i</script> in the graph
(those that share an edge with <script type="math/tex">X_i</script>).</p>
<p><strong>Definition</strong> The local Markov independencies associated with H is:</p>
<script type="math/tex; mode=display">I_l(H): \{X_i \perp V - \{X_i\} - MB_{x_i} | MB_{x_i} : \forall i\}</script>
<p>In other words, <script type="math/tex">X_i</script> is independent of the rest of the nodes in the graph
given its immediate neighbors.</p>
<p>Note that, based on the local independence:</p>
<script type="math/tex; mode=display">P(X_i\ |\ X_{-i})=P(X_i\ |\ MB_{X_i})</script>
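<p>This local property is easy to verify numerically. The sketch below (my own toy example, not from the lecture) brute-forces a 4-node chain MRF and checks that conditioning on the Markov blanket <script type="math/tex">\{x_0, x_2\}</script> of <script type="math/tex">x_1</script> gives the same answer as conditioning on everything else:</p>

```python
import itertools

# Chain MRF x0 - x1 - x2 - x3; each edge gets its own "agreement" potential.
psiA = lambda a, b: 3.0 if a == b else 1.0   # edge (x0, x1)
psiB = lambda a, b: 2.0 if a == b else 1.0   # edge (x1, x2)
psiC = lambda a, b: 5.0 if a == b else 1.0   # edge (x2, x3)

def p(x):  # unnormalized joint; the normalizer cancels in conditionals
    return psiA(x[0], x[1]) * psiB(x[1], x[2]) * psiC(x[2], x[3])

def cond(i, xi, evidence):
    """P(X_i = xi | evidence) by brute-force summation; evidence = {index: value}."""
    num = den = 0.0
    for x in itertools.product([0, 1], repeat=4):
        if any(x[j] != v for j, v in evidence.items()):
            continue
        den += p(x)
        if x[i] == xi:
            num += p(x)
    return num / den

# Conditioning on x3 in addition to the blanket {x0, x2} changes nothing.
full = cond(1, 1, {0: 1, 2: 0, 3: 0})
blanket = cond(1, 1, {0: 1, 2: 0})
assert abs(full - blanket) < 1e-12
```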
<h3 id="soundness-and-completeness-of-global-markov-property">Soundness and completeness of global Markov property</h3>
<ul>
<li>
<p><strong>Definition</strong> A UG <script type="math/tex">H</script> is an I-map for a distribution <script type="math/tex">P</script> if <script type="math/tex">I(H) \subseteq I(P)</script>, i.e., <script type="math/tex">P</script> entails <script type="math/tex">I(H)</script>.</p>
</li>
<li>
<p><strong>Definition</strong> P is a Gibbs distribution over H if it can be represented as</p>
</li>
</ul>
<script type="math/tex; mode=display">P(X_1, \ldots, X_n) = \frac{1}{Z}\prod_{c\in C}\psi_c(X_c)</script>
<ul>
<li>
<p><strong>Theorem</strong> (soundness): If <script type="math/tex">P</script> is a Gibbs distribution over <script type="math/tex">H</script>, then <script type="math/tex">H</script> is an I-map of <script type="math/tex">P</script>.</p>
</li>
<li>
<p><strong>Theorem</strong> (Completeness): If <script type="math/tex">X</script> and <script type="math/tex">Y</script> are not separated given <script type="math/tex">Z</script> in <script type="math/tex">H</script>
(<script type="math/tex">\lnot sep_H (X : Y\ |\ Z)</script>), then <script type="math/tex">X</script> and <script type="math/tex">Y</script> are dependent given <script type="math/tex">Z</script>
(<script type="math/tex">X \not\perp_P Y\ |\ Z</script>) in some
distribution <script type="math/tex">P</script> that factorizes over <script type="math/tex">H</script>.</p>
</li>
</ul>
<p>The proofs of the theorems are available in the Koller textbook.</p>
<h3 id="other-markov-properties">Other Markov properties</h3>
<p>For directed graphs, we defined I-maps in terms of local Markov
properties and derived global independence. For undirected graphs, we
defined I-maps in terms of global Markov properties, and will now derive
local independence.</p>
<p>The pairwise Markov independencies associated with UG <script type="math/tex">H = (V;E)</script> are</p>
<script type="math/tex; mode=display">I_p(H)=\{(X\perp Y|V-\{X,Y\}):\{X,Y\}\notin E\}</script>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture3/assets/ugn_pair_independence.png" style="border:none;width:50%" />
<div class="thecap">
Figure 5: Pairwise independence in undirected graph. Red nodes are observed.
</div>
</div>
<p>For example, in figure 5, we have the following independence</p>
<script type="math/tex; mode=display">X_1\perp X_5 | \{X_2, X_3,X_4\}</script>
<h3 id="relationship-between-local-and-global-markov-properties">Relationship between local and global Markov properties</h3>
<ul>
<li>
<p>For any Markov Network H, and any distribution P, we have that if
<script type="math/tex">P \models I(H)</script> then <script type="math/tex">P \models I_l(H)</script></p>
</li>
<li>
<p>For any Markov Network H, and any distribution P, we have that if
<script type="math/tex">P \models I_l(H)</script> then <script type="math/tex">P \models I_p(H)</script></p>
</li>
<li>
<p>Let P be a positive distribution. If <script type="math/tex">P \models I_p(H)</script>, then
<script type="math/tex">P \models I(H)</script></p>
</li>
</ul>
<p>The following three statements are equivalent for a positive
distribution P:</p>
<ul>
<li>
<script type="math/tex; mode=display">P \models I_l(H)</script>
</li>
<li>
<script type="math/tex; mode=display">P \models I_p(H)</script>
</li>
<li>
<script type="math/tex; mode=display">P \models I(H)</script>
</li>
</ul>
<p>The above equivalence relies on the positivity assumption on <script type="math/tex">P</script>. For
nonpositive distributions <script type="math/tex">P</script>, there are examples
that satisfy one of these properties but not a
stronger one.</p>
<h3 id="perfect-maps">Perfect maps</h3>
<p><strong>Definition</strong> A Markov network <script type="math/tex">H</script> is a perfect map for <script type="math/tex">P</script> if for any <script type="math/tex">X</script>, <script type="math/tex">Y</script>, <script type="math/tex">Z</script> we have that</p>
<script type="math/tex; mode=display">sep_H(X;Z|Y) \Leftrightarrow P \models (X\perp Z|Y)</script>
<p>Note that, just like directed models, not every distribution has a perfect map as a
UGM.</p>
<h3 id="exponential-form">Exponential Form</h3>
<p>Constraining clique potentials to be positive could be inconvenient
(e.g., the interactions between a pair of atoms can be either attractive
or repulsive). We can instead represent a clique potential <script type="math/tex">\psi_c(X_c)</script> in an
unconstrained form using a real-valued “energy” function <script type="math/tex">\phi_c(X_c)</script>:</p>
<script type="math/tex; mode=display">\psi_c(X_c) = exp\{-\phi_c(X_c)\}</script>
<p>Thus, this gives the joint distribution an additive structure:</p>
<script type="math/tex; mode=display">P(X)=\frac{1}{Z}exp\{-\sum_{c\in C}\phi_c(X_c)\} = \frac{1}{Z}exp\{-H(X)\}</script>
<p>where the <script type="math/tex">H(X)</script> is called the “free energy”.</p>
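<p>A tiny sketch of this parameterization (my own, with made-up energies): any real-valued energy table yields a strictly positive, properly normalized distribution:</p>

```python
import math

def boltzmann(energies):
    """P(x) = exp(-H(x)) / Z for a table of energies H(x)."""
    weights = {x: math.exp(-H) for x, H in energies.items()}
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

# Energies may be negative or positive; the exponential keeps P positive.
P = boltzmann({(0, 0): -1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): -1.0})
assert abs(sum(P.values()) - 1.0) < 1e-12
assert P[(0, 0)] > P[(0, 1)]   # lower energy means higher probability
```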
<p>The exponential ensures that the distribution is positive. In physics,
this is called the “Boltzmann distribution”. In statistics, this is
called a log-linear model (as the Koller textbook introduces).</p>My scribe on lecture 3, CMU 10-708: explains why DMs are not enough, and introduces UGMs.Directed GMs: Bayesian Networks2017-08-11T00:00:00-07:00https://xingjunjie.me/blog/posts/2017/08/11/Directed-GMs-Bayesian-Networks<p>This page’s Markdown is generated via pandoc from LaTeX.
If you feel more comfortable with a LaTex layout, please check <a href="/blog/assets/pgm/lecture2/lecture2.pdf">here</a>.
The original Tex file is also available <a href="/blog/assets/pgm/lecture2/lecture2.tex">here</a>.</p>
<h2 id="1-introduction">1. Introduction</h2>
<p>The goal of establishing GMs (Graphical Models) is to represent a joint
distribution <script type="math/tex">P</script> over some set of random variables
<script type="math/tex">\mathbf{\chi} = \{X_1,\ldots,X_n\}</script>. Consider the simplest case where
each variable is binary-valued: a joint distribution requires a total of
<script type="math/tex">2^n - 1</script> numbers (the minus 1 comes from the sum-to-one constraint). This
explicit representation of the joint distribution is unmanageable from
every perspective.</p>
<ul>
<li>
<p><strong>Computationally</strong>, it’s very expensive to manipulate and too large
to store in memory.</p>
</li>
<li>
<p><strong>Cognitively</strong>, it is impossible to acquire so many numbers from a
human expert, and the numbers are very small and do not correspond
to events that people can reasonably contemplate.</p>
</li>
<li>
<p><strong>Statistically</strong>, if we want to learn the distribution from data,
we would need ridiculously large amounts of data to estimate this
many parameters robustly.</p>
</li>
</ul>
<p>However, <strong>Bayesian Networks</strong> are able to provide compact
representations by exploiting <strong>Independence Properties</strong>.</p>
<h2 id="2-the-student-example">2. The <em>student</em> Example</h2>
<p>We’ll introduce perhaps the simplest example to see how <strong>independence
assumptions</strong> produce a very compact representation of a
high-dimensional distribution.</p>
<p>We now assume that a company would like to hire some graduates. The
company’s goal is to hire intelligent employees, but there is no way to
test intelligence directly. However, the company has access to
students’ SAT scores and course grades. Thus, our probability space is
induced by three relevant random variables <script type="math/tex">I, S</script> and <script type="math/tex">G</script>. Assuming that
<script type="math/tex">G</script> takes on three values <script type="math/tex">g^1,g^2,g^3</script>, representing grades <script type="math/tex">A, B</script> and
<script type="math/tex">C</script>, <script type="math/tex">I</script> takes on two values <script type="math/tex">i^0</script>(low intelligence), <script type="math/tex">i^1</script>(high
intelligence), <script type="math/tex">S</script> takes on two values <script type="math/tex">s^0</script>(low score) and <script type="math/tex">s^1</script>(high
score).</p>
<p>We can get some intuitive independences in this example. The student’s
intelligence is clearly correlated both with his SAT score and grade.
The SAT score and grade are also not independent. If we condition on the fact that
the student received a high score on his SAT, the chances that he gets a
high grade in his class are also likely to increase. Thus, we assume
that, for our distribution <script type="math/tex">P</script>,</p>
<script type="math/tex; mode=display">P(g^1\ |\ s^1) > P(g^1\ |\ s^0)</script>
<p>However, it’s quite plausible that our distribution <script type="math/tex">P</script> satisfies a
<strong>conditional independence property</strong>. If we know that the student has
high intelligence, a high grade on the SAT no longer gives us
information about the student’s performance in the class. That is:</p>
<script type="math/tex; mode=display">P(g\ |\ i^1,s^1) = P(g\ |\ i^1)</script>
<p>Generally, we may assume that</p>
<script type="math/tex; mode=display">P\models(S\perp G\ |\ I)</script>
<p>Note that this independence holds only if we assume that the student’s
intelligence is the only reason why his grade and SAT score might be
correlated, i.e., that there are no correlations
due to other factors. These assumptions are also not “true” in any
formal sense of the word, and they are often only approximations of our true
beliefs.</p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture2/assets/student_nb.png" style="border:none;" />
<div class="thecap">
Figure 1: Simple Bayesian networks for the student
</div>
</div>
<p>As in the case of marginal independence, conditional independences
allow us to provide a compact specification of the joint distribution.
The compact representation is based on a very natural alternative
parameterization. By simple probabilistic reasoning, we have that</p>
<script type="math/tex; mode=display">P(I,S,G) = P(S,G \ |\ I)P(I).</script>
<p>But now, the <strong>conditional independence assumption</strong> implies</p>
<script type="math/tex; mode=display">P(S,G\ |\ I) = P(S\ |\ I)P(G\ |\ I).</script>
<p>Hence, we have that</p>
<script type="math/tex; mode=display">P(I,S,G) = P(S\ |\ I)P(G\ |\ I)P(I)</script>
<p>Thus, we have factorized the joint distribution <script type="math/tex">P(I,S,G)</script> as a product
of three conditional probability distributions (CPDs). This
factorization immediately leads us to the desired alternative
parameterization. Together with <script type="math/tex">P(I), P(S\ |\ I), P(G\ |\ I)</script>, we can
specify the joint distribution. For example, <script type="math/tex">P(i^1,s^1,g^2) = P(i^1)P(s^1\ |\ i^1)P(g^2\ |\ i^1)</script>.</p>
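<p>The factorized parameterization is straightforward to code up. Below is a sketch whose structure <script type="math/tex">P(I)P(S\ |\ I)P(G\ |\ I)</script> comes from the text, but whose CPD numbers are illustrative values of my own choosing:</p>

```python
# CPDs for the student example; the numbers are illustrative, not from the text.
P_I = {'i0': 0.7, 'i1': 0.3}
P_S_given_I = {'i0': {'s0': 0.95, 's1': 0.05},
               'i1': {'s0': 0.2,  's1': 0.8}}
P_G_given_I = {'i0': {'g1': 0.2,  'g2': 0.34, 'g3': 0.46},
               'i1': {'g1': 0.74, 'g2': 0.17, 'g3': 0.09}}

def joint(i, s, g):
    """P(I, S, G) = P(I) P(S | I) P(G | I)."""
    return P_I[i] * P_S_given_I[i][s] * P_G_given_I[i][g]

# The five small CPD tables determine every one of the twelve joint entries.
assert abs(joint('i1', 's1', 'g2') - 0.3 * 0.8 * 0.17) < 1e-12
total = sum(joint(i, s, g)
            for i in P_I for s in ('s0', 's1') for g in ('g1', 'g2', 'g3'))
assert abs(total - 1.0) < 1e-9
```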
<p>We note that this probabilistic model would be represented using the
Bayesian network shown in Figure 1.</p>
<p>In this case, the alternative parameterization is more compact than the
joint. We now have three binomial distributions — <script type="math/tex">P(I)</script>, <script type="math/tex">P(S\ |\ i^1)</script>
and <script type="math/tex">P(S\ |\ i^0)</script> — and two three-valued multinomial distributions —
<script type="math/tex">P(G\ |\ i^1)</script> and <script type="math/tex">P(G\ |\ i^0)</script>. Each of the binomials requires one
independent parameter, and each three-valued multinomial requires two
independent parameters, for a total of <strong>seven</strong>
(<script type="math/tex">3 \times (2 - 1) + 2 \times (3 - 1)</script>). By contrast, our joint distribution has
twelve entries, and hence <strong>eleven</strong> independent parameters.</p>
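<p>The parameter count in the last paragraph can be checked mechanically:</p>

```python
# Independent parameters: factorized representation vs. the explicit joint.
binomials = 3        # P(I), P(S | i0), P(S | i1): one free parameter each
multinomials = 2     # P(G | i0), P(G | i1): two free parameters each
factorized = binomials * (2 - 1) + multinomials * (3 - 1)
explicit = 2 * 2 * 3 - 1   # twelve joint entries minus the sum-to-one constraint
assert (factorized, explicit) == (7, 11)
```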
<h2 id="3-bayesian-networks">3. Bayesian Networks</h2>
<p>Bayesian networks build on the same intuition as the naive Bayes model by
exploiting conditional independence properties in order to allow a
compact and natural representation. However, they are not restricted to
the strong independence assumptions the naive Bayes model makes.</p>
<p>The core of the Bayesian network representation is a directed acyclic
graph (DAG), whose nodes are the random variables in our domain and
whose edges correspond, intuitively, to direct influence of one node on
another.</p>
<p>We can view the graph in two ways:</p>
<ul>
<li>
<p>a data structure that provides the skeleton for representing <strong>a
joint distribution</strong> compactly in a <em>factorized</em> way.</p>
</li>
<li>
<p>a compact representation for <strong>a set of conditional independence
assumptions</strong> about a distribution.</p>
</li>
</ul>
<h3 id="factorization-theorem">Factorization Theorem</h3>
<p>Given a DAG, the most general form of the probability distribution that
is <strong>consistent</strong> with the graph factors according to “<strong>node given its
parents</strong>”:</p>
<script type="math/tex; mode=display">P(X) = \prod_{i=1:d}{P(X_i\ |\ X_{\pi_i})}</script>
<p>where <script type="math/tex">X_{\pi_i}</script>is the set of parent node of <script type="math/tex">x_i</script>, and <script type="math/tex">d</script> is the number of
nodes.See Figure 2 for an example. This graph
can be factorized and represented as follows: <script type="math/tex"></script>\begin{split}
&P(X_1,X_2,X_3,X_4,X_5,X_6,X_7,X_8) = <br />
&P(X_1)P(X_2)P(X_3\ |\ X_1)P(X_4\ |\ X_2)P(X_5\ |\ X_2)P(X_6\ |\ X_3, X_4)P(X_7\ |\ X_6)P(X_8\ |\ X_5, X_6)
\end{split}<script type="math/tex"></script></p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture2/assets/factorize_example.png" style="border:none;width:60%;" />
<div class="thecap">
Figure 2: Factorize example
</div>
</div>
<h3 id="local-structures-and-independences">Local Structures and Independences</h3>
<p>Graphical models have three fundamental local structures that compose
bigger structures.</p>
<ul>
<li>
<p><strong>Common parent</strong> Fixing <script type="math/tex">B</script> decouples <script type="math/tex">A</script> and <script type="math/tex">C</script>. When two
variables <script type="math/tex">A</script> and <script type="math/tex">C</script> have a common parent <script type="math/tex">B</script>, conditional
independence <script type="math/tex">A\perp C\ |\ B</script> holds.</p>
</li>
<li>
<p><strong>Cascade</strong> Knowing <script type="math/tex">B</script> decouples <script type="math/tex">A</script> and <script type="math/tex">C</script>. When a middle node in
a cascaded three random variables is known, a conditional
independence <script type="math/tex">A\perp C\ |\ B</script> holds.</p>
</li>
<li>
<p><strong>V-structure</strong> If <script type="math/tex">C</script> is not observed, then <script type="math/tex">A</script> and <script type="math/tex">B</script> are
independent. However, if it is given, then the independence is lost.
(<script type="math/tex">A</script> and <script type="math/tex">B</script> are not independent given <script type="math/tex">C</script>). In this case, <script type="math/tex">A</script> and
<script type="math/tex">B</script> are <em>marginally independent</em>.</p>
</li>
</ul>
<p>The unintuitive V-structure can be described by a simple example.
Suppose <script type="math/tex">A =</script> the tower clock is accurate, <script type="math/tex">B =</script> there is a traffic jam on Eric’s way to
campus, and <script type="math/tex">C =</script> Eric is on time for class. If Eric is not on time and
the clock is accurate, then our belief that <script type="math/tex">B</script> occurred is higher.</p>
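<p>The “explaining away” behavior of the V-structure can also be demonstrated numerically. The sketch below (my own toy CPDs, not from the lecture) builds a model <script type="math/tex">A \rightarrow C \leftarrow B</script> and checks both halves of the claim:</p>

```python
import itertools

# V-structure A -> C <- B with binary variables; CPD numbers are illustrative.
P_A = {0: 0.5, 1: 0.5}
P_B = {0: 0.5, 1: 0.5}
# C is likely true whenever A or B is true.
P_C_given_AB = {(a, b): {1: 0.9 if (a or b) else 0.1}
                for a in (0, 1) for b in (0, 1)}
for ab in P_C_given_AB:
    P_C_given_AB[ab][0] = 1 - P_C_given_AB[ab][1]

def joint(a, b, c):
    return P_A[a] * P_B[b] * P_C_given_AB[(a, b)][c]

def cond_a(a, b=None, c=None):
    """P(A = a | evidence) by brute-force summation."""
    num = den = 0.0
    for aa, bb, cc in itertools.product((0, 1), repeat=3):
        if (b is not None and bb != b) or (c is not None and cc != c):
            continue
        den += joint(aa, bb, cc)
        if aa == a:
            num += joint(aa, bb, cc)
    return num / den

# Marginally, B tells us nothing about A ...
assert abs(cond_a(1) - cond_a(1, b=1)) < 1e-12
# ... but once C is observed, learning B changes our belief about A.
assert abs(cond_a(1, c=1) - cond_a(1, b=1, c=1)) > 0.01
```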
<h2 id="4-i-maps">4. I-maps</h2>
<p><strong>Definition 4.1</strong> Let <script type="math/tex">P</script> be a distribution over <script type="math/tex">X</script>. We define <script type="math/tex">I(P)</script> to be the set of
independence assertions of the form <script type="math/tex">(X \perp Y\ |\ Z)</script> that hold in P.</p>
<p><strong>Definition 4.2</strong> Let <script type="math/tex">K</script> be any graph object associated with a set of independences
<script type="math/tex">I(K)</script>. Then <script type="math/tex">K</script> is an I-map for a set of independences <script type="math/tex">I</script> if
<script type="math/tex">I(K) \subseteq I</script>.</p>
<p>For example, if a graph <script type="math/tex">K</script> is totally connected, then every pair of
variables is dependent; more formally, <script type="math/tex">I(K) = \emptyset \subseteq I(P)</script>. A
complete graph is “useless”, since it does not give any knowledge about
the structure.</p>
<h3 id="facts-about-i-maps">Facts about I-maps</h3>
<p>For <script type="math/tex">G</script> to be an I-map of <script type="math/tex">P</script>, it is necessary that <script type="math/tex">G</script> does not mislead
us regarding independences in <script type="math/tex">P</script>. In other words, any independence that
<script type="math/tex">G</script> asserts must also hold in <script type="math/tex">P</script>, but conversely, <script type="math/tex">P</script> may have
additional independences that are not reflected in <script type="math/tex">G</script>.</p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture2/assets/imap_example.png" style="border:none;" />
<div class="thecap">
Figure 3: I-map example
</div>
</div>
<p>Example:</p>
<p>Consider a joint probability space over two independent random variables
<script type="math/tex">X</script> and <script type="math/tex">Y</script>. There are three possible graphs (as shown in Figure
3) over these two nodes: <script type="math/tex">G_\emptyset</script>, which is a
disconnected pair <script type="math/tex">X</script> <script type="math/tex">Y</script>; <script type="math/tex">G_{X\rightarrow Y}</script>, which has the edge
<script type="math/tex">X\rightarrow Y</script>; and <script type="math/tex">G_{Y\rightarrow X}</script>, which contains
<script type="math/tex">Y\rightarrow X</script>. The graph <script type="math/tex">G_\emptyset</script> encodes the assumption that
<script type="math/tex">(X \perp Y )</script>. The latter two encode no independence assumptions.</p>
<p>Consider following two distributions:</p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture2/assets/4-1distribution.png" style="border:none;width:50%;" />
<div class="thecap">
</div>
</div>
<p>In the example on the left, <script type="math/tex">X</script> and <script type="math/tex">Y</script> are independent in <script type="math/tex">P</script>; for
example, <script type="math/tex">P(x^1) = 0.48 + 0.12 = 0.6</script>, <script type="math/tex">P(y^1) = 0.8</script>, and
<script type="math/tex">P(x^1, y^1) = 0.48 = 0.6 · 0.8</script>. Thus, <script type="math/tex">(X \perp Y ) \in I(P)</script>, and we
have that <script type="math/tex">G_\emptyset</script> is an I-map of <script type="math/tex">P</script>. In fact, all three graphs
are I-maps of <script type="math/tex">P</script>: <script type="math/tex">I(G_{X\rightarrow Y})</script> is empty, so that trivially
<script type="math/tex">P</script> satisfies all the independences in it (similarly for
<script type="math/tex">G_{Y\rightarrow X}</script> ). In the example on the right,
<script type="math/tex">(X \perp Y) \not\in I(P)</script>, so that <script type="math/tex">G_\emptyset</script> is not an I-map of
<script type="math/tex">P</script>. Both other graphs are I-maps of <script type="math/tex">P</script>.</p>
<h3 id="local-independences">Local independences</h3>
<p><strong>Definition 4.3</strong> A Bayesian network structure <script type="math/tex">G</script> is a directed acyclic graph whose nodes
represent random variables <script type="math/tex">X_1,\ldots,X_n</script> . Let <script type="math/tex">Pa_{X_i}</script> denote the
parents of <script type="math/tex">X_i</script> in <script type="math/tex">G</script>, and <script type="math/tex">NonDescendants_{X_i}</script> denote the variables
in the graph that are not descendants of <script type="math/tex">X_i</script> . Then <script type="math/tex">G</script> encodes the
following set of <strong>local conditional independence assumptions</strong></p>
<script type="math/tex; mode=display">I_l(G) : \textrm{For each variable } X_i: (X_i \perp NonDescendants_{X_i}\ |\ Pa_{X_i}).</script>
<p>In other words, a node <script type="math/tex">X_i</script> is independent of its non-descendants given
its parents.</p>
<h2 id="5-d-separation">5. D-separation</h2>
<p><strong>Direct connection</strong> The simple case is that <script type="math/tex">X</script> and <script type="math/tex">Y</script> are directly
connected via an edge, say <script type="math/tex">X \rightarrow Y</script>. For any network structure
<script type="math/tex">G</script> that contains the edge <script type="math/tex">X \rightarrow Y</script> , it is possible to
construct a distribution where <script type="math/tex">X</script> and <script type="math/tex">Y</script> are correlated regardless of
any evidence about any of the other variables in the network. In other
words, if <script type="math/tex">X</script> and <script type="math/tex">Y</script> are directly connected, we can always get examples
where they influence each other, regardless of <script type="math/tex">Z</script>.</p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture2/assets/xyz_trail.png" style="border:none;" />
<div class="thecap">
Figure 4: The four possible two-edge trails from X to Y via Z
</div>
</div>
<p><strong>Indirect connection</strong> Now consider the more complicated case when X and
Y are not directly connected, but there is a trail between them in the
graph. We begin by considering the simplest such case: a three-node
network, where X and Y are not directly connected, but where there is a
trail between them via Z. It is clear that there are four cases where X
and Y are connected via Z, as shown in Figure 4.</p>
<ul>
<li>
<p>Causal trail <script type="math/tex">X \rightarrow Z \rightarrow Y</script>, and evidential trail
<script type="math/tex">X \leftarrow Z \leftarrow Y</script>: active iff <script type="math/tex">Z</script> is not observed. These
two are shown in Figure 4(a),(b)</p>
</li>
<li>
<p>Common cause <script type="math/tex">X \leftarrow Z \rightarrow Y</script> : active iff <script type="math/tex">Z</script> is not
observed.</p>
</li>
<li>
<p>Common effect <script type="math/tex">X \rightarrow Z \leftarrow Y</script> : active iff <script type="math/tex">Z</script> or one
of its descendants is observed.</p>
</li>
</ul>
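<p>One standard way to test d-separation programmatically (a sketch of my own, using Lauritzen’s moralized-ancestral-graph criterion rather than the trail-by-trail case analysis above) is: restrict the DAG to the ancestors of <script type="math/tex">\mathbf{X} \cup \mathbf{Y} \cup \mathbf{Z}</script>, moralize it, and then check plain undirected separation given <script type="math/tex">\mathbf{Z}</script>:</p>

```python
from collections import deque

def d_separated(parents, X, Y, Z):
    """d-sep_G(X ; Y | Z) via the moralized ancestral graph.

    parents: DAG given as {node: set of parents}.
    """
    # 1. Keep only the ancestors of X, Y and Z (including those nodes).
    anc, stack = set(), list(X | Y | Z)
    while stack:
        u = stack.pop()
        if u not in anc:
            anc.add(u)
            stack.extend(parents[u])
    # 2. Moralize: connect each node to its parents and "marry" co-parents.
    adj = {u: set() for u in anc}
    for u in anc:
        ps = parents[u] & anc
        for p in ps:
            adj[u].add(p); adj[p].add(u)
        for p in ps:
            for q in ps:
                if p != q:
                    adj[p].add(q); adj[q].add(p)
    # 3. X and Y must be disconnected once Z is deleted.
    seen = set(X) - Z
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in Y:
            return False
        for v in adj[u] - Z:
            if v not in seen:
                seen.add(v); queue.append(v)
    return True

# Common cause A <- Z -> B: observing Z blocks the trail.
g = {'A': {'Z'}, 'B': {'Z'}, 'Z': set()}
assert d_separated(g, {'A'}, {'B'}, {'Z'})
# V-structure A -> C <- B: the trail is blocked until C is observed.
g = {'A': set(), 'B': set(), 'C': {'A', 'B'}}
assert d_separated(g, {'A'}, {'B'}, set())
assert not d_separated(g, {'A'}, {'B'}, {'C'})
```

<p>The co-parent “marriage” in step 2 is what captures the common-effect case: once a collider (or one of its descendants) is in the ancestral set, its parents become connected in the moral graph.</p>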
<p><strong>Definition 5.1</strong> Let <script type="math/tex">\mathbf{X}</script>, <script type="math/tex">\mathbf{Y}</script> , <script type="math/tex">\mathbf{Z}</script> be three sets of nodes in
<script type="math/tex">G</script>. We say that <script type="math/tex">\mathbf{X}</script> and <script type="math/tex">\mathbf{Y}</script> are d-separated given
<script type="math/tex">\mathbf{Z}</script>, denoted
<script type="math/tex">d\textrm{-}sep_G(\mathbf{X} ; \mathbf{Y} \ |\ \mathbf{Z})</script>, if there is no
active trail between any node <script type="math/tex">X \in \mathbf{X}</script> and <script type="math/tex">Y \in \mathbf{Y}</script>
given <script type="math/tex">\mathbf{Z}</script>. We use <script type="math/tex">I(G)</script> to denote the set of independences
that correspond to d-separation:</p>
<script type="math/tex; mode=display">I(G) = \{(\mathbf{X}\perp{\mathbf{Y}}\ |\ \mathbf{Z})\ :\ d\textrm{-}sep_G(\mathbf{X} ; \mathbf{Y} \ |\ \mathbf{Z})\}.</script>
<p>This set is also called the set of <strong>global Markov independences</strong>.</p>
<h2 id="6-soundness-and-completeness">6. Soundness and completeness</h2>
<p><strong>Soundness</strong> If a distribution <script type="math/tex">P</script> factorizes according to a graph <script type="math/tex">G</script>,
then <script type="math/tex">I(G) \subseteq I(P)</script>.</p>
<p><strong>Completeness</strong> d-separation detects all possible independences.</p>
<p>However, it is important to note that if <script type="math/tex">X</script> and <script type="math/tex">Y</script> are not d-separated
given <script type="math/tex">Z</script> in <script type="math/tex">G</script>, then it is not the case that <script type="math/tex">X</script> and <script type="math/tex">Y</script> are dependent given
<script type="math/tex">Z</script> in all distributions that factorize over <script type="math/tex">G</script>. For example, consider
the graph <script type="math/tex">A \rightarrow B</script>. Clearly, <script type="math/tex">A</script> and <script type="math/tex">B</script> are not d-separated. Note
that every distribution over <script type="math/tex">A</script> and <script type="math/tex">B</script> factorizes according to this
graph, since it is always true that <script type="math/tex">P(A, B) = P(A)P(B\ |\ A)</script>. But if
we consider the specific distribution given in Table 1, then <script type="math/tex">A \perp B</script>.
However, we can assert that if <script type="math/tex">X</script> and <script type="math/tex">Y</script> are not d-separated given
<script type="math/tex">Z</script>, then there is at least one distribution which factorizes according
to the graph, and where <script type="math/tex">X</script> is not independent of <script type="math/tex">Y</script> given <script type="math/tex">Z</script>.
Combining this with the above theorems gives us an important result.</p>
<div class="imgcap">
<img src="/blog/assets/pgm/lecture2/assets/6distribution.png" style="border:none;width:30%" />
</div>
<p>Table 1: The distribution specified in this table factorizes according to the graph <script type="math/tex">A \rightarrow B</script> but <script type="math/tex">A</script> is independent of <script type="math/tex">B</script>.</p>
<h2 id="7-uniqueness-of-bn">7. Uniqueness of BN</h2>
<p>Very different BN graphs can actually be equivalent, in that they encode
precisely the same set of conditional independence assertions. For
example, the three networks in Figure 4(a),(b),(c)
encode precisely the same independence assumption: <script type="math/tex">X\perp Y\ |\ Z</script>.
Note that the v-structure network in Figure 4(d)
induces a very different set of d-separation assertions, and hence it
does not fall into the same I-equivalence class as the first three.</p>
<p><strong>Definition 7.1</strong> Two graph structures <script type="math/tex">K^1</script> and <script type="math/tex">K^2</script> over <script type="math/tex">X</script> are I-equivalent if
<script type="math/tex">I(K^1) = I(K^2)</script>. The set of all graphs over <script type="math/tex">X</script> is partitioned into a
set of mutually exclusive and exhaustive I-equivalence classes, which
are the set of equivalence classes induced by the I-equivalence
relation.</p>
<p><strong>Definition 7.2</strong> The skeleton of a Bayesian network graph <script type="math/tex">\mathcal{G}</script> over <script type="math/tex">X</script> is an
undirected graph over <script type="math/tex">X</script> that contains an edge <script type="math/tex">\{X, Y\}</script> for every
edge <script type="math/tex">(X, Y)</script> in <script type="math/tex">\mathcal{G}</script>.</p>
<p><strong>Theorem 7.1</strong> Let <script type="math/tex">\mathcal{G^1}</script> and <script type="math/tex">\mathcal{G^2}</script> be two graphs over <script type="math/tex">X</script>. If
<script type="math/tex">\mathcal{G^1}</script> and <script type="math/tex">\mathcal{G^2}</script> have the same skeleton and the same
set of v-structures, then they are I-equivalent.</p>
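<p>Theorem 7.1 suggests a simple computational test for I-equivalence. Below is a minimal sketch (the helper names are illustrative, not from the lecture) that compares two DAGs, given as sets of directed edges, by their skeletons and v-structures:</p>

```python
# Sketch of the Theorem 7.1 test: two DAGs with the same skeleton and the
# same set of v-structures are I-equivalent.
from itertools import combinations

def skeleton(edges):
    # undirected version of the edge set
    return {frozenset(e) for e in edges}

def v_structures(edges):
    # X -> Z <- Y where X and Y are not adjacent
    vs = set()
    for (x, z1), (y, z2) in combinations(edges, 2):
        if z1 == z2 and frozenset((x, y)) not in skeleton(edges):
            vs.add((frozenset((x, y)), z1))
    return vs

def same_skeleton_and_v_structures(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# The networks over X, Y, Z from the example above:
chain1  = {("X", "Z"), ("Z", "Y")}   # X -> Z -> Y
chain2  = {("Y", "Z"), ("Z", "X")}   # Y -> Z -> X
fork    = {("Z", "X"), ("Z", "Y")}   # X <- Z -> Y
vstruct = {("X", "Z"), ("Y", "Z")}   # X -> Z <- Y

print(same_skeleton_and_v_structures(chain1, chain2))  # True
print(same_skeleton_and_v_structures(chain1, fork))    # True
print(same_skeleton_and_v_structures(chain1, vstruct)) # False
```

<p>As expected, the two chains and the common-cause fork pass the test, while the v-structure network does not.</p>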
<h2 id="8-minimum-i-map">8. Minimum I-Map</h2>
<p>The complete graph is a trivial I-map for any distribution over its
variables, since it does not reveal any of the independence structure in
the distribution.</p>
<p><strong>Definition 8.1</strong> A graph <script type="math/tex">\mathcal{K}</script> is a minimal I-map for a set of independences
<script type="math/tex">\mathcal{I}</script> if it is an I-map for <script type="math/tex">\mathcal{I}</script>, and if the removal of
even a single edge from <script type="math/tex">\mathcal{K}</script> renders it not an I-map.</p>
<h2 id="9--perfect-maps">9. Perfect Maps</h2>
<p><strong>Definition 9.1</strong> We say that a graph <script type="math/tex">\mathcal{K}</script> is a perfect map (P-map) for a set of
independences <script type="math/tex">I</script> if we have that <script type="math/tex">I(\mathcal{K}) = I</script>. We say that
<script type="math/tex">\mathcal{K}</script> is a perfect map for <script type="math/tex">P</script> if <script type="math/tex">I(\mathcal{K}) = I(P)</script>.</p>
<p>Note that not every distribution has a perfect map.</p>
<h2 id="10-summary">10. Summary</h2>
<ul>
<li>
<p><strong>Definition 10.1</strong> A Bayesian network is a pair <script type="math/tex">B = (G, P)</script> where <script type="math/tex">P</script> factorizes
over <script type="math/tex">G</script>, and where <script type="math/tex">P</script> is specified as a set of CPDs associated
with the nodes of <script type="math/tex">G</script>. The distribution <script type="math/tex">P</script> is often annotated <script type="math/tex">P_B</script>.</p>
</li>
<li>
<p>BN utilizes local and global independences to give a compact
representation of the joint distribution.</p>
</li>
<li>
<p>Joint likelihood is computed by multiplying CPDs.</p>
</li>
<li>
<p>Local and global independences are identifiable via d-separation.</p>
</li>
</ul>
<p>My scribe on lecture 2, CMU 10-708. In this blog, I’ll introduce Directed GM, with theorems and definitions.</p>
<p>Neural Networks from Scratch (2017-08-07, https://xingjunjie.me/blog/posts/2017/08/07/Neural-Networks-from-Scratch)</p>
<p>Hi there, I’m a junior student at Shanghai Jiao Tong University (SJTU). In my sophomore year, I started to learn about machine learning by using a <em>Support Vector Machine (SVM)</em> to classify credit card digits, something like the <a href="http://yann.lecun.com/exdb/mnist/">MNIST task</a>. Then I joined ADAPT and began my research work on NLP.</p>
<p>Within the NLP domain, many statistical methods work quite well, giving astonishing results on basic tasks such as <em>chunking</em> (which is especially important in Chinese), <em>Part-of-Speech tagging (POS)</em>, and <em>Named-Entity Recognition (NER)</em>. Models like the <em>Hidden Markov Model (HMM)</em> and <em>Conditional Random Fields (CRF)</em> can give you a glimpse.</p>
<p>These days, I have been attracted by the awesome performance of Deep Learning in NLP, which utilizes <strong>Neural Network</strong> models. I’ll try my best to give an <strong>overview</strong> of the Neural Networks stuff. In my own experience, the mathematics sometimes distracts me from the intuition of how Neural Networks work. Thus, in this blog, there will only be some <strong>“baby math”</strong>, as <a href="https://nlp.stanford.edu/manning/">Prof. Chris Manning</a> calls it in <a href="http://web.stanford.edu/class/cs224n/index.html">CS224n</a>.</p>
<p>Part of this blog overlaps with the <a href="http://karpathy.github.io/neuralnets/">great blog</a> from Andrej Karpathy, which is definitely worth reading. But I’ll try to give you something more concrete about <strong>backpropagation</strong> by implementing a linear regression model, and illustrate some <strong>examples</strong> and <strong>applications</strong> of Neural Networks in <strong>NLP</strong>, including word embeddings and a character-level language model.</p>
<p>Let’s start!</p>
<h2 id="chapter-1-vanilla-neural-networks">Chapter 1: Vanilla Neural Networks</h2>
<p>At first glance at the Neural Network architecture, I wondered how it could work so brilliantly. It seems like nonsense that several layers of mathematical computation can simulate the human brain, one of the most complicated creations in the world, even a little.</p>
<h3 id="human-neurons">Human Neurons</h3>
<p>Let’s turn to how our brain works to get an intuitive idea.</p>
<div class="imgcap">
<img src="/blog/assets/rnn/neuron.png" style="border:none;" />
<div class="thecap">
A brain neuron and its main components. Image credit: <a href="https://www.quora.com/What-is-an-intuitive-explanation-for-neural-networks">Quora</a>
</div>
</div>
<blockquote>
<p>Our brain has <strong>a large network of interlinked neurons</strong>, which act as a highway for information to be transmitted from point A to point B. When different information is sent from A to B, the brain <strong>activates different sets of neurons</strong>, and so essentially uses a different route to get from A to B.<br /><br />
At each neuron, dendrites receive incoming signals sent by other neurons. If the neuron <strong>receives a high enough level of signals</strong> within a certain period of time, the neuron sends an electrical pulse into the terminals. These <strong>outgoing signals</strong> are then received by other neurons.<br /><br />
Credit: <a href="https://www.quora.com/What-is-an-intuitive-explanation-for-neural-networks">Quora answer by Annalyn Ng</a> (Quote part of her answer, click if you are interested, which I recommend you to do.)</p>
</blockquote>
<h3 id="modeling-human-neural-network">Modeling Human Neural Network</h3>
<p>Let’s recap what we can learn from last section:</p>
<ul>
<li>Information is sent between neurons.</li>
<li>A neuron can be activated when received certain signal.</li>
</ul>
<p>Here I’ll introduce a vanilla Neural Network model, with a vector of length 8 as input, a hidden layer of 4 neurons, and a vector of length 3 as output. This NN model can be trained as an image classifier for 3 tags.</p>
<div class="imgcap">
<img src="/blog/assets/rnn/vanilla-nn-activated.png" style="border:none;" />
<div class="thecap">
Vanilla Neural Network architecture, with one hidden layer
</div>
<div class="thecap">
You can imagine that if you feed a picture of an animal into the network, it can output a tag of "cat" or "dog".
</div>
<div class="thecap">
`W` and `b` are parameters of the model, which can be learned by training.
</div>
</div>
<p>As you can see in the picture, when facing <strong>an input of a cat picture</strong>, the first and third (from left) <strong>neurons are activated</strong> (red), and then these neurons send signals to the output layer and <strong>produce a “cat” output</strong>.</p>
<p>Think about it: our brain does the same thing <strong>when we see a cat</strong>, some <strong>neurons are triggered</strong> and we <strong>find out that it is a cat!</strong></p>
<p>This is the intuition I see in Neural Networks. Though it is a simplified and idealized model, just like most models in the world, it has the potential to improve machine intelligence.</p>
<blockquote>
<p>Practically, Neural Networks contain millions, even billions, of <em>neurons</em>, each sensitive to certain inputs, automatically learning features and representations, and thus obtaining intelligence.</p>
</blockquote>
<p>Till now, I believe that you have obtained an intuitive idea of how Neural Networks work. Next, I’ll introduce how data flows from input to output, <strong>the feedforward process</strong>, and how the weight matrices and biases are trained, <strong>the backpropagation process</strong>.</p>
<h3 id="feedforward">Feedforward</h3>
<p>The feedforward process is straightforward: you can treat the network as a function, feed it an input, and it will give you an output.</p>
<script type="math/tex; mode=display">y = NN(x)</script>
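<p>As a concrete sketch, here is the feedforward pass of the 8-input, 4-hidden, 3-output network described above, in pure Python. The random weight initialization and the sigmoid activation are illustrative choices, not prescribed by the text:</p>

```python
# A minimal feedforward pass for an 8 -> 4 -> 3 network (pure Python sketch).
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, b, x):
    # one layer: out_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def init_layer(n_out, n_in):
    # small random weights, zero bias (an illustrative initialization)
    W = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

W1, b1 = init_layer(4, 8)   # input (length 8) -> hidden (4 neurons)
W2, b2 = init_layer(3, 4)   # hidden (4 neurons) -> output (length 3)

def NN(x):
    h = [sigmoid(z) for z in linear(W1, b1, x)]      # hidden activations
    return [sigmoid(z) for z in linear(W2, b2, h)]   # one score per tag

y = NN([0.5] * 8)
print(len(y))  # 3
```

<p>Training would then adjust <code class="language-plaintext highlighter-rouge">W1, b1, W2, b2</code> so that the right output neuron lights up, which is exactly what backpropagation does.</p>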
<h3 id="backpropagation">Backpropagation</h3>
<p>The backpropagation process tunes the parameters, e.g. <code class="language-plaintext highlighter-rouge">W</code> and <code class="language-plaintext highlighter-rouge">b</code>, to fit the training data and minimize the total loss.
I’ll take a simple linear regression model as an example to illustrate how backpropagation works with Stochastic Gradient Descent (SGD).</p>
<p>Consider a network that takes a scalar <code class="language-plaintext highlighter-rouge">x</code> as input and outputs <code class="language-plaintext highlighter-rouge">y = ax + b</code>. If we train the model on the training set:</p>
<script type="math/tex; mode=display">X = [1, 2, 3, 4]</script>
<script type="math/tex; mode=display">Y = [4, 3, 2, 1]</script>
<p>The model will soon fit the function <script type="math/tex">y = -x + 5</script>, using a <em>quadratic loss function</em> and <em>SGD</em>.</p>
<p>Let’s take the derivative first!</p>
<script type="math/tex; mode=display">y = Wx + b ,\qquad
loss = \frac{1}{2} * {(y - y\_)}^{2}</script>
<script type="math/tex; mode=display">\frac{dloss}{dy} = y - y\_ ,\qquad
\frac{dy}{dw} = x,\qquad \frac{dy}{db} = 1</script>
<script type="math/tex; mode=display">\frac{dloss}{dw} = \frac{dloss}{dy} * \frac{dy}{dw} = (y - y\_) * x,\qquad
\frac{dloss}{db} = \frac{dloss}{dy} * \frac{dy}{db} = (y - y\_)</script>
<blockquote>
<p>As you may acknowledge, <strong>Chain Rule</strong> is the key tool we should utilize. Backpropagation propagates through the chain rule from back to front.</p>
</blockquote>
<p>Time to code! A toy implementation with pure Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span> <span class="c1"># training set
</span><span class="n">Y</span> <span class="o">=</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">w</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># initiate parameter as 0
</span><span class="n">b</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">lr</span> <span class="o">=</span> <span class="mf">0.1</span> <span class="c1"># learning rate
</span><span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">200</span><span class="p">):</span>
<span class="n">total_loss</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">)):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
<span class="n">y_</span> <span class="o">=</span> <span class="n">Y</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">w</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="n">b</span> <span class="c1"># feed forward
</span> <span class="n">loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">y_</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span> <span class="o">*</span> <span class="mf">0.5</span>
<span class="n">total_loss</span> <span class="o">+=</span> <span class="n">loss</span> <span class="c1"># accumulate loss
</span>
<span class="n">dy</span> <span class="o">=</span> <span class="n">y</span> <span class="o">-</span> <span class="n">y_</span> <span class="c1"># calculate derivative
</span> <span class="n">dw</span> <span class="o">=</span> <span class="n">dy</span> <span class="o">*</span> <span class="n">x</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">dy</span> <span class="o">*</span> <span class="mi">1</span>
<span class="n">w</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">dw</span> <span class="c1"># backpropagation
</span> <span class="n">b</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">db</span>
<span class="k">print</span><span class="p">(</span><span class="s">"After iteration {}, loss: {:.2f}. y = {:.2f}x + {:.2f}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">total_loss</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">b</span><span class="p">))</span>
</code></pre></div></div>
<p>Output:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">After</span> <span class="n">iteration</span> <span class="mi">0</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">11.12</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.10</span><span class="n">x</span> <span class="o">+</span> <span class="mf">0.34</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">1</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">9.80</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.16</span><span class="n">x</span> <span class="o">+</span> <span class="mf">0.68</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">2</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">8.45</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.22</span><span class="n">x</span> <span class="o">+</span> <span class="mf">0.99</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">3</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">7.29</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.28</span><span class="n">x</span> <span class="o">+</span> <span class="mf">1.27</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">4</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">6.28</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.33</span><span class="n">x</span> <span class="o">+</span> <span class="mf">1.54</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">5</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">5.41</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.38</span><span class="n">x</span> <span class="o">+</span> <span class="mf">1.79</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">6</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">4.67</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.42</span><span class="n">x</span> <span class="o">+</span> <span class="mf">2.02</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">7</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">4.02</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.46</span><span class="n">x</span> <span class="o">+</span> <span class="mf">2.23</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">8</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">3.47</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.50</span><span class="n">x</span> <span class="o">+</span> <span class="mf">2.43</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">9</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">2.99</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.54</span><span class="n">x</span> <span class="o">+</span> <span class="mf">2.61</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">10</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">2.58</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.57</span><span class="n">x</span> <span class="o">+</span> <span class="mf">2.78</span>
<span class="o">...</span>
<span class="o">...</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">51</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.01</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.98</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.89</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">52</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.01</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.98</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.90</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">53</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.00</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.98</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.91</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">54</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.00</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.98</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.92</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">55</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.00</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.98</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.92</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">56</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.00</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.99</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.93</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">57</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.00</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.99</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.93</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">58</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.00</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.99</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.94</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">59</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.00</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.99</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.94</span>
<span class="n">After</span> <span class="n">iteration</span> <span class="mi">60</span><span class="p">,</span> <span class="n">loss</span><span class="p">:</span> <span class="mf">0.00</span><span class="o">.</span> <span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.99</span><span class="n">x</span> <span class="o">+</span> <span class="mf">4.95</span>
<span class="o">...</span>
<span class="o">...</span>
</code></pre></div></div>
<p>Let’s recap what we have learned:</p>
<ul>
<li>An intuitive idea of how Neural Networks models human brain and how it works</li>
<li>Feedforward process acts like a simple <em>function</em></li>
<li>Backpropagation utilizes <strong>Chain Rule</strong> to calculate derivative of each parameter, updates the parameters and minimize the loss</li>
</ul>
<h3 id="application-word-vector">Application: Word Vector</h3>
<p>It is hard to encode words in a form that computers can use while keeping their “meaning”. A common way to capture “meaning” is to build a synonym set or a hypernym (is-a) relationship set, like <em>WordNet</em>. It’s useful but largely limited: limited to the relation set and the vocabulary set.</p>
<p>If we regard words as atomic symbols, we can use a <em>one-hot representation</em> to encode all the words. Such a representation also suffers from a lack of “meaning” and flexibility (adding new words influences the vectors of all words), and from heavy memory usage (13M words in the Google News 1T corpus). The inner product of 2 different word vectors is always <code class="language-plaintext highlighter-rouge">0</code>, which tells us nothing.</p>
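<p>A quick sketch of this last point, with a toy vocabulary of my own choosing: the inner product of any two different one-hot vectors is 0, so the encoding carries no notion of similarity at all.</p>

```python
# One-hot vectors: every pair of distinct words is equally "unrelated".
vocab = ["hotel", "motel", "banking"]

def one_hot(word):
    # a vector with a single 1 at the word's index
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(dot(one_hot("hotel"), one_hot("motel")))  # 0, although the meanings are close
print(dot(one_hot("hotel"), one_hot("hotel")))  # 1
```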
<p>However, researchers came up with the idea that we can get the meaning of a word from its neighbors.</p>
<blockquote>
<p>“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)</p>
</blockquote>
<p>It reminds me of the years when I was new to English. As a non-native speaker, I always looked up unknown words in a dictionary, but teachers said, “Never look it up at first glance! <em>Guess its meaning from the context first!</em>”.</p>
<p>Here’s an example about word and context from CS224n:</p>
<blockquote>
<p>government debt problems turning into <strong>banking</strong> crises as has happened in <br />
saying that Europe needs unified <strong>banking</strong> regulation to replace the hodgepodge</p>
</blockquote>
<p>The words in the context represent <em>“banking”</em>!</p>
<p>In recent years, with the work of <a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Mikolov et al.</a>, distributed representations of words can be trained fast and successfully capture words’ semantic meaning. Till now, there are 3 main models to train word vectors:</p>
<ol>
<li><a href="https://code.google.com/archive/p/word2vec/">word2vec</a></li>
<li><a href="https://nlp.stanford.edu/projects/glove/">GloVe</a></li>
<li><a href="https://github.com/facebookresearch/fastText">FastText</a></li>
</ol>
<p>I’ll introduce the word2vec model, since these 3 models are almost the same: GloVe also takes statistical features into account, and FastText predicts tags instead of words. I won’t go into detail on the training tricks, such as negative sampling and hierarchical softmax, to name just a few.</p>
<p>The key idea of word2vec is:</p>
<blockquote>
<p>Predict between <strong>every word</strong> and its <strong>context words</strong>.</p>
</blockquote>
<p>So obviously there are two algorithms:</p>
<ol>
<li><strong>Skip-Gram (SG)</strong>: Predict context words given target word</li>
<li><strong>Continuous-Bag-Of-Words (CBOW)</strong>: Predict target word given context words</li>
</ol>
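<p>To make the two algorithms concrete, here is a sketch of how each slices a sentence into training examples; the sentence matches the figure below, while the window size and helper names are illustrative choices of mine:</p>

```python
# How SG and CBOW turn a sentence into training examples (window size 2).
sentence = "selling these fine leather jackets".split()

def skip_gram_pairs(words, window=2):
    # SG: predict each context word from the target word
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

def cbow_pairs(words, window=2):
    # CBOW: predict the target word from all of its context words
    pairs = []
    for i, target in enumerate(words):
        context = [words[j]
                   for j in range(max(0, i - window), min(len(words), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

print(skip_gram_pairs(sentence)[:2])  # [('selling', 'these'), ('selling', 'fine')]
print(cbow_pairs(sentence)[2])        # (['selling', 'these', 'leather', 'jackets'], 'fine')
```

<p>Training a softmax classifier over these examples, with a word-vector lookup as the input layer, is essentially what word2vec does.</p>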
<div class="imgcap">
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" id="processonSvg1000" viewBox="130.5 233.0 773.5 433.0" width="773.5" height="433.0"><defs id="ProcessOnDefs1001" /><g id="ProcessOnG1002"><path id="ProcessOnPath1003" d="M130.5 233.0H904.0V666.0H130.5V233.0Z" fill="none" /><g id="ProcessOnG1004"><g id="ProcessOnG1005" transform="matrix(1.0,0.0,0.0,1.0,169.0,379.0)" opacity="1.0"><path id="ProcessOnPath1006" d="M0.0 0.0L307.0 0.0L307.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1007" transform="matrix(1.0,0.0,0.0,1.0,169.0,460.0)" opacity="1.0"><path id="ProcessOnPath1008" d="M0.0 0.0L307.0 0.0L307.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1009" transform="matrix(1.0,0.0,0.0,1.0,150.5,537.0)" opacity="1.0"><path id="ProcessOnPath1010" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1011" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1012" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">selling</text></g></g><g id="ProcessOnG1013" transform="matrix(1.0,0.0,0.0,1.0,236.5,537.0)" opacity="1.0"><path id="ProcessOnPath1014" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1015" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1016" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">these</text></g></g><g id="ProcessOnG1017" transform="matrix(1.0,0.0,0.0,1.0,322.5,537.0)" opacity="1.0"><path 
id="ProcessOnPath1018" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1019" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1020" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">leather</text></g></g><g id="ProcessOnG1021" transform="matrix(1.0,0.0,0.0,1.0,408.5,537.0)" opacity="1.0"><path id="ProcessOnPath1022" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1023" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1024" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">jackets</text></g></g><g id="ProcessOnG1025" transform="matrix(1.0,0.0,0.0,1.0,312.5,415.0)" opacity="1.0"><path id="ProcessOnPath1026" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1027" transform="matrix(1.0,0.0,0.0,1.0,183.5,496.0)" opacity="1.0"><path id="ProcessOnPath1028" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1029" transform="matrix(1.0,0.0,0.0,1.0,269.5,496.0)" opacity="1.0"><path id="ProcessOnPath1030" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1031" 
transform="matrix(1.0,0.0,0.0,1.0,355.5,496.0)" opacity="1.0"><path id="ProcessOnPath1032" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1033" transform="matrix(1.0,0.0,0.0,1.0,441.5,496.0)" opacity="1.0"><path id="ProcessOnPath1034" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1035" transform="matrix(1.0,0.0,0.0,1.0,312.5,334.0)" opacity="1.0"><path id="ProcessOnPath1036" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1037" transform="matrix(1.0,0.0,0.0,1.0,242.5,299.0)" opacity="1.0"><path id="ProcessOnPath1038" d="M0.0 0.0L160.0 0.0L160.0 40.0L0.0 40.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1039" transform="matrix(1.0,0.0,0.0,1.0,0.0,8.125)"><text id="ProcessOnText1040" fill="#000000" font-size="19" x="79.0" y="19.475" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="19">fine</text></g></g><g id="ProcessOnG1041" transform="matrix(1.0,0.0,0.0,1.0,242.5,253.0)" opacity="1.0"><path id="ProcessOnPath1042" d="M0.0 0.0L160.0 0.0L160.0 40.0L0.0 40.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1043" transform="matrix(1.0,0.0,0.0,1.0,0.0,7.5)"><text id="ProcessOnText1044" fill="#000000" font-size="20" x="79.0" y="20.5" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" 
size="20">CBOW</text></g></g><g id="ProcessOnG1045" transform="matrix(1.0,0.0,0.0,1.0,540.0,379.0)" opacity="1.0"><path id="ProcessOnPath1046" d="M0.0 0.0L72.0 0.0L72.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1047" transform="matrix(1.0,0.0,0.0,1.0,628.0,379.0)" opacity="1.0"><path id="ProcessOnPath1048" d="M0.0 0.0L72.0 0.0L72.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1049" transform="matrix(1.0,0.0,0.0,1.0,717.0,379.0)" opacity="1.0"><path id="ProcessOnPath1050" d="M0.0 0.0L72.0 0.0L72.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1051" transform="matrix(1.0,0.0,0.0,1.0,805.0,379.0)" opacity="1.0"><path id="ProcessOnPath1052" d="M0.0 0.0L72.0 0.0L72.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1053" transform="matrix(1.0,0.0,0.0,1.0,540.0,460.0)" opacity="1.0"><path id="ProcessOnPath1054" d="M0.0 0.0L72.0 0.0L72.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1055" transform="matrix(1.0,0.0,0.0,1.0,628.0,460.0)" opacity="1.0"><path id="ProcessOnPath1056" d="M0.0 0.0L72.0 0.0L72.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1057" transform="matrix(1.0,0.0,0.0,1.0,717.0,460.0)" opacity="1.0"><path id="ProcessOnPath1058" d="M0.0 0.0L72.0 0.0L72.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1059" transform="matrix(1.0,0.0,0.0,1.0,805.0,460.0)" opacity="1.0"><path id="ProcessOnPath1060" d="M0.0 0.0L72.0 0.0L72.0 36.0L0.0 36.0Z" stroke="#323232" stroke-width="2.0" 
stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1061" transform="matrix(1.0,0.0,0.0,1.0,566.0,415.0)" opacity="1.0"><path id="ProcessOnPath1062" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1063" transform="matrix(1.0,0.0,0.0,1.0,654.0,415.0)" opacity="1.0"><path id="ProcessOnPath1064" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1065" transform="matrix(1.0,0.0,0.0,1.0,743.0,415.0)" opacity="1.0"><path id="ProcessOnPath1066" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1067" transform="matrix(1.0,0.0,0.0,1.0,831.0,415.0)" opacity="1.0"><path id="ProcessOnPath1068" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1069" transform="matrix(1.0,0.0,0.0,1.0,533.0,537.0)" opacity="1.0"><path id="ProcessOnPath1070" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1071" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1072" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">fine</text></g></g><g id="ProcessOnG1073" transform="matrix(1.0,0.0,0.0,1.0,619.0,537.0)" opacity="1.0"><path id="ProcessOnPath1074" d="M0.0 
0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1075" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1076" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">fine</text></g></g><g id="ProcessOnG1077" transform="matrix(1.0,0.0,0.0,1.0,710.0,537.0)" opacity="1.0"><path id="ProcessOnPath1078" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1079" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1080" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">fine</text></g></g><g id="ProcessOnG1081" transform="matrix(1.0,0.0,0.0,1.0,798.0,537.0)" opacity="1.0"><path id="ProcessOnPath1082" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1083" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1084" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">fine</text></g></g><g id="ProcessOnG1085" transform="matrix(1.0,0.0,0.0,1.0,566.0,496.0)" opacity="1.0"><path id="ProcessOnPath1086" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1087" transform="matrix(1.0,0.0,0.0,1.0,654.0,496.0)" opacity="1.0"><path id="ProcessOnPath1088" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 
10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1089" transform="matrix(1.0,0.0,0.0,1.0,743.0,496.0)" opacity="1.0"><path id="ProcessOnPath1090" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1091" transform="matrix(1.0,0.0,0.0,1.0,831.0,496.0)" opacity="1.0"><path id="ProcessOnPath1092" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1093" transform="matrix(1.0,0.0,0.0,1.0,533.0,303.0)" opacity="1.0"><path id="ProcessOnPath1094" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1095" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1096" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">selling</text></g></g><g id="ProcessOnG1097" transform="matrix(1.0,0.0,0.0,1.0,619.0,303.0)" opacity="1.0"><path id="ProcessOnPath1098" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1099" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1100" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">these</text></g></g><g id="ProcessOnG1101" transform="matrix(1.0,0.0,0.0,1.0,705.0,303.0)" opacity="1.0"><path id="ProcessOnPath1102" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" 
stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1103" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1104" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">leather</text></g></g><g id="ProcessOnG1105" transform="matrix(1.0,0.0,0.0,1.0,791.0,303.0)" opacity="1.0"><path id="ProcessOnPath1106" d="M0.0 0.0L86.0 0.0L86.0 31.0L0.0 31.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1107" transform="matrix(1.0,0.0,0.0,1.0,0.0,4.25)"><text id="ProcessOnText1108" fill="#000000" font-size="18" x="42.0" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">jackets</text></g></g><g id="ProcessOnG1109" transform="matrix(1.0,0.0,0.0,1.0,566.0,334.0)" opacity="1.0"><path id="ProcessOnPath1110" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1111" transform="matrix(1.0,0.0,0.0,1.0,652.0,334.0)" opacity="1.0"><path id="ProcessOnPath1112" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1113" transform="matrix(1.0,0.0,0.0,1.0,743.0,334.0)" opacity="1.0"><path id="ProcessOnPath1114" d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1115" transform="matrix(1.0,0.0,0.0,1.0,824.0,334.0)" opacity="1.0"><path id="ProcessOnPath1116" 
d="M10.0 0.0L20.0 10.0L13.4 10.0L13.4 45.0L6.6000000000000005 45.0L6.6000000000000005 10.0L0.0 10.0L10.0 0.0Z" stroke="#323232" stroke-width="2.0" stroke-dasharray="none" opacity="1.0" fill="#ffffff" /></g><g id="ProcessOnG1117" transform="matrix(1.0,0.0,0.0,1.0,624.0,253.0)" opacity="1.0"><path id="ProcessOnPath1118" d="M0.0 0.0L160.0 0.0L160.0 40.0L0.0 40.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1119" transform="matrix(1.0,0.0,0.0,1.0,0.0,7.5)"><text id="ProcessOnText1120" fill="#000000" font-size="20" x="79.0" y="20.5" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="20">SKIP-GRAM</text></g></g><g id="ProcessOnG1121" transform="matrix(1.0,0.0,0.0,1.0,340.0,600.0)" opacity="1.0"><path id="ProcessOnPath1122" d="M0.0 0.0L365.0 0.0L365.0 46.0L0.0 46.0Z" stroke="none" stroke-width="0.0" stroke-dasharray="none" opacity="1.0" fill="none" /><g id="ProcessOnG1123" transform="matrix(1.0,0.0,0.0,1.0,0.0,11.75)"><text id="ProcessOnText1124" fill="#000000" font-size="18" x="181.5" y="18.45" font-family="微软雅黑" font-weight="normal" font-style="normal" text-decoration="none" family="微软雅黑" text-anchor="middle" size="18">I am selling these fine leather jackets</text></g></g></g></g></svg>
<div class="thecap">
CBOW vs. SKIP-GRAM
</div>
</div>
<p>As shown in the diagram, CBOW takes the context words as input and predicts the target word, while Skip-Gram (SG) generates context-target pairs and predicts one context word given the target word.</p>
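<p>To make the pair generation concrete, here is a minimal sketch (not from the original post) of how Skip-Gram turns the example sentence from the diagram into (center, context) training pairs with a window size of 2; the function name and window size are illustrative choices.</p>

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context words are the neighbors within `window` positions of the center.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "I am selling these fine leather jackets".split()
print(skipgram_pairs(sentence)[:3])
# → [('I', 'am'), ('I', 'selling'), ('am', 'I')]
```

<p>Each pair becomes one training example: the model is fed the center word and asked to predict a single context word, which is why one sentence yields many examples.</p>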
<p>It’s time to dive deep into the skip-gram model. The diagram below is a clear representation from CS224n.</p>
<div class="imgcap">
<img src="/blog/assets/rnn/skip-gram.png" style="border:none" />
<div class="thecap">
Skip-Gram model
</div>
<div class="thecap">
Image credit: CS224n
</div>
</div>
<p>Define the notation:</p>
<ul>
<li><script type="math/tex">V</script> vocabulary size</li>
<li><script type="math/tex">d</script> dimension of word embedding</li>
<li><script type="math/tex">w_\epsilon</script> one-hot representation of the word</li>
<li><script type="math/tex">W</script> word embedding matrix</li>
<li><script type="math/tex">V_c</script> word vector of the center/target word</li>
<li><script type="math/tex">p(x{\vert}c)</script> the probability of context word x given center/target word c</li>
</ul>
<p>We use the one-hot representation <script type="math/tex">w_\epsilon</script> to encode a word in the dictionary, then look up its representation <script type="math/tex">V_c</script> in the word embedding matrix <script type="math/tex">W</script> via the product <script type="math/tex">V_c = Ww_\epsilon</script>. After that, another product <script type="math/tex">W^{'}V_c</script> computes the hidden representation of the output word, and <script type="math/tex">softmax</script> converts it into the <code class="language-plaintext highlighter-rouge">probability representation</code> of the output word. During training we have the ground-truth answer, so we can compute the loss and backpropagate to tune the model parameters (<script type="math/tex">W, W^{'}</script>).</p>
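<p>The forward pass described above can be sketched in a few lines of numpy; the vocabulary size, embedding dimension, and random matrices here are illustrative assumptions, not values from the post.</p>

```python
import numpy as np

V, d = 7, 5                          # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))          # input embedding matrix
W_out = rng.normal(size=(V, d))      # output embedding matrix W'

w_eps = np.zeros(V)
w_eps[2] = 1.0                       # one-hot vector for the center word (index 2)

v_c = W @ w_eps                      # embedding lookup: V_c = W w_eps
scores = W_out @ v_c                 # hidden representation of the output words
probs = np.exp(scores) / np.exp(scores).sum()   # softmax → distribution over V words

print(probs.shape, probs.sum())      # a length-V distribution summing to 1
```

<p>Note that multiplying by a one-hot vector simply selects one column of <script type="math/tex">W</script>, which is why real implementations replace it with an indexed embedding lookup.</p>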
<p>Given a vector <script type="math/tex">x = [x_0, x_1, ..., x_{n-1}]</script>:</p>
<script type="math/tex; mode=display">softmax{(x)}_i = \frac {e^{x_i}} {\sum_{j} e^{x_j}}</script>
<p>One more note on the diagram: the <code class="language-plaintext highlighter-rouge">3</code> vectors at the output end of the model represent the context words for one center/target word. They stand for the several context-target pairs we generate from the training corpora; it does not mean that we predict several context words in one forward pass.</p>
<p>If you are interested in how to code a word2vec model, here is a <a href="https://gist.github.com/GavinXing/9954ea846072e115bb07d9758892382c">toy example</a> with PyTorch.</p>
<h2 id="chapter-2-recurrent-neural-networks">Chapter 2: Recurrent Neural Networks</h2>
<p>TODO</p>