<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>麦子麦--DBWinds</title>
	<atom:link href="http://www.wzxue.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wzxue.com</link>
	<description>DB，OS，the life，follow datas over the world!</description>
	<lastBuildDate>Mon, 30 Apr 2012 09:44:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>A Quick Guide to libevent's Internals</title>
		<link>http://www.wzxue.com/%e5%bf%ab%e9%80%9f%e7%90%86%e8%a7%a3libevent%e5%86%85%e5%9c%a8%e6%9c%ba%e5%88%b6/</link>
		<comments>http://www.wzxue.com/%e5%bf%ab%e9%80%9f%e7%90%86%e8%a7%a3libevent%e5%86%85%e5%9c%a8%e6%9c%ba%e5%88%b6/#comments</comments>
		<pubDate>Mon, 30 Apr 2012 09:44:19 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[Network Programming]]></category>
		<category><![CDATA[libevent]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=526</guid>
		<description><![CDATA[libevent has been updated to version 2.0, and many older functions such as event_set(), event_init() and event_loop() are now deprecated. libevent has three main structures. struct event: the core structure; the user's callback is bound to an event and is called when the specified I/O condition, signal or timeout occurs. struct event_base: every event is bound to an event_base, which wraps the underlying I/O multiplexing function and manages and schedules all registered events. struct event_config: wraps the configuration parameters (features and flags) so that the event_base constructor stays simple. &#160; struct event_base acts as an instance of the underlying I/O multiplexing function. Each backend defines add, del, dispatch, dealloc and similar functions, giving the upper layer a uniform interface; event_base holds a pointer to struct eventop, the structure that wraps a particular backend. Registering an event with event_base therefore amounts to calling the interface function defined by that backend. event_base also keeps several event chains (active, all and pending events) and a min-heap of timers for timeout callbacks. Signals are integrated through a socket pair: one common handler writes the signal number to the write end, the I/O multiplexing function sees the read end become readable, and the signal event is added to the active chain. In this way I/O, timeouts and signals are all integrated by event_base. &#160; Now let's look more closely at how events are managed. The event structure is defined as follows: TAILQ_ENTRY(event) ev_active_next; union { TAILQ_ENTRY(event) ev_next_with_common_timeout; int min_heap_idx; } ev_timeout_pos; evutil_socket_t ev_fd; struct event_base *ev_base; union { /* used for io events */ struct { LIST_ENTRY (event) ev_io_next; [...]]]></description>
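			<content:encoded><![CDATA[<p>libevent has been updated to version 2.0. Many older functions, such as event_set(), event_init() and event_loop(), are now deprecated.</p>
<p>libevent has three main structures:</p>
<p>struct event: the core structure. The user's callback function is bound to an event and is called when the specified I/O condition, signal or timeout occurs.</p>
<p>struct event_base: every event is bound to an event_base. The event_base wraps the underlying I/O multiplexing function and is responsible for managing and scheduling all registered events.</p>
<p>struct event_config: to keep the event_base constructor from becoming too complicated, event_config wraps the configuration parameters (features and flags).</p>
<p>&nbsp;</p>
<p>struct event_base acts as an instance of the underlying I/O multiplexing function. Each backend defines add, del, dispatch, dealloc and similar functions, giving the upper layer a uniform interface. event_base holds a pointer to struct eventop, the structure that wraps a particular backend.</p>
<p>Registering an event with event_base therefore amounts to calling the interface function defined by that backend.</p>
<p>event_base also keeps several event chains, such as the active-events chain, the all-events chain and the pending-events chain. It also has a min-heap of timers that handles timeout callbacks: on each iteration the smallest timeout is taken from the heap and passed to the I/O multiplexing function. Signal integration is the distinctive part of event_base and uses a socket pair. When a signal event is registered, it is added to the signal chain; instead of installing a separate signal() handler for each one, a single common handler is used. When a signal fires, that handler writes the signal number to the write end of the socket pair, so the I/O multiplexing function sees the read end become readable and the signal event is added to the active chain.</p>
<p>In this way, I/O reads and writes, timeouts and signals are all integrated by event_base.</p>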
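<p>The socket-pair idea can be tried out in a few lines of Python. This is only a sketch of the technique, not libevent's implementation: it relies on the standard library's <code>signal.set_wakeup_fd()</code>, which makes the interpreter write each received signal number as one byte to a file descriptor. The helper names <code>make_signal_pair</code> and <code>wait_for_signal</code> are made up for this illustration, and the code assumes a Unix system and the main thread.</p>

```python
import os
import selectors
import signal
import socket


def make_signal_pair():
    """Create a non-blocking socket pair and route signal notifications to it."""
    read_end, write_end = socket.socketpair()
    read_end.setblocking(False)
    write_end.setblocking(False)
    # From now on every handled signal is also written to write_end as one byte.
    signal.set_wakeup_fd(write_end.fileno())
    return read_end, write_end


def wait_for_signal(read_end, timeout=1.0):
    """Wait in the I/O multiplexer until the read end of the pair is readable."""
    sel = selectors.DefaultSelector()
    sel.register(read_end, selectors.EVENT_READ)
    try:
        if not sel.select(timeout):
            return []                       # timed out, no signal arrived
        return list(read_end.recv(64))      # signal numbers, one per byte
    finally:
        sel.close()
```

<p>A Python-level handler still has to be installed with <code>signal.signal()</code> for the signals of interest; the point of the sketch is that the event loop only ever waits on file descriptors, and signals show up as ordinary readable data.</p>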
<p>&nbsp;</p>
<p>Now let's look more closely at how events are managed. The event structure is defined as follows:</p>
<pre>
TAILQ_ENTRY(event) ev_active_next;
union {
    TAILQ_ENTRY(event) ev_next_with_common_timeout;
    int min_heap_idx;
} ev_timeout_pos;
evutil_socket_t ev_fd;

struct event_base *ev_base;

union {
    /* used for io events */
    struct {
        LIST_ENTRY (event) ev_io_next;
        struct timeval ev_timeout;
    } ev_io;

    /* used by signal events */
    struct {
        LIST_ENTRY (event) ev_signal_next;
        short ev_ncalls;
        /* Allows deletes in callback */
        short *ev_pncalls;
    } ev_signal;
} ev_;

short ev_events;
short ev_res;          /* result passed to event callback */
short ev_flags;
ev_uint8_t ev_pri;     /* smaller numbers are higher priority */
ev_uint8_t ev_closure;
struct timeval ev_timeout;

/* allows us to adopt for different types of events */
void (*ev_callback)(evutil_socket_t, short, void *arg);
void *ev_arg;</pre>
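<p>TAILQ_ENTRY is a macro that defines a structure; for now it is enough to know that ev_active_next links this event into the active chain of its event_base. ev_io and ev_signal are two mutually exclusive event types, so they share a union. ev_events holds the event type; the callback function, its argument and the remaining fields are easy to understand from their names and comments.</p>
<p>Other components:</p>
<p>struct bufferevent: libevent provides an abstraction for buffered I/O. Instead of calling read or write directly on an fd, you read and write with bufferevent_read() and bufferevent_write().</p>
<p>priority: libevent supports event priorities. event_base_priority_init() sets the number of priority levels of an event_base, and by default an event gets the middle value of that range.</p>
<p>The dispatch process of libevent is now clear. First the event_base is initialized, event_new() creates an event, and event_add() makes the event pending. event_base_dispatch() then starts the event loop. On each iteration event_base collects the fds of the registered events, finds the smallest timeout in the timer min-heap, and registers every signal event. When something becomes readable or writable, event_base checks whether the signal socket is readable and collects the triggered signal events, then collects the fd read/write events and the expired timeouts. All of these are added to the active chain, and their callbacks are run in priority order. event_base_loop() additionally accepts flags (such as EVLOOP_ONCE and EVLOOP_NONBLOCK) that configure the loop, for example whether to keep running high-priority events or to check for new active events, and what to do after the callbacks have run.</p>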
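<p>The steps of one loop iteration can be sketched as a toy event loop in Python. This is an illustration under simplifying assumptions, not libevent's code: <code>MiniBase</code>, <code>add_io</code>, <code>add_timer</code> and <code>loop_once</code> are hypothetical names, only read events are supported, and timers always get the middle priority.</p>

```python
import heapq
import selectors
import socket
import time


class MiniBase:
    """Toy stand-in for event_base: a selector, a timer min-heap, priority queues."""

    def __init__(self, npriorities=3):
        self.sel = selectors.DefaultSelector()
        self.timers = []                                 # heap of (deadline, seq, callback)
        self.active = [[] for _ in range(npriorities)]   # one active queue per priority
        self.seq = 0                                     # tie-breaker for equal deadlines

    def add_io(self, sock, callback, priority=1):
        self.sel.register(sock, selectors.EVENT_READ, (priority, callback))

    def add_timer(self, delay, callback):
        self.seq += 1
        heapq.heappush(self.timers, (time.monotonic() + delay, self.seq, callback))

    def loop_once(self):
        # 1. The smallest deadline in the heap bounds how long we may block.
        timeout = None
        if self.timers:
            timeout = max(0.0, self.timers[0][0] - time.monotonic())
        # 2. Wait in the I/O multiplexer; ready I/O events become active.
        for key, _mask in self.sel.select(timeout):
            priority, callback = key.data
            self.active[priority].append(callback)
        # 3. Expired timers become active too (middle priority in this sketch).
        now = time.monotonic()
        while self.timers and now >= self.timers[0][0]:
            _deadline, _seq, callback = heapq.heappop(self.timers)
            self.active[len(self.active) // 2].append(callback)
        # 4. Run the callbacks, lowest priority number first.
        for queue in self.active:
            while queue:
                queue.pop(0)()
```

<p>Calling <code>loop_once()</code> repeatedly gives the behaviour of a dispatch loop; a real implementation would also handle write events, signal events and event removal.</p>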
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/%e5%bf%ab%e9%80%9f%e7%90%86%e8%a7%a3libevent%e5%86%85%e5%9c%a8%e6%9c%ba%e5%88%b6/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Fixtures Are Loaded &#8212; the loaddata Command (Dive Into Django)</title>
		<link>http://www.wzxue.com/how-fixtures-load-loaddata-command-dive-into-django/</link>
		<comments>http://www.wzxue.com/how-fixtures-load-loaddata-command-dive-into-django/#comments</comments>
		<pubDate>Mon, 09 Apr 2012 06:07:49 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[Websites]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[fixtures]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=517</guid>
		<description><![CDATA[class SampleTestCase(TestCase): fixtures = ['person', 'address'] &#160; From django/core/management/commands/loaddata.py, line 95: &#160; for fixture_label in fixture_labels: parts = fixture_label.split('.') &#160; By default fixture_labels is ['initial_data']. Here we define fixtures = ['person', 'address'], so the labels to load are 'initial_data', 'person' and 'address'. &#160; &#160; From django/core/management/commands/loaddata.py, line 98: &#160; if len(parts) &gt; 1 and parts[-1] in [...]]]></description>
			<content:encoded><![CDATA[<pre><code>class SampleTestCase(TestCase):
  fixtures = ['person', 'address']
</code></pre>
<p>&nbsp;</p>
<p>From django/core/management/commands/loaddata.py    Line 95</p>
<p>&nbsp;</p>
<pre><code>for fixture_label in fixture_labels:
    parts = fixture_label.split('.')
</code></pre>
<p>&nbsp;</p>
<p>By default <code>fixture_labels</code> is ['initial_data']. Here we define fixtures = ['person', 'address'], so the labels to load are 'initial_data', 'person' and 'address'.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>From django/core/management/commands/loaddata.py    Line 98</p>
<p>&nbsp;</p>
<pre><code>if len(parts) &gt; 1 and parts[-1] in compression_types:
    compression_formats = [parts[-1]]
    parts = parts[:-1]
else:
    compression_formats = compression_types.keys()

if len(parts) == 1:
    fixture_name = parts[0]
    formats = serializers.get_public_serializer_formats()
else:
    fixture_name, format = '.'.join(parts[:-1]), parts[-1]
    if format in serializers.get_public_serializer_formats():
        formats = [format]
    else:
        formats = []
</code></pre>
<p>&nbsp;</p>
<p>If a name listed in SampleTestCase.fixtures has no compression suffix, <code>compression_formats</code> defaults to every key of <code>compression_types</code>: 'gz', 'zip' and 'bz2', plus <code>None</code> for uncompressed files.</p>
<p>If the name has no format suffix such as &#8216;json&#8217; or &#8216;xml&#8217;, <code>formats</code> is set to all public serializer formats, for example ['xml', 'json'], and the program later has to guess which format the file uses.</p>
<p>Here the names are just person and address, so the program has to guess the formats.</p>
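<p>To see what this branching does to a few labels, here is a small standalone sketch. It is not Django code: <code>parse_fixture_label</code> is a hypothetical helper, and the two constants are assumed stand-ins for <code>compression_types</code> and <code>serializers.get_public_serializer_formats()</code>.</p>

```python
# Assumed stand-ins for Django's compression_types keys and public serializer formats.
COMPRESSION_TYPES = (None, "gz", "zip", "bz2")
SERIALIZER_FORMATS = ("xml", "json")


def parse_fixture_label(label):
    """Split a fixture label into (name, formats, compression_formats)."""
    parts = label.split(".")
    if len(parts) > 1 and parts[-1] in COMPRESSION_TYPES:
        compression_formats = [parts[-1]]
        parts = parts[:-1]
    else:
        # No compression suffix: every compression type has to be tried.
        compression_formats = list(COMPRESSION_TYPES)
    if len(parts) == 1:
        # No format suffix either: every serializer format has to be tried.
        return parts[0], list(SERIALIZER_FORMATS), compression_formats
    name, fmt = ".".join(parts[:-1]), parts[-1]
    formats = [fmt] if fmt in SERIALIZER_FORMATS else []
    return name, formats, compression_formats
```

<p>With these assumptions, <code>parse_fixture_label("person.json.gz")</code> leaves exactly one format and one compression type to try, while <code>parse_fixture_label("person")</code> leaves two formats times four compression types.</p>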
<p>&nbsp;</p>
<p>From django/core/management/commands/loaddata.py    Line 126</p>
<p>&nbsp;</p>
<pre><code>if os.path.isabs(fixture_name):
    fixture_dirs = [fixture_name]
else:
    fixture_dirs = app_fixtures + list(settings.FIXTURE_DIRS) + ['']
</code></pre>
<p>&nbsp;</p>
<p>If a name in <code>fixtures</code> is an absolute path, the program searches only that path. Otherwise it searches every app's fixtures directory, every directory in settings.FIXTURE_DIRS, and the current directory.</p>
<p>&nbsp;</p>
<p>From django/core/management/commands/loaddata.py    Line 136</p>
<p>&nbsp;</p>
<pre><code>for combo in product([using, None], formats, compression_formats):
    database, format, compression_format = combo
    file_name = '.'.join(
        p for p in [
            fixture_name, database, format, compression_format
        ]
        if p
    )
</code></pre>
<p>&nbsp;</p>
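<p>Each <code>combo</code> is one element of the Cartesian product of the database names ([using, None]), <code>formats</code> and <code>compression_formats</code>. <code>file_name</code> joins the fixture name with the non-empty parts of the combo, separated by dots.</p>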
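<p>The number of guesses grows quickly. The following sketch repeats the loop outside Django to list the candidate file names; <code>candidate_file_names</code> is a hypothetical helper and the argument values in the note below are only examples.</p>

```python
from itertools import product


def candidate_file_names(fixture_name, using, formats, compression_formats):
    """List every file name the loop would try, in order."""
    names = []
    for database, fmt, compression in product([using, None], formats, compression_formats):
        # Empty parts (None) are skipped, exactly as in the join above.
        names.append(".".join(p for p in [fixture_name, database, fmt, compression] if p))
    return names
```

<p>For example, <code>candidate_file_names("person", "default", ["xml", "json"], [None, "gz"])</code> yields eight names, from <code>person.default.xml</code> to <code>person.json.gz</code>, and each of them is tried in every fixture directory.</p>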
<p>&nbsp;</p>
<p>From django/core/management/commands/loaddata.py    Line 148</p>
<p>&nbsp;</p>
<pre><code>full_path = os.path.join(fixture_dir, file_name)
open_method = compression_types[compression_format]
try:
    fixture = open_method(full_path, 'r')
</code></pre>
<p><code>full_path</code> is a candidate location, fixture_dir joined with file_name. The program tries to open it and, if that fails, moves on to the next candidate.</p>
<p>In conclusion, it is better to give the complete file name (including the format suffix), and an absolute path where possible, in <code>fixtures</code> so that the program does not waste time guessing.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/how-fixtures-load-loaddata-command-dive-into-django/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sentiment Analysis</title>
		<link>http://www.wzxue.com/sentiment-analysis/</link>
		<comments>http://www.wzxue.com/sentiment-analysis/#comments</comments>
		<pubDate>Sun, 08 Apr 2012 07:39:30 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[Natural Language Process]]></category>
		<category><![CDATA[Sentiment Analysis]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=510</guid>
		<description><![CDATA[Baseline Algorithm Tokenization Feature Extraction Classification using different classifiers Naive Bayes MaxEnt SVM Sentiment Tokenization Issues Deal with HTML and XML markup Twitter mark-up (names, hash tags) Capitalization (preserve for words in all caps) Phone numbers, dates Emoticons Extracting Features for a Sentiment Classifier How to handle negation Which words to use (all words or only adjectives) [...]]]></description>
			<content:encoded><![CDATA[<p>Baseline Algorithm</p>
<ul>
<li>
<p>Tokenization</p>
</li>
<li>
<p>Feature Extraction</p>
</li>
<li>
<p>Classification using different classifiers</p>
<ul>
<li>
<p>Naive Bayes</p>
</li>
<li>
<p>MaxEnt</p>
</li>
<li>
<p>SVM</p>
</li>
</ul>
</li>
</ul>
<h4>Sentiment Tokenization Issues</h4>
<ul>
<li>
<p>Deal with HTML and XML markup</p>
</li>
<li>
<p>Twitter mark-up (names, hash tags)</p>
</li>
<li>
<p>Capitalization (preserve for words in all caps)</p>
</li>
<li>
<p>Phone numbers, dates</p>
</li>
<li>
<p>Emoticons</p>
</li>
</ul>
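<p>Several of these issues can be handled with one regular expression. The tokenizer below is a rough sketch, not a complete solution: the pattern and the <code>tokenize</code> function are made up for illustration, and phone numbers and dates are not treated specially.</p>

```python
import re

# Illustrative pattern; alternatives are tried in order, so emoticons win over symbols.
TOKEN_RE = re.compile(r"""
    [:;=][\-o*']?[)(\]\[dDpP/\\|]      # emoticons such as :) ;-( =D
  | [@\#]\w+                           # Twitter names and hash tags
  | \w+(?:'\w+)?                       # words, with an optional apostrophe part
  | [^\w\s]                            # any other single symbol
""", re.VERBOSE)


def tokenize(text):
    """Strip tags, split into tokens, lowercase everything except ALL-CAPS words."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML/XML tags
    tokens = TOKEN_RE.findall(text)
    return [t if (t.isupper() and len(t) > 1) else t.lower() for t in tokens]
```

<p>Keeping words such as GREAT in capitals preserves emphasis, which can carry sentiment, while ordinary capitalization at the start of a sentence is normalized away.</p>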
<h4>Extracting Features for a Sentiment Classifier</h4>
<ul>
<li>
<p>How to handle negation</p>
</li>
<li>
<p>Which words to use (all words or only adjectives)</p>
</li>
</ul>
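<p>One common way to handle negation is to prefix NOT_ to every word between a negation word and the next punctuation mark, so that "like" and "NOT_like" become different features. A minimal sketch, assuming the text is already tokenized; the negation list is illustrative and far from complete:</p>

```python
import re

NEGATIONS = {"not", "no", "never", "n't", "didn't", "don't", "isn't"}   # illustrative list
PUNCTUATION = re.compile(r"^[.,:;!?]$")


def mark_negation(tokens):
    """Prefix NOT_ to every token between a negation word and the next punctuation mark."""
    out, negated = [], False
    for tok in tokens:
        if PUNCTUATION.match(tok):
            negated = False            # punctuation ends the negated span
            out.append(tok)
        elif tok.lower() in NEGATIONS:
            negated = True             # start negating from the next token on
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out
```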
<h4>Boolean Multinomial Naive Bayes</h4>
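<p>In the Boolean (binarized) variant, word counts are clipped to 1 within each document, so what matters is whether a word occurs in a review, not how often. A small sketch of the counting step; <code>boolean_counts</code> is a hypothetical helper, and the counts it returns would replace the raw word counts of an ordinary multinomial Naive Bayes:</p>

```python
from collections import Counter


def boolean_counts(documents):
    """Count, for each word, how many documents contain it (duplicates clipped to 1)."""
    counts = Counter()
    for words in documents:
        counts.update(set(words))      # each word counts at most once per document
    return counts
```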
<h4>Hard problems in sentiment analysis</h4>
<ul>
<li>
<p>Subtle sentiment</p>
</li>
<li>
<p>Thwarted Expectations and Ordering Effects</p>
</li>
</ul>
<h2>Build lexicons</h2>
<h4>Hatzivassiloglou and McKeown intuition for identifying word polarity</h4>
<ul>
<li>
<p>Adjectives conjoined by &#8216;and&#8217; have same polarity</p>
</li>
<li>
<p>Adjectives conjoined by &#8216;but&#8217; do not</p>
</li>
<li>
<p>Label seed set of 1336 adjectives</p>
</li>
<li>
<p>Expand the seed set to conjoined adjectives found using Google</p>
</li>
<li>
<p>A supervised classifier assigns a &#8216;polarity similarity&#8217; to each word pair, resulting in a graph:</p>
</li>
</ul>
<p><a href="http://www.wzxue.com/wp-content/uploads/2012/04/0d11fa7458c6aa5e5473c3fba48dddb6.jpeg"><img class="alignnone size-large wp-image-511" title="0d11fa7458c6aa5e5473c3fba48dddb6" src="http://www.wzxue.com/wp-content/uploads/2012/04/0d11fa7458c6aa5e5473c3fba48dddb6-1024x322.jpg" alt="" width="620" height="194" data-pinit="registered" /></a></p>
<ul>
<li>Clustering for partitioning the graph into two</li>
</ul>
<h4>Turney Algorithm</h4>
<p>&nbsp;</p>
<ol>
<li>
<p>Extract a phrasal lexicon from reviews</p>
</li>
<li>
<p>Learn polarity of each phrase</p>
</li>
<li>
<p>Rate a review by the average polarity of its phrases</p>
</li>
</ol>
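<p>In Turney's algorithm the polarity of a phrase is how much more strongly it is associated with "excellent" than with "poor", measured by pointwise mutual information (PMI). The sketch below computes this from co-occurrence counts. The function names and the <code>eps</code> smoothing default are assumptions of this example, and the counts themselves would come from a search engine or a corpus.</p>

```python
import math


def phrase_polarity(near_excellent, near_poor, hits_excellent, hits_poor, eps=0.01):
    """PMI(phrase, 'excellent') - PMI(phrase, 'poor'), from co-occurrence counts."""
    # eps keeps the ratio finite when a phrase never co-occurs with one of the words.
    return math.log2(((near_excellent + eps) * hits_poor) /
                     ((near_poor + eps) * hits_excellent))


def rate_review(polarities):
    """Average the phrase polarities; a positive mean suggests a positive review."""
    return sum(polarities) / len(polarities)
```

<p>The counts of the phrase itself cancel out in the difference of the two PMI values, which is why only four numbers are needed.</p>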
<p>&nbsp;</p>
<h4>Using WordNet to learn polarity</h4>
<ul>
<li>
<p>WordNet: online thesaurus</p>
</li>
<li>
<p>Create positive and negative seed words</p>
</li>
<li>
<p>Find Synonyms and Antonyms</p>
</li>
<li>
<p>Positive Set: Add synonyms of positive words and antonyms of negative words</p>
</li>
<li>
<p>Negative Set: Add synonyms of negative words and antonyms of positive words.</p>
</li>
<li>
<p>Repeat, following chains of synonyms</p>
</li>
<li>
<p>Filter</p>
</li>
</ul>
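<p>The bootstrapping steps above can be sketched as follows. To stay self-contained, the example uses plain dictionaries as a toy thesaurus instead of WordNet itself; <code>expand_seeds</code> and its arguments are hypothetical, and the filter step here simply drops words that end up in both sets.</p>

```python
def expand_seeds(positive, negative, synonyms, antonyms, rounds=2):
    """Grow the two seed sets by following synonym and antonym links."""
    positive, negative = set(positive), set(negative)
    for _ in range(rounds):
        new_pos = {s for w in positive for s in synonyms.get(w, ())}
        new_pos.update(a for w in negative for a in antonyms.get(w, ()))
        new_neg = {s for w in negative for s in synonyms.get(w, ())}
        new_neg.update(a for w in positive for a in antonyms.get(w, ()))
        positive.update(new_pos)
        negative.update(new_neg)
    # Filter: a word reachable from both seed sets is ambiguous, so drop it.
    ambiguous = positive.intersection(negative)
    return positive - ambiguous, negative - ambiguous
```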
<h4>Finding sentiment of a sentence</h4>
<ul>
<li>
<p>Important for finding aspects or attributes or target of sentiment</p>
</li>
<li>
<p>Frequent phrases + rules</p>
</li>
<li>
<p>Filter by rules like “occurs right after sentiment word”</p>
</li>
</ul>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h4>An example of sentiment analysis</h4>
<p>&nbsp;</p>
<p>Training a Naive Bayes classifier:</p>
<p>&nbsp;</p>
<pre><code>import sys
import getopt
import os
import math
from collections import defaultdict


class NaiveBayes:
    class TrainSplit:
        """Represents a set of training/testing data. self.train is a list of Examples, as is self.test.
        """
        def __init__(self):
            self.train = []
            self.test = []

    class Example:
        """Represents a document with a label. klass is 'pos' or 'neg' by convention.
        words is a list of strings.
        """
        def __init__(self):
            self.klass = ''
            self.words = []

    def __init__(self):
        """NaiveBayes initialization"""
        self.FILTER_STOP_WORDS = False
        self.stopList = set(self.readFile('../data/english.stop'))
        self.numFolds = 10
        self.pos_words = defaultdict(lambda: 0)
        self.neg_words = defaultdict(lambda: 0)
        self.first_classify = True
        self.count_of_pos_examples = 0
        self.count_of_neg_examples = 0

    def classify(self, words):
        """
        'words' is a list of words to classify. Return 'pos' or 'neg' classification.
        """
        if self.first_classify:
            # Cache the corpus totals on the first call; training must be finished by then.
            self.first_classify = False
            self.count_of_vocabulary = len(set(self.pos_words) | set(self.neg_words))
            self.count_of_examples = self.count_of_pos_examples + self.count_of_neg_examples
            self.count_of_pos = sum(self.pos_words.values())
            self.count_of_neg = sum(self.neg_words.values())

        # Sum log-likelihoods with add-one smoothing, then add the log-priors.
        pos_pro = neg_pro = 0
        for word in words:
            pos_pro += math.log(float(self.pos_words.get(word, 0)+1)/(self.count_of_vocabulary+self.count_of_pos))
            neg_pro += math.log(float(self.neg_words.get(word, 0)+1)/(self.count_of_vocabulary+self.count_of_neg))
        pos_pro += math.log(float(self.count_of_pos_examples)/self.count_of_examples)
        neg_pro += math.log(float(self.count_of_neg_examples)/self.count_of_examples)
        if pos_pro &gt; neg_pro:
            return 'pos'
        else:
            return 'neg'

    def addExample(self, klass, words):
        """
        Train your model on an example document with label klass ('pos' or 'neg') and
        words, a list of strings.
        You should store whatever data structures you use for your classifier
        in the NaiveBayes class.
        Returns nothing
        """
        if klass == 'pos':
            d = self.pos_words
            self.count_of_pos_examples += 1
        else:
            d = self.neg_words
            self.count_of_neg_examples += 1

        for i in words:
            d[i] += 1

    def readFile(self, fileName):
        """
        Code for reading a file.  you probably don't want to modify anything here,
        unless you don't like the way we segment files.
        &gt;&gt;&gt; b = NaiveBayes()
        """
        contents = []
        f = open(fileName)
        for line in f:
            contents.append(line)
        f.close()
        result = self.segmentWords('\n'.join(contents))
        return result

    def segmentWords(self, s):
        """
        Splits lines on whitespace for file reading
        """
        return s.split()

    def trainSplit(self, trainDir):
        """Takes in a trainDir, returns one TrainSplit with train set."""
        split = self.TrainSplit()
        posTrainFileNames = os.listdir('%s/pos/' % trainDir)
        negTrainFileNames = os.listdir('%s/neg/' % trainDir)
        for fileName in posTrainFileNames:
            example = self.Example()
            example.words = self.readFile('%s/pos/%s' % (trainDir, fileName))
            example.klass = 'pos'
            split.train.append(example)
        for fileName in negTrainFileNames:
            example = self.Example()
            example.words = self.readFile('%s/neg/%s' % (trainDir, fileName))
            example.klass = 'neg'
            split.train.append(example)
        return split

    def train(self, split):
        for example in split.train:
            words = example.words
            if self.FILTER_STOP_WORDS:
                words = self.filterStopWords(words)
            self.addExample(example.klass, words)

    def crossValidationSplits(self, trainDir):
        """Returns a list of TrainSplits corresponding to the cross validation splits."""
        splits = []
        posTrainFileNames = os.listdir('%s/pos/' % trainDir)
        negTrainFileNames = os.listdir('%s/neg/' % trainDir)
        # for fileName in trainFileNames:
        for fold in range(0, self.numFolds):
            split = self.TrainSplit()
            for fileName in posTrainFileNames:
                example = self.Example()
                example.words = self.readFile('%s/pos/%s' % (trainDir, fileName))
                example.klass = 'pos'
                # The third character of the file name selects the fold.
                if fileName[2] == str(fold):
                    split.test.append(example)
                else:
                    split.train.append(example)
            for fileName in negTrainFileNames:
                example = self.Example()
                example.words = self.readFile('%s/neg/%s' % (trainDir, fileName))
                example.klass = 'neg'
                if fileName[2] == str(fold):
                    split.test.append(example)
                else:
                    split.train.append(example)
            splits.append(split)
        return splits

    def test(self, split):
        """Returns a list of labels for split.test."""
        labels = []
        for example in split.test:
            words = example.words
            if self.FILTER_STOP_WORDS:
                words = self.filterStopWords(words)
            guess = self.classify(words)
            labels.append(guess)
        return labels

    def buildSplits(self, args):
        """Builds the splits for training/testing"""
        trainData = []
        testData = []
        splits = []
        trainDir = args[0]
        if len(args) == 1:
            print('[INFO]\tPerforming %d-fold cross-validation on data set:\t%s' % (self.numFolds, trainDir))

            posTrainFileNames = os.listdir('%s/pos/' % trainDir)
            negTrainFileNames = os.listdir('%s/neg/' % trainDir)
            for fold in range(0, self.numFolds):
                split = self.TrainSplit()
                for fileName in posTrainFileNames:
                    example = self.Example()
                    example.words = self.readFile('%s/pos/%s' % (trainDir, fileName))
                    example.klass = 'pos'
                    if fileName[2] == str(fold):
                        split.test.append(example)
                    else:
                        split.train.append(example)
                for fileName in negTrainFileNames:
                    example = self.Example()
                    example.words = self.readFile('%s/neg/%s' % (trainDir, fileName))
                    example.klass = 'neg'
                    if fileName[2] == str(fold):
                        split.test.append(example)
                    else:
                        split.train.append(example)
                splits.append(split)
        elif len(args) == 2:
            split = self.TrainSplit()
            testDir = args[1]
            print('[INFO]\tTraining on data set:\t%s testing on data set:\t%s' % (trainDir, testDir))
            posTrainFileNames = os.listdir('%s/pos/' % trainDir)
            negTrainFileNames = os.listdir('%s/neg/' % trainDir)
            for fileName in posTrainFileNames:
                example = self.Example()
                example.words = self.readFile('%s/pos/%s' % (trainDir, fileName))
                example.klass = 'pos'
                split.train.append(example)
            for fileName in negTrainFileNames:
                example = self.Example()
                example.words = self.readFile('%s/neg/%s' % (trainDir, fileName))
                example.klass = 'neg'
                split.train.append(example)

            posTestFileNames = os.listdir('%s/pos/' % testDir)
            negTestFileNames = os.listdir('%s/neg/' % testDir)
            for fileName in posTestFileNames:
                example = self.Example()
                example.words = self.readFile('%s/pos/%s' % (testDir, fileName))
                example.klass = 'pos'
                split.test.append(example)
            for fileName in negTestFileNames:
                example = self.Example()
                example.words = self.readFile('%s/neg/%s' % (testDir, fileName))
                example.klass = 'neg'
                split.test.append(example)
            splits.append(split)
        return splits

    def filterStopWords(self, words):
        """Filters stop words."""
        filtered = []
        for word in words:
            if word not in self.stopList and word.strip() != '':
                filtered.append(word)
        return filtered


def main():
    nb = NaiveBayes()
    (options, args) = getopt.getopt(sys.argv[1:], 'f')
    if ('-f','') in options:
        nb.FILTER_STOP_WORDS = True

    splits = nb.buildSplits(args)
    avgAccuracy = 0.0
    fold = 0
    for split in splits:
        # Train a fresh classifier for every split.
        classifier = NaiveBayes()
        accuracy = 0.0
        for example in split.train:
            words = example.words
            if nb.FILTER_STOP_WORDS:
                words = classifier.filterStopWords(words)
            classifier.addExample(example.klass, words)

        for example in split.test:
            words = example.words
            if nb.FILTER_STOP_WORDS:
                words = classifier.filterStopWords(words)
            guess = classifier.classify(words)
            if example.klass == guess:
                accuracy += 1.0

        accuracy = accuracy / len(split.test)
        avgAccuracy += accuracy
        print('[INFO]\tFold %d Accuracy: %f' % (fold, accuracy))
        fold += 1
    avgAccuracy = avgAccuracy / fold
    print('[INFO]\tAccuracy: %f' % avgAccuracy)


if __name__ == "__main__":
    main()
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/sentiment-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comparison of Different Language Models</title>
		<link>http://www.wzxue.com/different-language-model-comparison/</link>
		<comments>http://www.wzxue.com/different-language-model-comparison/#comments</comments>
		<pubDate>Sun, 25 Mar 2012 06:57:42 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[Natural Language Process]]></category>
		<category><![CDATA[Edit model]]></category>
		<category><![CDATA[Language Model]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=503</guid>
		<description><![CDATA[I implemented a noisy-channel model for spelling correction, including the likelihood term, an edit model and several different language models. I also implemented a test framework to compare the language models. &#160; Data &#160; The data is the writing of secondary-school children. It contains: &#160; A corpus to train the language models A corpus [...]]]></description>
			<content:encoded><![CDATA[<p>I implemented a <strong>noisy-channel</strong> model for spelling correction, including the likelihood term, an edit model and several different language models. I also implemented a test framework to compare the language models.</p>
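<p>In the noisy-channel view, the best correction for an observed word x is the candidate w that maximizes P(x|w) P(w), where P(x|w) comes from the edit model and P(w) from the language model. A minimal sketch of that decision rule; <code>best_correction</code>, <code>edit_prob</code> and <code>lm_logprob</code> are placeholder names for this illustration, not part of the code in this post:</p>

```python
import math


def best_correction(observed, candidates, edit_prob, lm_logprob):
    """Return the candidate w maximizing log P(observed | w) + log P(w)."""
    def score(w):
        # Work in log space so that small probabilities do not underflow.
        return math.log(edit_prob(observed, w)) + lm_logprob(w)
    return max(candidates, key=score)
```

<p>The language models compared below only change the <code>lm_logprob</code> part; the edit model and the candidate generation stay the same.</p>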
<p>&nbsp;</p>
<h2>Data</h2>
<p>&nbsp;</p>
<p>The data is the writing of secondary-school children. It contains:</p>
<p>&nbsp;</p>
<ul>
<li>
<p>A corpus to train language model</p>
</li>
<li>
<p>A corpus of spelling errors for development</p>
</li>
<li>
<p>A table listing counts of edits &#8216;x|w&#8217;, taken from Wikipedia.</p>
</li>
</ul>
<p>&nbsp;</p>
<h2>Language Models</h2>
<p>&nbsp;</p>
<ul>
<li>
<p><em>Laplace Unigram Language Model</em>: a unigram model with add-one smoothing. Out-of-vocabulary items are treated as words seen zero times in training.</p>
</li>
<li>
<p><em>Laplace Bigram Language Model</em>: a bigram model with add-one smoothing.</p>
</li>
<li>
<p><em>Stupid Backoff Language Model</em>: an unsmoothed bigram model combined with backoff to an add-one smoothed unigram model.</p>
</li>
<li>
<p><em>Add-k Trigram Language Model</em>: a trigram model with add-k smoothing.</p>
</li>
</ul>
<p>&nbsp;</p>
<p>Every language model conforms to the same interface: it accepts a corpus to train on and takes test sentences to score. The score it returns for a sentence represents how likely that candidate correction is.</p>
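<p>The shared interface is small enough to show with a trivial model. The sketch below is not one of the models compared here: <code>UniformLanguageModel</code> and <code>accuracy</code> are hypothetical, and the corpus is assumed to be a plain list of token lists rather than the corpus class used in the code below.</p>

```python
import math


class UniformLanguageModel:
    """Trivial model with the common interface: train(corpus) and score(sentence)."""

    def __init__(self, corpus):
        self.vocabulary = set()
        self.train(corpus)

    def train(self, corpus):
        for sentence in corpus:
            self.vocabulary.update(sentence)

    def score(self, sentence):
        # Every token gets probability 1 / (V + 1); the +1 leaves room for unknown words.
        return -len(sentence) * math.log(len(self.vocabulary) + 1)


def accuracy(guesses, answers):
    """Fraction of corrections that match the expected words."""
    return sum(g == a for g, a in zip(guesses, answers)) / len(answers)
```

<p>Because every model exposes the same two methods, the test framework can swap models without changing the correction or evaluation code.</p>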
<p>&nbsp;</p>
<h2>Code</h2>
<p>&nbsp;</p>
<pre><code>import collections, math


class LaplaceUnigramLanguageModel:
    """
    &gt;&gt;&gt; import HolbrookCorpus
    &gt;&gt;&gt; l = LaplaceUnigramLanguageModel(HolbrookCorpus.HolbrookCorpus('../data/holbrook-tagged-train.dat'))
    &gt;&gt;&gt; print(l.count)
    """

    def __init__(self, corpus):
        """Initialize your data structures in the constructor."""
        self.data = collections.defaultdict(lambda: 0)
        self.count = 0        # total number of tokens (N)
        self.word_type = 0    # number of distinct words (V)
        self.train(corpus)

    def train(self, corpus):
        """ Takes a corpus and trains your language model.
        Compute any counts or other corpus statistics in this function.
        """
        for sentence in corpus.corpus:
            for datum in sentence.data:
                self.data[datum.word] += 1

        for i in self.data:
            self.word_type += 1
            self.count += self.data[i]

    def score(self, sentence):
        """ Takes a list of strings as argument and returns the log-probability of the
        sentence using your language model. Use whatever data you computed in train() here.
        """
        score = 0.0
        for token in sentence:
            # log P(w) = log(count(w) + 1) - log(N + V); unseen words have count 0
            score += math.log(self.data.get(token, 0) + 1) - math.log(self.count + self.word_type)
        return score</code></pre>
<hr />
<pre><code>import collections, math


class LaplaceBigramLanguageModel:

    def __init__(self, corpus):
        """Initialize your data structures in the constructor."""
        self.pair_words = collections.defaultdict(lambda: 0)
        self.words = collections.defaultdict(lambda: 0)
        self.word_types = 0
        self.train(corpus)

    def train(self, corpus):
        """ Takes a corpus and trains your language model.
        Compute any counts or other corpus statistics in this function.
        """
        for sentence in corpus.corpus:
            for i in range(len(sentence.data)-1):
                if i == 0:
                    key = ' &lt;- %s' % (sentence.data[i].word)
                else:
                    key = '%s &lt;- %s' % (sentence.data[i-1].word, sentence.data[i].word)
                self.pair_words[key] += 1
                self.words[sentence.data[i].word] += 1
            self.words[''] += 1    # one sentence-start context per sentence

    def score(self, sentence):
        """ Takes a list of strings as argument and returns the log-probability of the
        sentence using your language model. Use whatever data you computed in train() here.
        """
        score = 0.0
        vocabulary = len(self.words)
        for i in range(len(sentence)-1):
            if i == 0:
                key = ' &lt;- %s' % (sentence[i])
                key2 = ''
            else:
                key = '%s &lt;- %s' % (sentence[i-1], sentence[i])
                key2 = sentence[i-1]
            # log P(w | prev) = log(count(prev, w) + 1) - log(count(prev) + V)
            score += math.log(self.pair_words.get(key, 0) + 1) - math.log(self.words.get(key2, 0) + vocabulary)
        return score</code></pre>
<hr />
<pre><code>import collections, math


class StupidBackoffLanguageModel:

    def __init__(self, corpus):
        """Initialize your data structures in the constructor."""
        self.pair_words = collections.defaultdict(lambda: 0)
        self.words = collections.defaultdict(lambda: 0)
        self.word_types = 0
        self.count = 0
        self.train(corpus)

    def train(self, corpus):
        """ Takes a corpus and trains your language model.
        Compute any counts or other corpus statistics in this function.
        """
        for sentence in corpus.corpus:
            for i in range(len(sentence.data)-1):
                if i == 0:
                    key = ' &lt;- %s' % (sentence.data[i].word)
                else:
                    key = '%s &lt;- %s' % (sentence.data[i-1].word, sentence.data[i].word)
                self.pair_words[key] += 1
                self.words[sentence.data[i].word] += 1
            self.words[''] += 1    # one sentence-start context per sentence

        for i in self.words:
            self.word_types += 1
            self.count += self.words[i]

    def score(self, sentence):
        """ Takes a list of strings as argument and returns the log-probability of the
        sentence using your language model. Use whatever data you computed in train() here.
        """
        score = 0.0
        for i in range(len(sentence)-1):
            if i == 0:
                key = ' &lt;- %s' % (sentence[i])
                key2 = ''
            else:
                key = '%s &lt;- %s' % (sentence[i-1], sentence[i])
                key2 = sentence[i-1]
            if self.pair_words.get(key, 0):
                # seen bigram: unsmoothed relative frequency
                score += math.log(float(self.pair_words[key]) / self.words[key2])
            else:
                # unseen bigram: back off to the add-one unigram estimate, discounted by 0.4
                score += math.log(0.4 * float(self.words.get(sentence[i], 0) + 1) / (self.count + self.word_types))
        return score</code></pre>
<hr />
<p><code>import collections, math</p>
<p>&nbsp;</p>
<p>class CustomLanguageModel:</p>
<p>&nbsp;</p>
<p>def <strong>init</strong>(self, corpus):</p>
<p>"""Initialize your data structures in the constructor."""</p>
<p>self.pair_words = collections.defaultdict(lambda: 0)</p>
<p>self.three_words = collections.defaultdict(lambda: 0)</p>
<p>self.words = collections.defaultdict(lambda :0)</p>
<p>self.train(corpus)</p>
<p>&nbsp;</p>
<p>def train(self, corpus):</p>
<p>""" Takes a corpus and trains your language model.</p>
<p>Compute any counts or other corpus statistics in this function.</p>
<p>"""</p>
<p>for sentence in corpus.corpus:</p>
<p>for i in range(len(sentence.data)-1):</p>
<p>if i == 0:</p>
<p>key = ' &lt;- %s' % (sentence.data[i].word)</p>
<p>key2 = ''</p>
<p>if i == 1:</p>
<p>key = key2 = '%s &lt;- %s' % (sentence.data[i-1].word, sentence.data[i].word)</p>
<p>else:</p>
<p>key = '%s %s &lt;- %s' % (sentence.data[i-2].word, sentence.data[i-1].word, sentence.data[i].word)</p>
<p>key2 = '%s &lt;- %s' % (sentence.data[i-1].word, sentence.data[i].word)</p>
<p>self.pair_words[key2] += 1</p>
<p>self.three_words[key] += 1</p>
<p>self.words[sentence.data[i].word] += 1</p>
<p>&nbsp;</p>
<p>def score(self, sentence):</p>
<p>""" Takes a list of strings as argument and returns the log-probability of the</p>
<p>sentence using your language model. Use whatever data you computed in train() here.</p>
<p>"""</p>
<p>score = 0.0</p>
<p>&nbsp;</p>
<p>for i in range(len(sentence)-1):</p>
<p>if i == 0:</p>
<p>key = ' &lt;- %s' % (sentence[i])</p>
<p>key2 = ''</p>
<p>elif i == 1:</p>
<p>key = key2 = '%s &lt;- %s' % (sentence[i-1], sentence[i])</p>
<p>else:</p>
<p>key = '%s %s &lt;- %s' % (sentence[i-2], sentence[i-1], sentence[i])</p>
<p>key2 = '%s &lt;- %s' % (sentence[i-1], sentence[i])</p>
<p>score += math.log(self.three_words[key] + float(3) / len(self.words)) / math.log(self.pair_words[key2] + 3)</p>
<p>return score</p>
<p>&nbsp;</p>
<h2></code></h2>
<p>&nbsp;</p>
<h2>Evaluation</h2>
<ul>
<li>
<p>Laplace Unigram Language Model: 0.12</p>
</li>
<li>
<p>Laplace Bigram Language Model: 0.13</p>
</li>
<li>
<p>Stupid Backoff Language Model: 0.12</p>
</li>
<li>
<p>Add-k Trigram Language Model: 0.16</p>
</li>
</ul>
<p>&nbsp;</p>
<p>Each result is the spelling-correction accuracy. There are 471 misspelled words in total, and most of the models correct up to about 60 of them.</p>
<p>&nbsp;</p>
<p>The Laplace unigram and Laplace bigram models reached my goal, but the Stupid Backoff model did not give the expected result, and neither did the trigram model.</p>
<p>&nbsp;</p>
<h2>Shortcomings</h2>
<p>&nbsp;</p>
<ol>
<li>
<p>The edit model is too simple, which limits the spelling correction.</p>
</li>
<li>
<p>Because the training corpus is too small, the Stupid Backoff language model cannot produce a satisfactory result. Stupid Backoff is designed for web-scale corpora.</p>
</li>
</ol>
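<p>To make the second point concrete, here is a minimal sketch of the backoff rule that the Stupid Backoff score() above is built on, run on a toy corpus. The function names and the toy sentences are assumptions made for illustration; 0.4 is the backoff weight used in the code above.</p>

```python
import collections


def train_counts(sentences):
    """Count unigrams and bigrams in a list of tokenized sentences."""
    unigrams = collections.Counter()
    bigrams = collections.Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def stupid_backoff(prev, word, unigrams, bigrams, alpha=0.4):
    """Relative bigram frequency if the bigram was seen, else a discounted
    unigram frequency. The result is a score, not a normalized probability."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total


# Hypothetical toy corpus, only to show the two branches.
toy = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_counts(toy)
seen = stupid_backoff("the", "cat", uni, bi)    # bigram was seen
unseen = stupid_backoff("cat", "dog", uni, bi)  # backs off to the unigram
```

<p>With a corpus this small almost every bigram is unseen, so nearly every score comes from the discounted unigram branch. That is one plausible reason the model does no better than the unigram model here.</p>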
<p>&nbsp;</p>
<h2>References:</h2>
<p>&nbsp;</p>
<ul>
<li>
<p><a href="http://www.52nlp.com">我爱自然语言处理 (I Love Natural Language Processing)</a></p>
</li>
<li>
<p><a href="http://nlp.mit.org">MIT NLP course</a></p>
</li>
</ul>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/different-language-model-comparison/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Language Spell Correct &#8212; Noisy Channel Model</title>
		<link>http://www.wzxue.com/490/</link>
		<comments>http://www.wzxue.com/490/#comments</comments>
		<pubDate>Fri, 23 Mar 2012 16:54:48 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[Natural Language Process]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=490</guid>
		<description><![CDATA[A sentence passes from its original words to the observed words through a noisy channel. &#160; We see an observation of a misspelled word x Find the correct word w w = argmax P(x&#124;w)P(w) &#160; This equation says that the best correction w maximizes the product of the probability that w is misspelled as x and the probability of the word w itself. [...]]]></description>
			<content:encoded><![CDATA[<p>A sentence passes from its original words to the observed words through a noisy channel.</p>
<p>&nbsp;</p>
<ul>
<li>We see an observation of a misspelled word x</li>
<li>Find the correct word w</li>
</ul>
<p><em>w</em> = argmax P(x|w)P(w)</p>
<p>&nbsp;</p>
<p>This equation says that the best correction w is the word that maximizes the product of two terms: the probability that w is misspelled as x, P(x|w), and the probability of the word w itself, P(w).</p>
<h3>Candidate generation</h3>
<ol>
<li>
<p> words with similar spelling</p>
</li>
<li>
<p> words with similar pronunciation</p>
</li>
</ol>
<p>&nbsp;</p>
<p>We build the candidate list by computing the edit distance between strings.</p>
<p>&nbsp;</p>
<p>Compute the minimum edit distance between two strings, where the edits are:</p>
<ul>
<li>Insertion</li>
<li>Deletion</li>
<li>Substitution</li>
<li>Transposition of two adjacent letters</li>
</ul>
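<p>The four edit types above can be sketched as a candidate generator that enumerates every string one edit away from the observed word and keeps those found in a vocabulary. The names single_edits and candidates and the small vocabulary are assumptions for illustration, not part of the original course material.</p>

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)


def candidates(word, vocabulary):
    """Keep only the edits that are real words in `vocabulary`."""
    return single_edits(word) & set(vocabulary)


# Hypothetical vocabulary; a real one would come from a dictionary or corpus.
vocab = {"actress", "cress", "caress", "access", "across", "acres"}
found = candidates("acress", vocab)
```

<p>Every word in this small vocabulary is exactly one insertion, deletion, substitution, or transposition away from &#8216;acress&#8217;, so all six are returned as candidates.</p>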
<p><img class="alignnone size-full wp-image-499" title="b5ee9908940b5c8557d965f62ad8c489" src="http://www.wzxue.com/wp-content/uploads/2012/03/b5ee9908940b5c8557d965f62ad8c489.jpeg" alt="" width="820" height="441" data-pinit="registered" /></p>
<p>&nbsp;</p>
<h3>Language Model</h3>
<ul>
<li>Unigram, bigram, trigram</li>
<li>Web-scale spelling correction</li>
<ul>
<li>Stupid back-off</li>
</ul>
</ul>
<p>For example, for the unigram model we can estimate P(w) as follows:</p>
<p><img class="alignnone size-full wp-image-491" title="0b37801a73590fe9cf2a7afe0c338fc9" src="http://www.wzxue.com/wp-content/uploads/2012/03/0b37801a73590fe9cf2a7afe0c338fc9.jpeg" alt="" width="825" height="325" data-pinit="registered" /></p>
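<p>As a rough sketch, the unigram estimate is just a relative frequency: P(w) = count(w) / N, where N is the number of tokens. The function name and the toy sentence below are assumptions for illustration.</p>

```python
import collections


def unigram_probabilities(tokens):
    """Maximum-likelihood estimate P(w) = count(w) / N."""
    counts = collections.Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus of 8 tokens.
tokens = "the cat saw the dog and the bird".split()
p = unigram_probabilities(tokens)
```

<p>Here &#8216;the&#8217; occurs 3 times out of 8 tokens, so p["the"] is 0.375.</p>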
<h3>Channel Model</h3>
<p><em>Edit probability</em></p>
<p>P(x|w) = probability of the edit (deletion/insertion/substitution/transposition)</p>
<p>&nbsp;</p>
<p><em>Error probability</em></p>
<p>We need to estimate the error probability, that is, how often each of the mistakes above happens.</p>
<p><a href="http://www.wzxue.com/wp-content/uploads/2012/03/5199a999530b83f37b93593bbddb03f0.jpeg"><img class="alignnone size-full wp-image-497" title="5199a999530b83f37b93593bbddb03f0" src="http://www.wzxue.com/wp-content/uploads/2012/03/5199a999530b83f37b93593bbddb03f0.jpeg" alt="" width="830" height="467" data-pinit="registered" /></a></p>
<p>This table shows the confusion matrix for spelling errors.</p>
<p>&nbsp;</p>
<h3>Generating confusion lists by P(x|w)</h3>
<p><img class="alignnone size-full wp-image-492" title="2a44445fb7f0b7ee0a42f6e99ed7c841" src="http://www.wzxue.com/wp-content/uploads/2012/03/2a44445fb7f0b7ee0a42f6e99ed7c841.jpeg" alt="" width="877" height="393" data-pinit="registered" /></p>
<p>&nbsp;</p>
<h3>Channel model for &#8216;acress&#8217;</h3>
<p>&nbsp;</p>
<p><a href="http://www.wzxue.com/wp-content/uploads/2012/03/6e2f9e0a4029992b2d8a0d5331a739ea.jpeg"><img class="alignnone size-full wp-image-493" title="6e2f9e0a4029992b2d8a0d5331a739ea" src="http://www.wzxue.com/wp-content/uploads/2012/03/6e2f9e0a4029992b2d8a0d5331a739ea.jpeg" alt="" width="659" height="410" data-pinit="registered" /></a></p>
<h3>Noisy channel probability for acress</h3>
<p><a href="http://www.wzxue.com/wp-content/uploads/2012/03/90e28d69a1ab69108085e0bd03402839.jpeg"><img class="alignnone size-large wp-image-495" title="90e28d69a1ab69108085e0bd03402839" src="http://www.wzxue.com/wp-content/uploads/2012/03/90e28d69a1ab69108085e0bd03402839-1024x412.jpg" alt="" width="620" height="249" data-pinit="registered" /></a></p>
<p>=======================================</p>
<p>This table shows the process of computing the noisy channel probability for acress.</p>
<ol>
<li>List the candidate words produced by insertion, deletion, substitution, and transposition.</li>
<li>Get P(word) from a language model such as the unigram model.</li>
<li>From a confusion matrix (the probability of typing &#8216;a&#8217; as &#8216;b&#8217;, of inserting &#8216;a&#8217; before &#8216;b&#8217;, and so on), get P(x|w) for each candidate word.</li>
<li>Score each candidate by P(x|w)*P(w) and choose the w with the highest score.</li>
</ol>
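<p>The four steps above can be sketched as a small ranking function. The function name is hypothetical, and the probabilities below are made-up numbers chosen only to show the mechanics; they are not the values from the table.</p>

```python
def best_correction(observed, candidates, channel, language_model):
    """Return the candidate w that maximizes P(observed | w) * P(w)."""
    def score(w):
        return channel[(observed, w)] * language_model[w]
    return max(candidates, key=score)


# Made-up numbers for illustration only; they are not taken from any corpus.
language_model = {"actress": 2e-5, "across": 3e-4, "acres": 3e-5}
channel = {
    ("acress", "actress"): 1e-4,
    ("acress", "across"): 1e-5,
    ("acress", "acres"): 3e-5,
}
winner = best_correction("acress", list(language_model), channel, language_model)
```

<p>With these toy numbers &#8216;across&#8217; wins: its channel probability is the lowest of the three, but its much higher P(w) outweighs that.</p>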
<p>For better accuracy, we can replace the unigram model with a bigram model.</p>
<p>The process above then changes as follows:</p>
<p>P(word) =&gt; P(word|previous word)</p>
<p>Furthermore, a trigram model can give even more accurate results.</p>
<p>Two corpora are necessary:</p>
<ol>
<li>A corpus of spelling errors, which gives the probability of each letter operation: insertion, deletion, substitution, and transposition.</li>
<li>A corpus for the language model, which gives the probability that a candidate word appears after the previous word or the previous two words.</li>
</ol>
<p>&nbsp;</p>
<h3>For the real-word spell correction</h3>
<p>We do not know which word in the sentence is wrong, so we must generate a set of candidates for every word.</p>
<p><a href="http://www.wzxue.com/wp-content/uploads/2012/03/9895c020f46990d375a95a2c5f461275.jpeg"><img class="alignnone size-full wp-image-498" title="9895c020f46990d375a95a2c5f461275" src="http://www.wzxue.com/wp-content/uploads/2012/03/9895c020f46990d375a95a2c5f461275.jpeg" alt="" width="657" height="482" data-pinit="registered" /></a></p>
<ul>
<li>For each word in sentence</li>
<ul>
<li>Generate candidate set</li>
<ul>
<li>the word itself</li>
<li>all single-letter edits that are English words</li>
<li>words that are homophones</li>
</ul>
</ul>
<li>Choose best candidates</li>
<ul>
<li>Noisy channel model</li>
<li>Task-specific classifier</li>
</ul>
</ul>
<p><a href="http://www.wzxue.com/wp-content/uploads/2012/03/7aa532b97208148461c4b0b71ec6482d.jpeg"><img class="alignnone size-full wp-image-494" title="7aa532b97208148461c4b0b71ec6482d" src="http://www.wzxue.com/wp-content/uploads/2012/03/7aa532b97208148461c4b0b71ec6482d.jpeg" alt="" width="693" height="296" data-pinit="registered" /></a></p>
<h3>Consider the probability of no error</h3>
<p><a href="http://www.wzxue.com/wp-content/uploads/2012/03/f63eae34b87a6bbff79ae851c0211206.jpeg"><img class="alignnone size-large wp-image-500" title="f63eae34b87a6bbff79ae851c0211206" src="http://www.wzxue.com/wp-content/uploads/2012/03/f63eae34b87a6bbff79ae851c0211206-1024x332.jpg" alt="" width="620" height="201" data-pinit="registered" /></a></p>
<p>The case where the original word is already correct takes up most of the probability mass.</p>
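<p>One simple way to model this, sketched below, is to give the typed word a fixed probability of being correct and spread the remainder evenly over the other candidates. The value 0.95, the function name, and the candidate list are assumptions for illustration; in practice the value depends on the application.</p>

```python
def channel_probability(observed, candidate, edit_candidates, p_no_error=0.95):
    """P(observed | candidate) when the observed word may already be correct.

    The typed word keeps probability `p_no_error`; the remaining mass is
    spread evenly over the other candidates (a simplifying assumption).
    """
    if candidate == observed:
        return p_no_error
    others = [c for c in edit_candidates if c != observed]
    return (1.0 - p_no_error) / len(others)


# Hypothetical candidate set for the typed word "thew".
cands = ["thew", "the", "thaw", "threw"]
probs = {c: channel_probability("thew", c, cands) for c in cands}
```

<p>A candidate other than the typed word can then win only if the language model prefers it strongly enough to overcome its small channel probability.</p>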
<h3>Improvements to channel model</h3>
<ul>
<li>Allow richer edits</li>
<ul>
<li>ent -&gt; ant</li>
<li>ph -&gt; f</li>
<li>le -&gt; al</li>
</ul>
<li>Incorporate pronunciation into channel</li>
<li>Factors that could influence p(misspelling|word)</li>
<ul>
<li>The source letter</li>
<li>The target letter</li>
<li>Surrounding letters</li>
<li>The position in word</li>
<li>Nearby keys on the keyboard</li>
<li>Homology on the keyboard</li>
<li>Pronunciations</li>
<li>Likely morpheme transformations</li>
</ul>
</ul>
<p>&nbsp;</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/490/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Django-tinymce Deploy In Detail</title>
		<link>http://www.wzxue.com/django-tinymce-deploy-in-detail/</link>
		<comments>http://www.wzxue.com/django-tinymce-deploy-in-detail/#comments</comments>
		<pubDate>Wed, 21 Mar 2012 02:52:42 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[IT杂谈]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[网站]]></category>
		<category><![CDATA[delopy]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[tinymce]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=479</guid>
		<description><![CDATA[Over the last two days I spent far too much time on django-tinymce. At first, I only wanted a rich text editor in the Django admin pages, and TinyMCE looked like a good solution. I followed the steps on the djangoproject wiki, but the page never displayed the rich text editor. I searched for a fix and found nothing. Then django-tinymce caught my attention, [...]]]></description>
			<content:encoded><![CDATA[<p>Over the last two days I spent far too much time on django-tinymce.</p>
<p>At first, I only wanted a rich text editor in the Django admin pages, and TinyMCE looked like a good solution. I followed the steps on the <a href="https://code.djangoproject.com/wiki/AddWYSIWYGEditor">djangoproject</a> wiki, but the page never displayed the rich text editor.</p>
<p>I searched for a fix and found nothing.</p>
<p>Then django-tinymce caught my attention, so I deployed it. Unluckily, even after I did everything the <a href="http://django-tinymce.googlecode.com/svn/tags/release-1.5/docs/.build/html/index.html">documentation</a> lists, the page showed no change.</p>
<p>I read the documentation again. The configuration section says:<br />
<code>    TINYMCE_JS_URL (default: settings.MEDIA_URL + 'js/tiny_mce/tiny_mce.js')<br />
    The URL of the TinyMCE javascript file.<br />
    TINYMCE_JS_ROOT (default: settings.MEDIA_ROOT + 'js/tiny_mce')<br />
    The filesystem location of the TinyMCE files.<br />
    ...</code></p>
<p>Naturally, &#8216;default&#8217; should mean that the program works with these values when no extra setting is given. In practice it behaved differently.</p>
<p>What a surprise! When I lazily copied the &#8216;default settings&#8217; into settings.py, the page changed, but it looked strange.</p>
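<p>For reference, copying the documented defaults into settings.py looks roughly like the sketch below. TINYMCE_JS_URL and TINYMCE_JS_ROOT are the setting names quoted from the documentation above; the MEDIA_URL and MEDIA_ROOT values are hypothetical and must match your own project.</p>

```python
import os

# Hypothetical project paths; adjust them to your own layout.
MEDIA_URL = '/media/'
MEDIA_ROOT = '/var/www/example/media'

# Spell out the documented "defaults" instead of relying on them.
TINYMCE_JS_URL = MEDIA_URL + 'js/tiny_mce/tiny_mce.js'
TINYMCE_JS_ROOT = os.path.join(MEDIA_ROOT, 'js/tiny_mce')
```

<p>Setting both values explicitly at least makes it obvious which URL the admin page will request for tiny_mce.js.</p>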
<p>Some elements on the page were misplaced. I looked at the settings again and found that &#8216;TINYMCE_JS_URL = "http://debug.example.org/tiny_mce/tiny_mce_src.js"&#8217; was wrong. At first I thought it was my fault, but when I checked the documentation site again, that is exactly the example it shows. It is simply wrong.</p>
<p>My bad luck did not end there. If you want to use popup windows, such as the insert-image or view-HTML plugins, you must update both &#8216;tiny_mce_popup.js&#8217; and settings.py, but the tutorial does not mention this.</p>
<p>What a terrible process!</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/django-tinymce-deploy-in-detail/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Link , library And Relocate Address</title>
		<link>http://www.wzxue.com/link-library-and-relocate-address/</link>
		<comments>http://www.wzxue.com/link-library-and-relocate-address/#comments</comments>
		<pubDate>Wed, 07 Mar 2012 08:05:30 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[编译/链接/装载/解释]]></category>
		<category><![CDATA[link]]></category>
		<category><![CDATA[pic]]></category>
		<category><![CDATA[relocate address]]></category>
		<category><![CDATA[shared library]]></category>
		<category><![CDATA[共享库]]></category>
		<category><![CDATA[链接]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=475</guid>
		<description><![CDATA[Static Library ld builds a symbol table and relocates the addresses of variables defined in other modules. Because a static library is combined with the main program, the executable gets absolute addresses at link time. Shared Library A shared library can be linked in two ways. In the first, addresses are fixed when the library is loaded into virtual memory; this is called load-time relocation. The other is via [...]]]></description>
			<content:encoded><![CDATA[<h3>Static Library</h3>
<p>ld builds a symbol table and relocates the addresses of variables defined in other modules. Because a static library is combined with the main program, the executable gets absolute addresses at link time.</p>
<h3>Shared Library</h3>
<p>A shared library can be linked in two ways. In the first, addresses are fixed when the library is loaded into virtual memory; this is called load-time relocation. In the second, PIC (position-independent code) defers resolving an address until it is accessed.</p>
<ol>
<li>
<p>First, we will analyze the former way. When the executable is loaded into virtual memory, it gets its addresses from the shared library. In other words, once a module&#8217;s absolute address in memory is determined, the system relocates all absolute addresses.<br />
But there are some issues. If the shared library contains instructions that depend on addresses, for example when code in the shared library accesses a variable in another library, those addresses may differ from program to program. The shared library&#8217;s code section would then have to differ for each program. This defeats the purpose of a shared library, because the system must keep a separate copy of the code for every program that uses it.</p>
</li>
<li>
<p>PIC technology separates address-independent code from address-dependent code. When a module is compiled, it gets a .got section, and each name that refers to another module is given a corresponding entry in the .got table.<br />
After compilation, the address-dependent instructions in the module fetch their addresses from the .got table. When the module is loaded into memory, the .got entries are filled with absolute addresses in the same way as in load-time relocation. Even for a shared library, every program owns its own copy of the data section, and the .got section lives in the data section, so different programs own different .got sections and the addresses used by the shared code never conflict.<br />
We can conclude that when a module accesses its own functions or variables, it can obtain the address directly. When it accesses a function or variable in a shared library, it gets the address from the .got section: the instruction looks up the address in the .got section and then reads the value indirectly.</p>
</li>
</ol>
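<p>The indirection through the .got section can be illustrated with a toy model. This is not how a real loader is implemented; the class, the symbol names, and the addresses below are all hypothetical. It only shows why the same code can be shared when each process has its own table.</p>

```python
# Shared code refers to external symbols only by name through the GOT,
# so the same code can be used unchanged by every process.
SHARED_CODE = [("load_via_got", "counter"), ("load_via_got", "limit")]


class Process:
    """A process with a private GOT and private memory (toy model)."""

    def __init__(self, got, memory):
        self.got = got        # symbol name -> absolute address
        self.memory = memory  # absolute address -> value

    def run(self, code):
        results = []
        for op, symbol in code:
            if op == "load_via_got":
                address = self.got[symbol]            # step 1: look up the GOT
                results.append(self.memory[address])  # step 2: indirect load
        return results


# Hypothetical addresses: the loader placed the data differently per process.
p1 = Process({"counter": 0x1000, "limit": 0x1008}, {0x1000: 7, 0x1008: 99})
p2 = Process({"counter": 0x5000, "limit": 0x5010}, {0x5000: 42, 0x5010: 3})
out1 = p1.run(SHARED_CODE)
out2 = p2.run(SHARED_CODE)
```

<p>Both processes run the identical SHARED_CODE list, yet each reads its own values, because only the per-process table holds absolute addresses.</p>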
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/link-library-and-relocate-address/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Winter Vacation Plan Review</title>
		<link>http://www.wzxue.com/winter-vacation-plan-review/</link>
		<comments>http://www.wzxue.com/winter-vacation-plan-review/#comments</comments>
		<pubDate>Sun, 04 Mar 2012 13:06:25 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[IT杂谈]]></category>
		<category><![CDATA[cocoa]]></category>
		<category><![CDATA[objective-c]]></category>
		<category><![CDATA[weibo sdk]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=472</guid>
		<description><![CDATA[Winter vacation goals Finish learning Objective-C, Cocoa, and Mac OS X development skills; Read three chapters of &#60;程序员的自我修养&#62;; Read two chapters of &#60;Unix Network Programming&#62;; Begin a Weibo client for OS X based on OAuth 2.0; But some goals were not achieved: Read &#60;Algorithm: C &#62; Sina Weibo OAuth 2.0 Framework I looked through Open Sina but failed to find the [...]]]></description>
			<content:encoded><![CDATA[<h2>Winter vacation goals</h2>
<ol>
<li>Finish learning Objective-C, Cocoa, and Mac OS X development skills;</li>
<li>Read three chapters of &lt;程序员的自我修养&gt; (A Programmer&#8217;s Self-Cultivation);</li>
<li>Read two chapters of &lt;Unix Network Programming&gt;;</li>
<li>Begin a Weibo client for OS X based on OAuth 2.0;</li>
</ol>
<p>But some goals were not achieved:</p>
<ol>
<li>Read &lt;Algorithm: C &gt;</li>
</ol>
<h2>Sina Weibo OAuth 2.0 Framework</h2>
<p>I looked through <a href="http://open.sina.com">Open Sina</a> but failed to find a Weibo SDK for Mac OS X. Then I found the OAConsumer library for OAuth 1.0. After a few hours of work, my attempt to convert the project from OAuth 1.0 to OAuth 2.0 failed.</p>
<p>Then I decided to port the Weibo SDK for iOS to the OS X platform.<br />
Next, I changed the iOS-specific code and replaced it with the corresponding OS X code.</p>
<p>Now I have finished most of the basics of the Weibo client.</p>
<h2>Unix Network Programming indicator</h2>
<p>Recently I began learning the socket API in C.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/winter-vacation-plan-review/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cocoa primer &#8212; Three questions about Fastenumeration, dynamicly method, change mutated collection when iterating</title>
		<link>http://www.wzxue.com/cocoa-primer-three-critical-questions-about-fastenumeration-dynamicly-add-class-method-change-mutated-collection-when-iterating/</link>
		<comments>http://www.wzxue.com/cocoa-primer-three-critical-questions-about-fastenumeration-dynamicly-add-class-method-change-mutated-collection-when-iterating/#comments</comments>
		<pubDate>Wed, 11 Jan 2012 12:46:03 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[objective-c]]></category>
		<category><![CDATA[cocoa]]></category>
		<category><![CDATA[iteration]]></category>
		<category><![CDATA[mutated collection]]></category>
		<category><![CDATA[NSFastEnumeration]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=466</guid>
		<description><![CDATA[Today, I will try to write posts in English! Question 1: The requirement that collections not mutate while you are iterating over them can be inconvenient at times, especially if you are doing a “scan the collection and pull out stuff that shouldn’t be there” operation. What would it take to make a classic array [...]]]></description>
			<content:encoded><![CDATA[<p>Today, I will try to write posts in English!</p>
<h3>Question 1:</h3>
<blockquote><p>
  The requirement that collections not mutate while you are iterating over them can be inconvenient at times, especially if you are doing a “scan the collection and pull out stuff that shouldn’t be there” operation. What would it take to make a classic array enumerator support having the collection mutated during iteration? What impact would this have on fast enumeration?
</p></blockquote>
<p>When I tried to mutate the collection while iterating over it, the program raised an exception: *** Terminating app due to uncaught exception &#8216;NSGenericException&#8217;, reason: &#8216;*** Collection &lt;__NSCFDictionary: 0x100a14fd0&gt; was mutated while being enumerated.&#8217;</p>
<pre><code>#import &lt;Foundation/Foundation.h&gt;

int main(int argc, const char *argv[])
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSMutableDictionary *myDictionary = [[ NSMutableDictionary alloc] init];
    [myDictionary setObject:@"Keith" forKey:@"name"];
    [myDictionary setObject:@"hacker" forKey:@"subject"];
    [myDictionary setObject:@"4/9/2002" forKey:@"date"];

    NSMutableDictionary *selectionIndex = [NSMutableDictionary dictionaryWithDictionary:myDictionary];
    NSString *whatever = @"999999";
    id keys;
    NSEnumerator *keyEnum = [selectionIndex keyEnumerator];

    while (keys = [keyEnum nextObject])
    {
        [selectionIndex setObject:whatever forKey:keys];
    }
    [pool drain];
    return 0;
}
</code></pre>
<p>So I changed the way I mutate the collection: I iterate over a snapshot of its keys instead of the collection itself.</p>
<pre><code>#import &lt;Foundation/Foundation.h&gt;

int main(int argc, const char *argv[])
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSMutableDictionary *myDictionary = [[ NSMutableDictionary alloc] init];
    [myDictionary setObject:@"Keith" forKey:@"name"];
    [myDictionary setObject:@"hacker" forKey:@"subject"];
    [myDictionary setObject:@"4/9/2002" forKey:@"date"];

    NSMutableDictionary *selectionIndex = [NSMutableDictionary dictionaryWithDictionary:myDictionary];
    NSString *whatever = @"999999";

    for (NSString *temp in selectionIndex.allKeys)
    {
        [selectionIndex setObject:whatever forKey:temp];
    }

    for (NSString *temp in selectionIndex.allValues)
    {
        NSLog(@"%@", temp);
    }
    [pool drain];
    return 0;
}
</code></pre>
<p>In summary, we cannot modify a mutable collection while enumerating it directly, but we can enumerate a copy of its keys (allKeys) and modify the collection inside that loop.</p>
<h3>Question 2:</h3>
<blockquote><p>
  How do you add a property&#8217;s accessor methods at runtime?
</p></blockquote>
<pre><code>#import &lt;Foundation/Foundation.h&gt;
#import &lt;objc/runtime.h&gt;

@interface bird : NSObject
{
    int height;
    float n;
}

@property float n;
@property int height;

@end

float dynamicN(id self, SEL _cmd)
{
    NSString *methodName = NSStringFromSelector(_cmd);
    NSLog(@"%@,%@", methodName, [self description]);
    return ((bird *)self)-&gt;n;
}

void dynamicSetN(id self, SEL _cmd, float sname)
{
    printf("setName start;\n");
    ((bird *)self)-&gt;n = sname;
}

@implementation bird
@synthesize height = height;
@dynamic n;

- (id)init
{
    if (self = [super init]) {
        n = 1.0;
        height = 3;
    }
    return self;
}

+ (BOOL) resolveInstanceMethod:(SEL)aSEL
{
    if (aSEL == @selector(n)) {
        class_addMethod([self class], aSEL, (IMP) dynamicN, "f@:");
        return YES;
    }
    if (aSEL == @selector(setN:)) {
        class_addMethod([self class], aSEL, (IMP) dynamicSetN, "v@:f");
        return YES;
    }
    return [super resolveInstanceMethod:aSEL];
}
@end

int main(int argc, const char *argv[])
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    bird *aBird = [[bird alloc] init];
    aBird.n = 9;
    printf("\n%f\n,%d", aBird.n, aBird.height);
    [pool drain];
    return 0;
}
</code></pre>
<p>I define a class called bird with two properties, &#8216;height&#8217; and &#8216;n&#8217;. I synthesize only height and mark n as dynamic, then override resolveInstanceMethod: to supply the accessors for n during dynamic method lookup. When code first sends n or setN: to a bird instance, the runtime finds no implementation, calls resolveInstanceMethod:, and then looks the method up again.</p>
<p>One critical point: a C-style function cannot access an instance variable through the dynamic type id; it must cast first, as in &#8216;((bird *)self)-&gt;n&#8217;.</p>
<p><a href="http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/ObjCRuntimeGuide/Articles/ocrtDynamicResolution.html">class_addMethod</a> is a low-level runtime API that adds a method to a specific class. Its four parameters are, in order, the class, the selector (SEL), the function address (IMP), and the type-encoding string.</p>
<h4>Reference: <a href="http://developer.apple.com/library/mac/#documentation/Cocoa/Reference/ObjCRuntimeRef/Reference/reference.html">Objective-C Runtime Reference</a></h4>
<h3>Question 3</h3>
<blockquote><p>
  A class must conform to the NSFastEnumeration protocol if its objects are to be used with the for...in language construct.
</p></blockquote>
<pre><code>#import &lt;Foundation/Foundation.h&gt;

@interface FibonacciSequence : NSObject &lt;NSFastEnumeration&gt;
{
}
@end

@implementation FibonacciSequence

- (NSUInteger) countByEnumeratingWithState: (NSFastEnumerationState *) state
                                   objects: (id *) stackbuf
                                     count: (NSUInteger) len
{
    printf("%lu\n", (unsigned long)len); // NSUInteger is not an int
    assert(len &gt;= 2); // because we pre-populate two values on first-call
    id *scan, *stop;
    if (state-&gt;state == 0) {
        // first call, do initializations
        state-&gt;state = 1;
        state-&gt;mutationsPtr = (unsigned long *)self; // not applicable
        state-&gt;itemsPtr = stackbuf; // seed with correct values
        state-&gt;extra[0] = 1;
        state-&gt;extra[1] = 1;
        // fill in the first two values
        state-&gt;itemsPtr[0] = [NSNumber numberWithInt: state-&gt;extra[0]];
        state-&gt;itemsPtr[1] = [NSNumber numberWithInt: state-&gt;extra[1]];
        // tweak the scanning pointers because we've already filled
        // in the first two slots.
        scan = &amp;state-&gt;itemsPtr[2];
        stop = &amp;state-&gt;itemsPtr[0] + len;
    }
    else {
        // Otherwise we're in the Pink, and do normal processing for
        // all of the itemPtrs.
        scan = &amp;state-&gt;itemsPtr[0];
        stop = &amp;state-&gt;itemsPtr[0] + len;
    }
    while (scan &lt; stop) {
        // Do the Fibonacci algorithm.
        int value = state-&gt;extra[0] + state-&gt;extra[1];
        state-&gt;extra[0] = state-&gt;extra[1];
        state-&gt;extra[1] = value;
        // populate the fast enum item pointer
        *scan = [NSNumber numberWithUnsignedLong: value];
        // and then scoot over to the next value
        scan++;
    }
    // Always fill up their stack buffer.
    return len;
}

@end

int main(int argc, const char *argv[])
{
    NSAutoreleasePool *pool =  [[NSAutoreleasePool alloc] init];
    FibonacciSequence *fibby = [[FibonacciSequence alloc] init];
    int boredom = 0;
    for (NSNumber *number in fibby) {
        // NSLog (@"%@", number);
        if (boredom++ &gt; 14) {
            break;
        }
    }
    [pool drain];
    return 0;
}
</code></pre>
<h4>countByEnumeratingWithState:objects:count:</h4>
<p>Returns the number of objects made available in this batch (here, the number stored in |stackbuf|).<br />
<strong>Parameters</strong></p>
<p>state:<br />
    Context information that holds the iteration state across repeated invocations of the method.</p>
<p>stackbuf:<br />
    A C array, supplied by the caller, into which the method may place the objects to iterate over. Using it is not <strong>required</strong>; you can also store the computed values in another array. The advantage of |stackbuf| is that its maximum capacity is known (|len|).</p>
<p>len:<br />
    The maximum number of objects that fit in |stackbuf|. |countByEnumeratingWithState:objects:count:| uses |len| to decide how many values to compute.</p>
<p><em>|NSFastEnumerationState|</em><br />
    typedef struct {<br />
        unsigned long state;<br />
        id *itemsPtr;<br />
        unsigned long *mutationsPtr;<br />
        unsigned long extra[5];<br />
    } NSFastEnumerationState;</p>
<p>This struct persists for the whole enumeration, so it acts as a container that carries state between successive calls to the method.</p>
<p><strong>Fields</strong></p>
<p>state:<br />
    Arbitrary state information used by the iterator. Typically this is set to 0 at the beginning of the iteration.<br />
itemsPtr:</p>
<pre><code>A C array of objects. You can point |itemsPtr| at |stackbuf| or at another structure. For example:
    for (NSNumber *number in fibby) {
        // NSLog (@"%@", number);
        if (boredom++ &gt; 14) {
            break;
        }
    }
During the for loop, |number| takes its values from |itemsPtr| one by one, as many as the return value of |countByEnumeratingWithState:objects:count:| indicates.
</code></pre>
<p>extra:<br />
    A C array that you can use to hold extra state between calls (here, the last two Fibonacci numbers).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/cocoa-primer-three-critical-questions-about-fastenumeration-dynamicly-add-class-method-change-mutated-collection-when-iterating/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CS 61A Fall 2011 SICP</title>
		<link>http://www.wzxue.com/cs-61a-fall-2011-sicp/</link>
		<comments>http://www.wzxue.com/cs-61a-fall-2011-sicp/#comments</comments>
		<pubDate>Sat, 10 Dec 2011 16:40:37 +0000</pubDate>
		<dc:creator>麦子麦</dc:creator>
				<category><![CDATA[随笔]]></category>
		<category><![CDATA[berkeley]]></category>
		<category><![CDATA[SICP]]></category>

		<guid isPermaLink="false">http://www.wzxue.com/?p=461</guid>
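		<description><![CDATA[SICP I have finally finished the CS 61A Fall 2011 SICP course. I started in October and kept going until today, roughly in step with the course. I completed all the projects and homework and read the entire textbook, and I learned a great deal. Link http://wla.berkeley.edu/~cs61a/fa11/61a-python/content/www/index.html Takeaways This is the first time the course has been taught in Python instead of Scheme. The benefit of Python is that programs are quicker to understand and easy to try out in practice. From functional programming to object orientation to MapReduce, from implementing lists and an object system to a simple interpreter, from network programming to multithreading, the breadth far exceeds that of university courses in China.]]></description>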
			<content:encoded><![CDATA[<h2>SICP</h2>
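<p>I have finally finished the CS 61A Fall 2011 SICP course. I started in October and kept going until today, roughly in step with the course. I completed all the projects and homework and read the entire textbook, and I learned a great deal.</p>
<h2>Link</h2>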
<p>http://wla.berkeley.edu/~cs61a/fa11/61a-python/content/www/index.html</p>
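<h2>Takeaways</h2>
<p>This is the first time the course has been taught in Python instead of Scheme. The benefit of Python is that programs are quicker to understand and easy to try out in practice.<br />
From functional programming to object orientation to MapReduce, from implementing lists and an object system to a simple interpreter, from network programming to multithreading, the breadth far exceeds that of university courses in China.</p>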
]]></content:encoded>
			<wfw:commentRss>http://www.wzxue.com/cs-61a-fall-2011-sicp/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

