We all use hashmaps, trees, sets, vectors, queues/stacks/heaps, linked lists, and graphs day in and day out, so it is tempting to think that is all there is to data structures. For the new year, let me share some higher-grade ones purpose-built for big data: probabilistic data structures, also known as approximation algorithms or online algorithms. They will make people take notice whether you are doing system design, chatting with your boss and colleagues, or interviewing for a job. Today I will teach you five moves: HyperLogLog for cardinality estimation, the Bloom filter for membership testing, MinHash for similarity detection, count-min sketch for frequency counting, and t-digest for streaming statistics. Digest these five and you are ready to roam the rivers and lakes.

Set Cardinality -- HyperLogLog

This is cardinality estimation, a very common need: for instance, how many unique IPs visited my site during some period? For such count-unique problems the obvious approach is to build a hash set, put everything in, and return its size at the end. But what if the data volume is so large the hash set no longer fits in memory? Then you throw machines at it with memcached or Redis, spill to disk when memory runs out; in short, brute force.
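
The brute-force baseline is easy to sketch (a minimal illustration of my own; `events` stands in for a real access log):

```python
# Exact distinct counting: memory grows linearly with the number of
# distinct items, which is exactly the cost HyperLogLog trades away.
def exact_unique_count(items):
    seen = set()  # every distinct item stays in memory
    for item in items:
        seen.add(item)
    return len(seen)

events = ["1.2.3.4", "5.6.7.8", "1.2.3.4", "9.9.9.9"]
print(exact_unique_count(events))  # 3
```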

Here you can flip the trade-off: sacrifice a little accuracy in exchange for memory savings. That is HyperLogLog! The example below uses a 1% error bound to get an approximate count, with very high throughput and very low memory use, via the implementation at https://github.com/svpcom/hyperloglog:

 

#!/usr/bin/env python3

import re

jabber_text = """
`Twas brillig, and the slithy toves 
      Did gyre and gimble in the wabe: 
All mimsy were the borogoves, 
      And the mome raths outgrabe. 

"Beware the Jabberwock, my son! 
      The jaws that bite, the claws that catch! 
Beware the Jubjub bird, and shun 
      The frumious Bandersnatch!" 

He took his vorpal sword in hand; 
      Long time the manxome foe he sought- 
So rested he by the Tumtum tree 
      And stood awhile in thought. 

And, as in uffish thought he stood, 
      The Jabberwock, with eyes of flame, 
Came whiffling through the tulgey wood, 
      And burbled as it came! 

One, two! One, two! And through and through 
      The vorpal blade went snicker-snack! 
He left it dead, and with its head 
      He went galumphing back. 

"And hast thou slain the Jabberwock? 
      Come to my arms, my beamish boy! 
O frabjous day! Callooh! Callay!" 
      He chortled in his joy. 

`Twas brillig, and the slithy toves 
      Did gyre and gimble in the wabe: 
All mimsy were the borogoves, 
      And the mome raths outgrabe.
"""

packer_text = """
My answers are inadequate
To those demanding day and date
And ever set a tiny shock
Through strangers asking what's o'clock;
Whose days are spent in whittling rhyme-
What's time to her, or she to Time?
"""

def clean_words(text):
    # Return a list (not a lazy filter object) so it can be iterated repeatedly.
    return [w for w in re.sub("[^A-Za-z]", " ", text).split(" ") if w]

jabber_words = clean_words(jabber_text.lower())
# print(jabber_words)

packer_words = clean_words(packer_text.lower())
# print(packer_words)

jabber_uniq = sorted(set(jabber_words))
# print(jabber_uniq)

import hyperloglog

hll = hyperloglog.HyperLogLog(0.01)

for word in jabber_words:
    hll.add(word)

print("prob count %d, true count %d" % (len(hll), len(jabber_uniq)))
print("observed error rate %0.2f" % (abs(len(hll) - len(jabber_uniq)) / float(len(jabber_uniq))))

Output:

prob count 90, true count 91
observed error rate 0.01
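
For intuition about why this works at all: HyperLogLog hashes every item and remembers the longest run of leading zero bits it has seen; a run of length r hints at roughly 2^r distinct items. Here is a single-register toy of that idea (my own sketch, not the library's code; the real algorithm averages many such registers with a harmonic mean to reach the ~1% error above):

```python
import hashlib

def first_one_position(h, bits=64):
    # 1-based position of the first 1-bit from the left of a `bits`-wide hash.
    pos = 1
    while pos <= bits and not (h >> (bits - pos)) & 1:
        pos += 1
    return pos

def crude_cardinality(items):
    # Track only the most extreme hash pattern ever seen: O(1) memory.
    max_pos = 0
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        max_pos = max(max_pos, first_one_position(h))
    return 2 ** max_pos  # right order of magnitude, huge variance
```

A single register like this is far too noisy to use alone; the bucketing and bias correction are where the real algorithm earns its keep.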

Set Membership -- Bloom Filter

The Bloom filter is the one structure here you should absolutely master; if you read my earlier post on Cassandra it will sound familiar. On Cassandra's read path, if the memtable misses, the Bloom filter is consulted next: if it says the key is not there, the read stops right there, because it really is not there. If it says the key may be there, the read continues on to the key cache and so on. That "no means no" property is the killer feature. The other half of the deal is that "yes" only means "probably yes", but the false-positive rate is tunable, here 0.001. Below is an example using https://github.com/jaybaird/python-bloomfilter (I will not repeat the code from above):

from pybloom import BloomFilter

bf = BloomFilter(capacity=1000, error_rate=0.001)

for word in packer_words:
    bf.add(word)

intersect = set([])

for word in jabber_words:
    if word in bf:
        intersect.add(word)

print(intersect)

Output:

{'and', 'in', 'o', 'to', 'through', 'time', 'my', 'day'}
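
For what pybloom is doing underneath: a Bloom filter is a bit array plus k hash functions; `add` sets k bits, and a lookup answers "maybe" only if all k bits are set. A toy version of my own (not pybloom's actual code; carving the k positions out of one SHA-256 digest is a simplification):

```python
import hashlib

class ToyBloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Simplification: derive k "hash functions" by slicing one digest.
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # All k bits set -> "maybe"; any bit clear -> definitely absent.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = ToyBloomFilter()
for word in ["time", "day", "rhyme"]:
    bf.add(word)
print("time" in bf)  # True -- a Bloom filter never gives false negatives
```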

Set Similarity -- MinHash

Set similarity asks: given two documents, how similar are they? That sounds useful for catching plagiarism and the like. Let's look at it through the implementation at https://github.com/ekzhu/datasketch:

from datasketch import MinHash

def mh_digest(data):
    m = MinHash(num_perm=512)  # number of hash permutations

    for d in data:
        m.update(d.encode("utf8"))

    return m

m1 = mh_digest(set(jabber_words))
m2 = mh_digest(set(packer_words))

print("Jaccard similarity %f" % m1.jaccard(m2), "estimated")

s1 = set(jabber_words)
s2 = set(packer_words)
actual_jaccard = float(len(s1.intersection(s2))) / float(len(s1.union(s2)))

print("Jaccard similarity %f" % actual_jaccard, "actual")

Output:

Jaccard similarity 0.060547 estimated
Jaccard similarity 0.069565 actual
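
The math behind those numbers: for a random hash function, the probability that two sets share the same minimum hash value is exactly their Jaccard similarity, so the fraction of matching minima across many hash functions estimates it. A self-contained toy (my own sketch, not datasketch's implementation):

```python
import hashlib

def toy_signature(items, num_perm=128):
    # One "permutation" per seed: keep only the minimum hash value seen.
    return [min(int.from_bytes(hashlib.md5(("%d:%s" % (seed, w)).encode())
                               .digest()[:8], "big") for w in items)
            for seed in range(num_perm)]

def toy_jaccard(sig_a, sig_b):
    # Fraction of matching minima estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

s1 = {"the", "quick", "brown", "fox"}
s2 = {"the", "quick", "red", "fox"}
estimated = toy_jaccard(toy_signature(s1), toy_signature(s2))
exact = len(s1 & s2) / len(s1 | s2)  # 3 / 5 = 0.6
```

The signature is fixed-size regardless of how large the sets are, which is the whole point: you can compare millions of documents pairwise without keeping their full word sets around.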

Frequency Summaries -- count-min sketch

Frequency estimation is typically used for leaderboards: who is No. 1 right now, who is No. 2, who is No. 3. We care about the events that occur most often; we do not really care who ranks n-th versus (n+1)-th, and in the same spirit an error like scoring 1000 versus 1001 does not matter. Another application is detecting the language a text is written in: pull out the most frequent words and you can tell which language the article uses. Let's make it concrete with https://github.com/IsaacHaze/countminsketch:

from collections import Counter
from yacms import CountMinSketch

counts = Counter()

# width of 200 counters per row, 3 hash functions (rows)
cms = CountMinSketch(200, 3)

for word in jabber_words:
    counts[word] += 1
    cms.update(word, 1)

for word in ["the", "he", "and", "that"]:
    print("word %s counts %d" % (word, cms.estimate(word)))

for e in counts:
    if counts[e] != cms.estimate(e):
        print("missed %s counter: %d, sketch: %d" % (e, counts[e], cms.estimate(e)))

Output:

word the counts 19
word he counts 7
word and counts 14
word that counts 2
missed two counter: 2, sketch: 3
missed chortled counter: 1, sketch: 2
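
The structure producing those numbers fits in a few lines: a depth-by-width table of counters, where update adds to one counter per row and estimate takes the minimum across rows. Collisions can only inflate counters, which is why the sketch overestimates "two" and "chortled" above but never undercounts. A toy of my own, not yacms's code:

```python
import hashlib

class ToyCountMinSketch:
    def __init__(self, width=200, depth=3):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # Simplification: one column per row, carved out of a single digest.
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * row:4 * row + 4], "big") % self.width
                for row in range(self.depth)]

    def update(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Min across rows: the least-collided counter is the tightest bound.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = ToyCountMinSketch()
for word in ["the", "the", "the", "he"]:
    cms.update(word)
print(cms.estimate("the"))  # at least 3; more only if every row collides
```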

Streaming Quantiles -- tdigest

Streaming statistics: now this one is serious. Suppose you have a huge stream of data that never stops, say transaction data, and you are asked to detect which transactions might be credit-card fraud; then knowing the average transaction amount and the maximum transaction amount matters a lot. Having to produce those statistics in real time makes the computation much harder, and that is where t-digest makes its grand entrance. Below we see how to get the bottom 5%, the median, and the top 5% in real time, using https://github.com/trademob/t-digest:

from tdigest import TDigest
import random

td = TDigest()

for x in range(0, 1000):
    td.add(random.random(), 1)

for q in [0.05, 0.5, 0.95]:
    print("%f @ %f" % (q, td.quantile(q)))

Output:

0.050000 @ 0.052331
0.500000 @ 0.491775
0.950000 @ 0.955989
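
A quick way to sanity-check estimates like these is to compare them with the exact quantiles of the same distribution, computed the memory-hungry way (plain stdlib, independent of the tdigest package; the seed is arbitrary, chosen only to make the run reproducible):

```python
import random

random.seed(7)
sample = sorted(random.random() for _ in range(1000))  # O(n) memory

def exact_quantile(sorted_values, q):
    # Nearest-rank quantile over the fully sorted sample.
    idx = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

for q in [0.05, 0.5, 0.95]:
    print("%f @ %f" % (q, exact_quantile(sample, q)))
```

For uniform data on [0, 1] the exact values land near 0.05, 0.5, and 0.95, just like the t-digest estimates; the difference is that t-digest keeps only a small, bounded set of centroids instead of the whole sorted sample.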

That is it for today's tour of probabilistic data structures. If you are interested, dig into the implementations linked above: the line counts are all small, but the mathematical tricks hiding behind them are well worth savoring.

Published: 2019-04-20 00:41:00
