[Repost] Machine learning fundamentals: a summary of basic concepts

Posted on February 15, 2015 | Filed under Data Mining, Overview

1. Basic concepts:

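(1) 10-fold cross-validation: a common method for estimating how accurate an algorithm is. The dataset is split into 10 parts. In turn, 9 parts are used as training data and the remaining 1 part as test data. Each run yields an accuracy (or error rate), and the mean of the 10 accuracies (or error rates) is taken as the estimate of the algorithm's accuracy. In practice the whole 10-fold procedure is usually repeated several times, and the results are averaged again to obtain a more stable estimate.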
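The procedure above can be sketched in a few lines of C++. This is only an illustrative sketch: `Evaluate` is a hypothetical callback (not from the original post) that trains on one index set, tests on the other, and returns an accuracy, and the samples are assumed to have been shuffled beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical callback: trains on `train`, tests on `test`,
// and returns the accuracy in [0, 1] for that run.
using Evaluate = std::function<double(const std::vector<std::size_t>& train,
                                      const std::vector<std::size_t>& test)>;

// Splits sample indices 0..n-1 into k folds, holds each fold out once,
// and returns the mean of the k accuracies.
double k_fold_accuracy(std::size_t n, std::size_t k, const Evaluate& evaluate)
{
    assert(k >= 2 && k <= n);
    double sum = 0.0;
    for (std::size_t fold = 0; fold < k; ++fold)
    {
        std::vector<std::size_t> train, test;
        for (std::size_t i = 0; i < n; ++i)
        {
            // Sample i belongs to fold (i % k); that fold is the test set once.
            (i % k == fold ? test : train).push_back(i);
        }
        sum += evaluate(train, test);
    }
    return sum / static_cast<double>(k);
}
```

Calling `k_fold_accuracy(n, 10, evaluate)` corresponds to one round of 10-fold cross-validation. Repeating it with differently shuffled data and averaging the results gives the repeated variant mentioned above.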

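(2) Maximum likelihood estimation: an application of probability theory in statistics, and one of the methods of parameter estimation. Suppose a random sample is known to follow a certain probability distribution, but the specific parameters are unknown. Parameter estimation runs a number of trials, observes the outcomes, and uses them to infer approximate parameter values. Maximum likelihood estimation rests on the following idea: if a particular parameter value makes the observed sample more probable than any other value does, there is no reason to prefer a value under which the sample would be less likely, so that value is simply taken as the estimate of the true parameter.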
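As a small worked example of this idea (the coin-flip setting and the symbols $n$, $k$, $p$ are illustrative assumptions, not from the original post), suppose $n$ independent trials produce $k$ successes, each with unknown success probability $p$:

```latex
L(p) = p^{k}(1-p)^{n-k}, \qquad
\log L(p) = k\log p + (n-k)\log(1-p)

\frac{d}{dp}\log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{k}{n}
```

So the value of $p$ that makes the observed sample most probable is just the observed success frequency.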

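(3) Entropy: in information theory, entropy is a measure of uncertainty. Shannon, the founder of information theory, proposed in his work "A Mathematical Theory of Communication" a measure of information built on a probabilistic model. He defined information as "that which removes uncertainty". Entropy is defined as the expected value of information.

P.S.: Entropy refers to the degree of disorder of a system. It has important applications in control theory, probability theory, number theory, astrophysics, the life sciences, and other fields, each of which has derived its own more specific definition; it is an important quantity in all of them. Entropy was introduced by Rudolf Clausius and applied in thermodynamics. Later, Claude Elwood Shannon was the first to bring the concept of entropy into information theory.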
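The definition of entropy as the expected value of information can be made concrete with a short sketch. It assumes the usual formulation for a discrete distribution, where one outcome with probability p carries -log2(p) bits of information; the function name is illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Shannon entropy in bits of a discrete distribution:
// H = -sum_i p_i * log2(p_i), i.e. the expected information per outcome.
double shannon_entropy(const std::vector<double>& probs)
{
    double h = 0.0;
    for (double p : probs)
    {
        assert(p >= 0.0 && p <= 1.0);
        if (p > 0.0)  // outcomes with p == 0 contribute nothing
        {
            h -= p * std::log2(p);
        }
    }
    return h;
}
```

A fair coin gives 1 bit, a certain outcome gives 0 bits, and four equally likely outcomes give 2 bits, which matches the intuition that more uncertainty means higher entropy.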


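[Repost] Why multithreaded programs on Linux consume so much virtual memory

Posted on February 14, 2015 | Filed under System Optimization

Our game recently went live, and while optimizing server memory I ran into a very curious problem. Our authentication server (AuthServer) talks to third-party channel SDKs (login and payment). Because it uses curl in blocking mode, it runs 128 threads. Oddly, right after startup it occupies 2.3 GB of virtual memory, then grows by 64 MB each time it handles messages, and stops growing once it reaches 4.4 GB. Since we preallocate memory and never allocate large blocks inside the threads, where was all this memory coming from? It was baffling.

1. Exploration

First I ruled out a memory leak: leaking exactly 64 MB every time would be too much of a coincidence. To prove the point, I started with valgrind.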

valgrind --leak-check=full --track-fds=yes --log-file=./AuthServer.vlog ./AuthServer &

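Then I started the test and ran it until memory stopped growing. Sure enough, valgrind reported no memory leaks. I repeated the experiment many times with the same result.

After valgrind turned up nothing, I began to suspect that the program was using mmap or similar calls internally, so I used strace to trace system calls such as mmap and brk: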

strace -f -e"brk,mmap,munmap" -p $(pidof AuthServer)

The output was:

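````
[pid 19343] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f53c8ca9000
[pid 19343] munmap(0x7f53c8ca9000, 53833728) = 0
[pid 19343] munmap(0x7f53d0000000, 13275136) = 0
[pid 19343] mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f53d04a8000
Process 19495 attached
````

I went through the trace file and did not find a large number of mmap calls, and the growth caused by brk was small as well. I felt completely lost. I then wondered whether the file cache was eating the virtual memory, so I commented out all the code that reads or writes logs. Virtual memory still grew, which ruled that out.

## 2. A flash of insight

Later I started testing with fewer threads, and by chance noticed something very strange: if the process creates a thread and that thread allocates a tiny 1 KB block, the virtual memory of the whole process immediately grows by 64 MB, and further allocations cause no more growth. The test code is as follows:

````
#include <atomic>
#include <cstdio>
#include <iostream>
#include <pthread.h>
#include <unistd.h>

// Set by main() to ask the worker thread for one more allocation.
std::atomic<bool> start(false);

void *thread_run(void *)
{
    while (true)
    {
        if (start)
        {
            std::cout << "Thread malloc" << std::endl;
            // Deliberately never freed: we only want to watch the address space grow.
            char *buf = new char[1024];
            (void)buf;
            start = false;
        }
        sleep(1);
    }
    return nullptr;
}

int main()
{
    pthread_t th;

    // Wait for input first so the baseline memory usage can be inspected.
    getchar();
    getchar();
    pthread_create(&th, nullptr, thread_run, nullptr);

    // Every further character read triggers one 1 KB allocation in the thread.
    while (getchar() != EOF)
    {
        start = true;
    }

    return 0;
}
````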

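The result is shown in the figure below. At first the process uses 14 MB of virtual memory. After entering 0 to create the child thread, it reaches 23 MB; the roughly 10 MB added is the thread stack (the stack size can be viewed and set with ulimit -s). The first time 1 is entered, the program allocates 1 KB and the whole process grows by 64 MB of virtual memory. Entering 2 and 3 afterwards allocates another 1 KB each time, and memory no longer changes.

This result thrilled me. I had previously studied Google's Tcmalloc, in which every thread has its own cache to avoid contention in multithreaded allocation, and I guessed that newer versions of glibc had adopted the same trick. So I ran pmap $(pidof main) to look at the memory map, as follows:

Note the line with 65404. All signs suggest that this line, together with the one above it (132 here), is the added 64 MB (65404 KB + 132 KB = 65536 KB = 64 MB). When I later increased the number of threads, a matching number of new 65404 blocks appeared.

3. Getting to the bottom of it

After some googling and code reading, I finally learned that glibc's malloc was behind this. glibc versions 2.11 and later all behave this way. From the official Red Hat documentation: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.0_Release_Notes/compiler.html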

Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including… An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores.

The developer, Ulrich Drepper, has a much deeper explanation on his blog: http://udrepper.livejournal.com/20948.html

Before, malloc tried to emulate a per-core memory pool. Every time when contention for all existing memory pools was detected a new pool is created. Threads stay with the last used pool if possible… This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple core/sockets happily use malloc without contention memory from the same pool is used by different cores/on different sockets. This can lead to false sharing and definitely additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.

The changes which are in glibc now create per-thread memory pools. This can eliminate false sharing in most cases. The meta data is usually accessed only in one thread (which hopefully doesn’t get migrated off its assigned core). To prevent the memory handling from blowing up the address space use too much the number of memory pools is capped. By default we create up to two memory pools per core on 32-bit machines and up to eight memory per core on 64-bit machines. The code delays testing for the number of cores (which is not cheap, we have to read /proc/stat) until there are already two or eight memory pools allocated, respectively.

While these changes might increase the number of memory pools which are created (and thus increase the address space they use) the number can be controlled. Because using the old mechanism there could be a new pool being created whenever there are collisions the total number could in theory be higher. Unlikely but true, so the new mechanism is more predictable.

… Memory use is not that much of a premium anymore and most of the memory pool doesn’t actually require memory until it is used, only address space… We have done internally some measurements of the effects of the new implementation and they can be quite dramatic.

New versions of glibc present in RHEL6 include a new arena allocator design. In several clusters we’ve seen this new allocator cause huge amounts of virtual memory to be used, since when multiple threads perform allocations, they each get their own memory arena. On a 64-bit system, these arenas are 64M mappings, and the maximum number of arenas is 8 times the number of cores. We’ve observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons.

Setting MALLOC_ARENA_MAX to a low number will restrict the number of memory arenas and bound the virtual memory, with no noticeable downside in performance – we’ve been recommending MALLOC_ARENA_MAX=4. We should set this in hadoop-env.sh to avoid this issue as RHEL6 becomes more and more common.

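To sum up: for the sake of allocation performance, glibc uses a number of memory pools called arenas. By default on 64-bit systems each arena is 64 MB, and a process can have at most cores * 8 arenas. On a 4-core machine that means up to 4 * 8 = 32 arenas, that is, up to 32 * 64 = 2048 MB of virtual memory. You can of course change the number of arenas through an environment variable, for example: export MALLOC_ARENA_MAX=1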
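The arithmetic above can be written as a small helper for estimating the worst case on a given machine. This is only a sketch: the function name is made up for illustration, and the defaults of 8 arenas per core and 64 MB per arena are the 64-bit figures quoted above, not values read from glibc at run time.

```cpp
#include <algorithm>
#include <cassert>

// Worst-case address space (in MB) that malloc arenas may reserve.
// arena_max == 0 means "no MALLOC_ARENA_MAX cap", so the limit is
// cores * arenas_per_core, as in the 64-bit figures quoted above.
long max_arena_address_space_mb(long cores, long arena_max = 0,
                                long arenas_per_core = 8, long arena_mb = 64)
{
    assert(cores > 0 && arena_max >= 0 && arenas_per_core > 0 && arena_mb > 0);
    long arenas = cores * arenas_per_core;
    if (arena_max > 0)
    {
        arenas = std::min(arenas, arena_max);  // the cap wins over the per-core limit
    }
    return arenas * arena_mb;
}
```

For a 4-core machine this gives 2048 MB without a cap, 256 MB with a cap of 4 arenas, and 64 MB with a cap of 1.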

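Hadoop recommends setting this value to 4. That said, since these are multi-core machines and arenas were introduced to reduce contention in multithreaded allocation, setting it to the number of CPU cores is probably also a reasonable choice. After changing the value, it is best to stress-test your program to see whether the number of arenas affects its performance.

If you want to set this from within your program, you can call mallopt(M_ARENA_MAX, xxx). Because our AuthServer preallocates memory and does not allocate inside the individual threads, it does not need this optimization, so at initialization we call mallopt(M_ARENA_MAX, 1) to turn it off. Setting the value to 0 means the system configures it automatically based on the number of CPUs.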
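A minimal sketch of such an initialization call is shown below. `mallopt` and `M_ARENA_MAX` come from glibc's `<malloc.h>`; the wrapper name `cap_malloc_arenas` and the `kHaveArenaControl` flag are hypothetical helpers added here so the code still compiles on non-glibc systems.

```cpp
#include <cassert>      // on glibc this also makes __GLIBC__ visible
#if defined(__GLIBC__)
#include <malloc.h>     // mallopt, M_ARENA_MAX (glibc-specific)
#endif

#if defined(__GLIBC__) && defined(M_ARENA_MAX)
constexpr bool kHaveArenaControl = true;
#else
constexpr bool kHaveArenaControl = false;
#endif

// Caps the number of malloc arenas. Intended to be called once, early in
// main(), before worker threads start allocating.
// Returns true on success, false on failure or when not built against glibc.
bool cap_malloc_arenas(int max_arenas)
{
    assert(max_arenas >= 0);
#if defined(__GLIBC__) && defined(M_ARENA_MAX)
    return mallopt(M_ARENA_MAX, max_arenas) == 1;  // mallopt returns 1 on success
#else
    (void)max_arenas;
    return false;
#endif
}
```

Calling `cap_malloc_arenas(1)` at the top of `main()` would mirror the AuthServer setup described above. The `MALLOC_ARENA_MAX` environment variable achieves the same thing without recompiling.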

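4. An unexpected discovery

It occurred to me that in tcmalloc only small objects are allocated from a thread's own pool, while large blocks still come from the central allocator, and I wondered how glibc handled this. So I changed the per-allocation size in the test program above from 1 KB to 1 MB. Just as expected, after the initial 64 MB jump, every allocation still added another 1 MB. This suggests that the newer glibc borrowed heavily from tcmalloc's design.

The problem that had kept me busy for days was finally solved, and I was in a great mood. It taught me that a server programmer who does not understand the compiler and the operating system kernel is simply not qualified, and that I need to study more in these areas.

Source: http://blog.jobbole.com/83878/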

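[Repost] Learning data mining algorithms (7): the SVM algorithm

Posted on February 14, 2015 | Filed under Data Mining, Classification

SVM stands for support vector machine, a classic algorithm in data mining. I have been studying it for quite a while and would like to share some of what I have learned.

A support vector machine (SVM, Support Vector Machine) is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space. It is trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. This learning strategy was proposed by Vapnik and his co-workers and is a principled and powerful method. In the years since it was introduced, SVMs have outperformed most other learning systems across a wide range of applications.

I. Main idea

Build an optimal decision hyperplane such that the distance between the two classes of samples closest to the plane on either side is maximized, which gives good generalization on classification problems. Put plainly: when the sample points cannot be separated by a straight line or a few straight lines (that is, they are not linearly separable), SVM provides an algorithm that finds a surface to divide them. This surface is called the optimal decision hyperplane. Moreover, SVM is solved by quadratic optimization, so the optimal solution is unique and globally optimal. The distance maximization mentioned above means that this surface keeps the sample points of different classes as far apart as possible; in other words, finding the optimal separating hyperplane is equivalent to finding the maximum margin.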
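For the linearly separable case, the maximum-margin idea is usually written as the quadratic program below. The symbols $w$, $b$, $x_i$, $y_i \in \{-1, +1\}$ are standard textbook notation introduced here for illustration; they do not appear in the original post.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top}x_i + b) \ge 1, \qquad i = 1,\dots,n
```

The margin between the two classes is $2/\lVert w\rVert$, so minimizing $\lVert w\rVert^{2}$ maximizes the margin. Because the objective is convex and quadratic with linear constraints, the optimum is global, which is the uniqueness property mentioned above.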

Here is a picture to give an intuitive feel for the hyperplane.


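[Repost] Spark MLlib series (2): a movie recommendation system based on collaborative filtering

Posted on February 14, 2015 | Filed under hadoop

Preface

With the arrival of the big data era, digging gold out of data is becoming more and more attractive. Spark performs strongly on in-memory iterative computation and machine learning, so using Spark for data mining problems has real practical value. This article shares a hands-on recommendation example with Spark MLlib: how to build a movie rating recommendation system. The algorithm used is user-based collaborative filtering. If you are not familiar with Spark MLlib, please read my previous blog post.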

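[Repost] Similarity computation on massive data: short-text lookup with simhash

Posted on February 13, 2015 | Filed under Algorithms

The previous article, "Similarity computation on massive data: simhash and Hamming distance", introduced how simhash works, and you have probably felt the charm of the algorithm. But as the business grows, the amount of simhash data explodes: at 1 million records a day, that is 10 million in 10 days. If every inserted record has to be compared against 10 million simhashes, the computation is considerable. An ordinary PC needs 300 ms to compute 10 million Hamming distances, and 1.8 s to compare against 50 million records. That does not look too slow, since it is still on the order of a second. But let us do the math:

As the business grows we need to process 1 million records per hour. One hour is 3600 * 1000 = 3.6 million milliseconds, so a single similarity lookup may take at most 3.6 million / 1 million = 3.6 ms. Is 300 ms slow? Yes! Is 1.8 s slow? Far too slow! In many cases the first thought is to upgrade or add machines, but sometimes adding machines alone no longer solves the problem, and even when it does, it cannot be done quickly: you have to consider distribution, the customer's budget, and how long the problem can be tolerated. When your head hurts, trust that human ingenuity is boundless. Make a cup of tea, listen to some light music :) and ponder how big the universe is and what lies beyond it. What problem could possibly stump a programmer?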
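The brute-force comparison and the time budget discussed above can be sketched as follows. This assumes 64-bit simhash fingerprints; the function names and the `max_distance` threshold parameter are hypothetical and not taken from the original article.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hamming distance: the number of bit positions in which two
// 64-bit fingerprints differ.
int hamming_distance(std::uint64_t a, std::uint64_t b)
{
    std::uint64_t x = a ^ b;  // differing bits are set to 1
    int count = 0;
    while (x != 0)
    {
        x &= x - 1;           // clear the lowest set bit
        ++count;
    }
    return count;
}

// Brute-force scan: true if any stored fingerprint is within
// max_distance bits of the query. Cost grows linearly with stored.size().
bool has_near_duplicate(const std::vector<std::uint64_t>& stored,
                        std::uint64_t query, int max_distance)
{
    assert(max_distance >= 0 && max_distance <= 64);
    for (std::uint64_t fp : stored)
    {
        if (hamming_distance(fp, query) <= max_distance)
        {
            return true;
        }
    }
    return false;
}

// Time budget per record in milliseconds for a given hourly throughput.
double budget_ms_per_item(double items_per_hour)
{
    return 3600.0 * 1000.0 / items_per_hour;
}
```

With 1 million records per hour, `budget_ms_per_item` gives the 3.6 ms figure from the text, which is why a linear scan over tens of millions of fingerprints does not fit the budget.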

