Some Lessons from Building a BI System with Pentaho
Posted: December 25th, 2009 | Author: mike | Filed under: Business Intelligence | Tags: Business Intelligence, Kettle, Mondrian, OLAP, Pentaho
There is not much content here. This is my first upload to SlideShare, and I am using a WordPress plugin to see how it looks when embedded in the blog.
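This is a post I have wanted to write for a long time. I posted a similar question on CNProg a while ago, but I never got around to organizing my thoughts and writing them down.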
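Since the official release of ASP.NET MVC 1.0, developers on the Microsoft ASP.NET platform have had one more option. MVC is nothing new, and Microsoft did not invent it, but this is the first time ASP.NET has included an MVC framework, and Microsoft has even released it as open source. Anyone who lived through the shift from static to dynamic web pages, the era when ASP, PHP and JSP fought it out, already knows MVC well. That is especially true for people working in Java or PHP: Struts, Phrame, and more recently Ruby on Rails and Django are all outstanding examples of MVC web frameworks. Now that Microsoft has joined in with a framework of its own, some ASP.NET developers who are used to Web Forms are excited, while others are starting to get a headache.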
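There are many signs that Microsoft will keep supporting ASP.NET Web Forms while it develops MVC (from here on, unless stated otherwise, MVC means the ASP.NET MVC framework). This means that for at least the next few years we will have two different ASP.NET development models to choose from on Microsoft's web platform: Web Forms and MVC. The two models differ in nature and place different demands on programmers. This post discusses the question that is giving some ASP.NET developers a headache: for future web development, should we choose Web Forms or MVC?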
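Comparing languages and platforms is a matter on which reasonable people differ, and the answer is not a simple yes or no. What matters is how well we understand both options before we reach the point of decision. So I think a good way to answer the question is to compare the characteristics of the two models and draw our conclusion from that.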
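The arrival of ASP.NET was a genuinely creative change in the history of dynamic web development. Microsoft took the event-driven idea from traditional desktop development and applied it to ASP.NET. Once developers understand the ASP.NET page object model, they can design a web application in Visual Studio much as they would write a Windows program, by dragging and dropping server controls and user controls. Page events and server-side code are handled through the code-behind model. ASP.NET also gave pages state: it introduced view state, which ensures that the data held by a page and its controls is not lost across postbacks. Postback is likewise a mechanism introduced by ASP.NET, in which the page talks to the server by submitting a form back to it. With all of this, Microsoft built a very large "cake" of abstraction on top of HTTP, so that developers can do ASP.NET work without understanding HTTP or HTML.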
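Meanwhile, the open-source community and other vendors began bringing the Model-View-Controller design pattern into web frameworks. The three parts of MVC represent the application's data, its presentation, and the logic that coordinates the data and the interface. This classic architectural pattern fits the way web applications are used very well, and web frameworks designed around it have a clearer program structure. In ASP.NET MVC the architecture is expressed directly in the web application: Model, View and Controller are separated by writing code in different layers. The architecture sits directly on top of the stateless HTTP protocol and gives programmers full control over input and output.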
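The page object model and the event-driven model are the core of understanding Web Forms. One can sympathize with the reasoning behind this original ASP.NET design, but a web developer absolutely must understand the HTTP request and response mechanism, as well as HTML and JavaScript. The MVC development model plainly does away with the need for the postback and view state of Web Forms, and so avoids the various problems they cause.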
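Because the loose, flexible MVC architecture makes it easy to separate concerns, the ASP.NET MVC framework supports test-driven development (TDD). In addition, MVC code is built on interfaces (including the IHTTPRequest/IHTTPResponse interfaces), so it can be unit tested without running ASP.NET, and it works with third-party unit testing tools.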
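The testability point can be illustrated in a language-neutral way. The following is a minimal Python sketch, not ASP.NET MVC code: `FakeRequest` and `show_post` are hypothetical names invented for this illustration. It only shows that a controller action which depends on an abstract request object can be exercised with plain objects, with no web server running.

```python
class FakeRequest:
    """Hypothetical stand-in for an abstract HTTP request."""

    def __init__(self, params):
        self.params = params


def show_post(request, posts):
    """Hypothetical controller action: returns (view name, model)."""
    post_id = int(request.params.get("id", 0))
    if post_id not in posts:
        return ("not_found", None)
    return ("post_detail", posts[post_id])


# A unit test needs only plain objects: no server and no browser.
posts = {1: "Hello MVC"}
found = show_post(FakeRequest({"id": "1"}), posts)
missing = show_post(FakeRequest({"id": "7"}), posts)
```

Because the action returns a view name and a model instead of writing HTML itself, a test can assert on the result directly.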
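The design of MVC lets us extend the architecture and plug into it. For example, we can implement our own view engine, URL routing rules and so on, and we can use an IoC container to decouple components and improve the application's structure.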
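MVC also introduces a powerful URL Routing component, which makes it easy to design REST-style and SEO-friendly URLs. Each URL is simply the entry point of a request; there is no need to understand every page and control event as in Web Forms.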
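To make the idea of "one URL, one entry point" concrete, here is a small Python sketch of pattern-based routing. It is an illustration of the general technique only, not the ASP.NET routing API; the route table, the handler names and the `resolve` function are all made up for this example.

```python
import re

# Hypothetical route table: each URL pattern maps to a handler name.
ROUTES = [
    (re.compile(r"^/blog/(?P<year>\d{4})/(?P<slug>[\w-]+)/?$"), "blog_post"),
    (re.compile(r"^/tags/(?P<tag>[\w-]+)/?$"), "tag_list"),
]


def resolve(path):
    """Return (handler name, captured arguments) for the first matching route."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler, match.groupdict()
    return None, {}
```

A readable address such as /blog/2009/asp-net-mvc resolves to a handler plus named arguments, which is what makes such URLs friendly to both people and search engines.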
These features make it possible to keep every layer of an application under control in large team development, which is something Web Forms lacks.
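In Web Forms, server controls and user controls, built around the event model and postback, save developers a great deal of effort when implementing common features. In MVC, however, every request and response is transparent, and there are no server controls any more. If you understand how the runtime mechanisms of Web Forms and MVC differ, you will also understand that server controls have no place in MVC, so there is little point in talking about a direct replacement.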
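What we can see, though, is that client-side controls are easier to use under MVC. There is no need to worry about a postback losing part of the data, and there are no "magic IDs" generated automatically by Web Forms controls (the good news is that ASP.NET 4.0 will solve this problem for good). With the popularity of jQuery and many other JavaScript libraries and controls, the client side will continue to shine in the browser.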
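If you really want a replacement for server controls, the unit of reuse in MVC is no longer a single control but a whole self-contained application that you plug in. The earlier comparison already argued that the extensibility of the MVC framework makes this feasible. For example, if you are building a blog with MVC and need comments, you can find an open-source commenting application that supports MVC and plug it straight into your system.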
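Although the MVC development model is very different from Web Forms, both live under the same ASP.NET tree, so we need to know which basic features they share. At least the following technologies can be used in both Web Forms and MVC:
Moreover, with the current 1.0 release of MVC, a single ASP.NET project can use MVC and Web Forms side by side.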
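If Web Forms was a well-meant "lie" that Microsoft told web developers, then MVC is a chance to "turn over a new leaf". Given the comparisons above, if the goal is simply to get an ASP.NET application done, the answer after weighing the pros and cons of both models still depends on the situation. In a sense, though, MVC is the cleaner and more open choice, and it gives developers more control. If you are new to web development, I recommend MVC: learning it will help you enormously in understanding fundamentals such as HTTP, HTML, JavaScript and CSS. If you are a veteran, I still suggest learning MVC; looking at the problem from a different angle should bring new insights.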
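After four evenings of struggling, solving one problem only to run into another, I finally decided to give up on migrating my personal project to Google App Engine, even though I consider myself unusually patient. With that, the rosy picture I had sketched out came to nothing.
The story begins last weekend...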
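Ever since sky asked me for the source code of gchen.cn, I had been meaning to improve the project's back-end features and tidy up the messy code, so that I would feel less embarrassed the next time someone asked for it :-). I had been following GAE for a while and knew about its various quotas on traffic and so on, but I figured that for a small site like mine traffic would not be a problem, and that Google would be a dependable backer. Emboldened by having just written a few Django projects, I walked into the GAE lair.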
The first question was: when migrating a Django project to GAE, how do I keep as much of the code as possible?
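GAE ships with Django 0.96 and supports its templates, views, URL routing and so on, but the models have to be rewritten, because GAE does not use a relational database; it uses BigTable and GQL queries. On top of that, the admin is gone, and components such as flatpages and sessions are not supported. (I later found open-source projects such as django-patch that restore some of the missing Django features, but on closer inspection their support was incomplete. I also did not want to pull in too many third-party libraries and depend heavily on other people's code, which is simply unnecessary for a project this small.) So I made a quick decision to rewrite the code against the stock GAE environment. There are really only two model classes (not counting the flatpages and admin I would have to design myself), and frankly a project this small cannot come to more than 1k lines of Python.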
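It went smoothly. I set up the environment, read the GAE documentation, and in less than an evening had the models and the first view written. I tried memcache and it worked well, and the original templates could be used as they were. Everything seemed to be going fine.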
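Next I started thinking about how to bring in the flatpages I had before, and the pagination. My existing projects are all written against Django 1.0, whose paginator class is very different from the one in 0.96, and with more features to come later, developing on an old Django was not acceptable. So I began to look into how to make GAE support the newer Django library.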
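The official site does provide documentation for supporting Django 1.0 through zipimport. The reason is that a single GAE app is limited to 1k files, and even with the docs, tests and similar files stripped out, Django 1.0.2 still has more than 900 files, which would leave the app itself about a tenth of the quota. Nobody would accept that! Django 1.0 was released half a year ago and is still not supported out of the box, and people on the forum give one reason or another. In the end, after several attempts, I got GAE to work with the Django 1.0.2 library. For the benefit of those who come later, here are the points to watch: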
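django.zip must contain the django directory from the Django distribution, and that directory must sit at the root of the archive. That is a mouthful, so here is roughly what the file structure looks like: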
├─django
│ ├─bin
│ │ └─profiling
│ ├─conf
│ │ ├─app_template
│ │ ├─locale
… …
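An archive with that layout can be produced with Python's standard zipfile module. The sketch below is only one way to do it and is written for a current Python 3, not the Python 2.5 that GAE ran at the time; the function name `build_django_zip` and the list of skipped directory names are my own assumptions, so adjust them to your checkout.

```python
import os
import zipfile


def build_django_zip(django_dir, zip_path, skip_dirs=("tests", "docs")):
    """Zip `django_dir` so that its base name is the top-level entry."""
    parent = os.path.dirname(os.path.abspath(django_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(django_dir):
            # Prune skipped directories in place so os.walk does not descend.
            dirs[:] = [d for d in dirs if d not in skip_dirs]
            for name in files:
                if name.endswith(".pyc"):
                    continue  # compiled files only add to the file count
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. django/conf/...
                archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path
```

Storing each path relative to the parent of the django directory is what puts `django/` at the root of the archive, which is the layout zipimport needs for `import django` to work.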
Second, the entry file (usually main.py) matters, especially these lines (my django.zip is in the project root):
import os
import sys

# Add Django 1.0 archive to the path.
django_path = 'django.zip'
# to avoid importing in every request
if django_path not in sys.path:
    # Uninstall Django 0.96.
    for k in [k for k in sys.modules if k.startswith('django')]:
        del sys.modules[k]
    sys.path.insert(0, django_path)

# Must set this env var before importing any part of Django
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'

# Force Django to reload its settings.
from django.conf import settings
# http://code.google.com/appengine/kb/commontasks.html#root_urlconf
settings._target = None
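Many of the comments were added after I ran into problems; I hope they are useful to you.
I added the existing pagination code, and it appeared to run smoothly.
I deployed to the server and it ran without problems. So far everything had been test data, though. How could I load the data from my MySQL backup onto the server?
GAE offers a bulk uploader for this. Good grief, I fought with it for a whole evening and still could not get it to work. I began to feel that the product had been put together in a hurry. Is this tool really meant for humans?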
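Roughly, it works like this. The data to upload is prepared in CSV format, with only comma separators and double-quote escaping supported, one CSV file per table (a side note of contempt here for Microsoft Excel, whose default Unicode CSV export is not supported). You then subclass a loader class that uploads the corresponding CSV file; in this class you declare the name and type of every CSV column, and of course it also needs your model class. Finally you run bulkloader.py with the CSV file and the loader class as arguments, and with luck you are done. I was not lucky. The script runs from the GAE directory, so referencing your app's models requires setting PYTHONPATH. I went and edited the GAE source directly to add my model to the path, and even after resigning myself to that I still hit other problems. A few people in the newsgroup were asking about the bulk uploader too, but not many; presumably those who got it working had new projects, and few people are migrating. In the end I gave up on the historical data, which was not very valuable to me, but I was beginning to feel that GAE was more than I could handle.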
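The CSV constraints described here (comma separators, double-quote escaping) are what Python's standard csv module writes by default, so the export side can be sketched as follows. This is a minimal illustration written for a current Python 3; the sample rows and the `rows_to_csv` helper are hypothetical, and it covers only the CSV file, not the loader class or bulkloader.py itself.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows as comma-separated text with double-quote escaping."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows exported from a MySQL table (title, body).
sample = [("Hello", 'He said "hi", then left'), ("Plain", "no quoting needed")]
csv_text = rows_to_csv(sample)
```

Fields containing a comma or a quote are wrapped in double quotes and embedded quotes are doubled, so reading the text back with `csv.reader` returns the original values.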
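Before going to bed I noticed that pages on the server loaded much more slowly than before. I checked the log and found that every request was zipimporting the django package! I had read the documentation fairly carefully and remembered that one release listed a fix for exactly this problem, and many posts found through Google talk about "used a high amount of CPU" in the past tense. Sorry to say, I ran into it anyway. Naturally I first checked whether it was my own fault: main.py checks whether the archive is already on the path before importing, so that should not be the cause. Yet the warning that each request uses 2000 ms of CPU is still there to this day (though it does not appear when there are many concurrent requests). After I added a few feeds' worth of data, operations started timing out as well. With no better option, and since it was not a major problem, I moved on to finish the other features.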
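Integration with Google Accounts is fairly convenient and saves having to deal with user permissions. I added two forms along the way; although there is no free admin, writing my own data editing was quick. The local GAE console (http://localhost:8080/_ah/admin) does have simple forms for editing data, but it renders a TextProperty as an input rather than a textarea, so I could not enter multi-line text and had to write my own form anyway. Also, the server has convenient logs and performance hints, but the local (client) side is not as good, and debugging there is very awkward. Using pdb works a little differently; I use the following lines to debug code in GAE, in case they help anyone: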
import sys
for attr in ('stdin', 'stdout', 'stderr'):
    setattr(sys, attr, getattr(sys, '__%s__' % attr))
import pdb
pdb.set_trace()
Also, memcache management on the local side kept misbehaving: either entries were still usable after a flush, or the timeout had no effect.
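Finally, a summary. GAE is a good thing. Over the years Google has won over a huge number of users and developers in every area from search to Gmail to online office applications, and standing on this giant's shoulders GAE has an excellent high-performance application infrastructure and the usual simple, practical development philosophy. In my humble opinion, however, the platform is still quite immature. It is easy to get started with, but development productivity is low; it offers high performance in return, but you have to give up a traditional database and some advanced programming techniques, and you are bound by limits on traffic, concurrent requests and so on. Google has tailored a set of pricing tiers for commercial use, but as an ordinary developer, before I try this new-generation application platform again I would like to see the following changes: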
from django.template import loader
from django.http import HttpResponse

# loads and compiles the template
myview_template = loader.get_template('path/to/template.html')

def myview(request):
    # do sth... (build `context` here)
    return HttpResponse(myview_template.render(context))
rather than calling render_to_response directly.
Last weekend I took some time to set up Python and Django on our hosting server, which we bought from Bluehost. Afterwards I also bound a domain name to my working space. Here is a walk-through of all the work that needs to be done.
# .htaccess file
AddHandler fcgid-script .fcgi
RewriteEngine On
RewriteBase /
RewriteRule ^(media/.*)$ - [L]
RewriteRule ^(static/.*)$ - [L]
RewriteCond %{REQUEST_URI} !(dispatch.fcgi)
RewriteRule ^(.*)$ dispatch.fcgi/$1 [L]
# dispatch.fcgi file
#!/home/USERNAME/python/bin/python
import sys, os
# Add a custom Python path.
sys.path.insert(0, "/home/USERNAME/python")
sys.path.insert(0, "/home/USERNAME/Django-1.0")
os.chdir("/home/USERNAME/Django-1.0/PROJECTNAME")
os.environ['DJANGO_SETTINGS_MODULE'] = "PROJECTNAME.settings"
from django.core.servers.fastcgi import runfastcgi
runfastcgi(["method=threaded", "daemonize=false"])
You need to make dispatch.fcgi executable (chmod +x). In dispatch.fcgi, replace USERNAME and PROJECTNAME with your own names. "Django-1.0" is the folder where my Django is installed.
Also remember to place your project folder under your Django installation folder. (It kept giving me an Unhandled Exception error until I figured this out.)
You will probably also want to confirm in your hosting settings that FastCGI is installed. I looked for it but could not find it; as far as I know, FastCGI is enabled by default on Bluehost. The file name "dispatch.fcgi" comes from other people's experience: reportedly, cron jobs on the server look for suspended processes and kill those that do not have such a name.
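If you would rather set the execute permission for dispatch.fcgi from a deployment script than from the shell, the chmod +x step can be sketched in Python as below. The helper name `make_executable` is my own, and the sketch assumes a Unix-like host such as the one described here.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group and others (like chmod +x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.stat(path).st_mode
```

The existing permission bits are kept and only the execute bits are added, which mirrors what chmod +x does under a typical umask.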
Start project
Everything should be ready now. Go to your Django folder and start your project with the name specified earlier:
django-admin.py startproject PROJECTNAME
It is quite easy to add a new domain in cPanel. Just add a new addon domain and point the domain folder at the one where you placed the .htaccess and dispatch.fcgi files. Copy the A record IP address and save it to your IP settings. Restart VDNS if you have it. You can also use the subdomain you just created in cPanel. That's all!
If you run into problems, check the things below:
Hope this helps.
A dictionary is one of the tools I want most on any operating system, whether Windows or Macintosh. After working on a Mac for a while, I found that I really missed the dictionary tools on Windows, where there are many to choose from.
I knew there was a built-in dictionary in Apple's OS X, but at first I did not realize how useful it was. My first impression was that there were no settings for adding a new dictionary, which meant I could not use it with my native language, Chinese. Furthermore, I could not find a way to translate text under the cursor or in a selection while reading.
I searched the internet a lot and tried some dictionaries without success. Then I thought this might be a chance to build a new dictionary product myself. However, I had no experience with Mac development yet, so I sent some questions about dictionary development to a Mac development mailing list and got replies that made me want to look at the built-in dictionary again.
First of all, the built-in dictionary does support a reading mode: immediate translation when you point at a word on screen. Reading about it in the help document proved that this tool does much more than I thought. Here is how it works when you read email in the Mail program:
Right-click, or left-click while holding CTRL, on any word. You will see "Look up in Dictionary" in the context menu; select it, and the Dictionary program pops up and shows the translation.
If that were the only tip, I would not have written this post. The trick is to select a word and press CTRL+CMD+D. You will see a small panel showing the translation just below the word. That's really awesome.

After the panel is displayed for the first time, it keeps working as you move the cursor to other words, until you click in the window.
The shortcut can be changed, like many others on the Mac. Go to System Preferences > Keyboard & Mouse > Keyboard Shortcuts, find "Look up in Dictionary", and change it to whatever you want.
However, the reading mode is only supported in programs built with Cocoa. That means you can use all of these tricks in Safari but not in Firefox. Still, the huge Firefox add-on community usually has whatever you need: I found an add-on that supports the context menu and pops up the Dictionary window.
At this point it is pretty cool, and I have the same functions I was used to on Windows. But I still could not use it to look up Chinese words. Then I found a tool on Google Code that solved the whole problem. It converts StarDict dictionaries to the Mac built-in dictionary format, and StarDict gives me plenty of useful dictionaries, including the one I often used on Windows.

I want to add two more tips for this handy dictionary. First, you can choose panel mode in its preferences; it then displays the cool panel instead of a pop-up window when you look up a word from the context menu.
Second, you can look up a word and open Dictionary from any browser by using the "dict" protocol. For instance, to look up the word "gorgeous", enter this line in your Firefox address bar:
dict://gorgeous
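If you want to generate such links from a script, a dict:// address can be built as in the short Python sketch below. The helper name `dict_url` is made up for this example, and percent-encoding multi-word entries is my assumption about what the handler accepts, not something I have verified against every version of Dictionary.

```python
from urllib.parse import quote


def dict_url(word):
    """Build a dict:// URL for the given word, percent-encoding as needed."""
    return "dict://" + quote(word.strip())
```

For a single plain word the result is simply the word appended to the scheme, as in the address bar example.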
Software environment:
Leopard 10.5.4, Apache 2.2.8, Python 2.5, MacPorts 1.6, Subversion 1.4.4
Install mod_python for Apache
Note: The steps below record my first installation attempt, which failed. If you want to install mod_python without the problems described, go to the UPDATE block at the bottom.
I installed mod_python from the downloaded source files. It can also be installed from MacPorts.
Before you install it, make sure you have Apache and Python installed already. (They are part of Leopard anyway, if you have not changed the system.)
./configure
make
sudo make install
Then open httpd.conf and add one line to load the mod_python module into Apache:
LoadModule python_module libexec/mod_python.so
The actual path of the module file may be different on your machine, but you can find the right one in the output of the make install command.
After all of the above changes you may try to restart the Apache service, but you will encounter this error:
$ sudo apachectl -k restart
httpd: Syntax error on line 116 of /private/etc/apache2/httpd.conf: Cannot load /usr/libexec/apache2/mod_python.so into server: dlopen(/usr/libexec/apache2/mod_python.so, 10): no suitable image found. Did find:
	/usr/libexec/apache2/mod_python.so: no matching architecture in universal wrapper
I posted this error to the mod_python mailing list and was told that it is a 32/64-bit architecture mismatch. More details can be found here.
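When chasing a 32/64-bit mismatch like this, one quick diagnostic is to ask the Python interpreter which width it is running as. The sketch below uses only the standard library and is written for a current Python 3; it tells you about Python alone, so the Apache binary and the built mod_python.so still have to be checked separately.

```python
import platform
import struct


def interpreter_bits():
    """Return the pointer width, in bits, of the running Python interpreter."""
    return struct.calcsize("P") * 8


print("Python is running as %d-bit on %s" % (interpreter_bits(), platform.machine()))
```

If Python reports 32 while Apache runs as a 64-bit process, a module built against that Python is a likely candidate for the "no matching architecture" error.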
UPDATE:
You need to get the latest source files from SVN and install again:
svn co http://svn.apache.org/repos/asf/quetzalcoatl/mod_python/trunk mod_python-trunk
Install MySQL
This section follows the installation of MySQL 5.0 through MacPorts on Mac OS X 10.5 (Leopard). If you have a different version of MySQL, or installed it by some other means than MacPorts, please refer to the official MySQL documentation.
First, make sure you have already updated your .profile so that MacPorts is easy to access from the shell, for example:
# Setting PATH for MacPorts
export PATH=$PATH:/opt/local/bin:/opt/local/sbin
Now you can use port and other commands directly (for instance, the MySQL commands below). All shell commands start after $, and some are preceded by a comment line starting with #.
# install MySQL in server mode, this will create a launch item for MySQL to start with server
$ sudo port install mysql5 +server
# initialize system databases
$ sudo mysql_install_db5
# grant permissions for mysql account
$ sudo chgrp -R mysql /opt/local
$ sudo chown -R mysql /opt/local
# start mysql service with mysql account
$ sudo mysqld_safe5 --user=mysql &
# update root’s password
$ mysqladmin5 -u root password root
# log in with the root account
$ mysql5 -u root -p
Install MySQLdb
This package provides the Python bindings for MySQL databases. The source files can be downloaded from the project web page; the latest version is 1.2.2. I failed to install it from MacPorts, however: when I tried to pick up the py25-mysql package, Python itself was installed instead.
Note: There is one error when you install this package. You need to update one source file; more details and installation steps can be found here.
References:
http://projectmouse.org/2013/InstallingDjangoforLeopardwithMySQLSupport
http://sourceforge.net/forum/forum.php?thread_id=1718937&forum_id=70461
Install Django
The Django version found by searching MacPorts is still old (0.96.1), so I decided to get the source from SVN.
svn co http://code.djangoproject.com/svn/django/trunk/ django-trunk
Create a symbolic link so that Python can load Django's code.
ln -s `pwd`/django-trunk/django SITE-PACKAGES-DIR/django
(SITE-PACKAGES-DIR can be found by running this command: python -c "from distutils.sysconfig import get_python_lib; print get_python_lib()")
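The one-liner above is written for Python 2.5. On a current Python 3, where distutils has been removed from the standard library (as of 3.12), the same directory can be looked up with the sysconfig module. A minimal sketch follows; the function name `site_packages_dir` is my own.

```python
import sysconfig


def site_packages_dir():
    """Return the directory where pure-Python third-party packages are installed."""
    return sysconfig.get_paths()["purelib"]


print(site_packages_dir())
```

The printed path plays the role of SITE-PACKAGES-DIR in the ln -s command.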
Finally, create one more link, to the django-trunk/django/bin/django-admin.py file:
ln -s `pwd`/django-trunk/django/bin/django-admin.py /usr/local/bin
(As an alternative to the last two steps, you can set PATH on the system instead.)
Just how concise and efficient is Python?
import glob
import os

files = glob.glob('/cygdrive/c/Users/Mike/Desktop/Data Export/cube_export_*.txt')
for path in files:
    # get file path and file name
    (filePath, fileName) = os.path.split(path)
    rfile = open(path, "r")
    wfile = open(os.path.join(filePath, fileName + "_calculations.txt"), "w")

    # get the content needed in text and write to new file
    for line in rfile:
        pos = line.find("Statement:")
        if pos > -1:
            # keep only the text after the "Statement:" marker
            # (str.lstrip would strip a set of characters, not a prefix)
            wfile.write(line[pos + len("Statement:"):].rstrip())
            wfile.write(";\r\n")
    wfile.close()
    rfile.close()
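The code above does the following:
Below is a database example, in which I concatenate several fields from one table and insert them into another. It only shows how flexible Python is at string formatting.
Using a list might work even better. Still, these two examples are enough to show Python's charm.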
import pymssql

conn = pymssql.connect(host='CDCVM4505', user='sa', password='sysadm', database='es0617')
cur = conn.cursor()

query = "SELECT MemberID, Comment_Id FROM [Cell_Comment_Key_Budget2008_V2_1] ORDER BY Comment_Id"
cur.execute(query)
# fetch everything first, so the same cursor can be reused for the updates
rows = cur.fetchall()

# every 10 consecutive rows form one key; an incomplete tail is ignored
for start in range(0, len(rows) - len(rows) % 10, 10):
    chunk = rows[start:start + 10]
    new_key = ";".join(["%s" % mid for mid, cid in chunk])
    cid = chunk[-1][1]
    # let the driver quote the values instead of formatting them into the SQL
    cur.execute("UPDATE [Cell_Comment_Budget2008_V2_1] SET Comment_Key = %s WHERE Comment_Id = %s",
                (new_key, cid))

conn.commit()
conn.close()
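The grouping logic in this example can be tried without a database. The sketch below is a pure-Python illustration under my own assumptions: `chunked` and `build_keys` are hypothetical helper names, each row is a (MemberID, Comment_Id) pair, and a trailing group smaller than the chunk size is dropped, as in the loop over rowcount/10.

```python
def chunked(rows, size):
    """Yield consecutive full groups of `size` rows, dropping a partial tail."""
    for start in range(0, len(rows) - len(rows) % size, size):
        yield rows[start:start + size]


def build_keys(rows, size=10):
    """Map each group's last Comment_Id to its ';'-joined MemberID values."""
    keys = {}
    for group in chunked(rows, size):
        comment_id = group[-1][1]
        keys[comment_id] = ";".join(str(member_id) for member_id, _ in group)
    return keys
```

Separating the key building from the UPDATE statement makes the string handling easy to check on sample rows before touching real data.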
This article is inspired by Scott Hanselman and comes with no guarantee of being kept up to date.
Last updated by Mike – Vejle, Denmark. June 1st, 2008.