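Data Mining - Query Language

The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. DMQL is based on the Structured Query Language (SQL). It is designed to support ad hoc and interactive data mining, and it provides commands for specifying data mining primitives. DMQL can work with both databases and data warehouses, and it can be used to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.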

Syntax for Task-Relevant Data Specification

Here is the DMQL syntax for specifying task-relevant data -

use database database_name

or 

use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list

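Syntax for Specifying the Kind of Knowledge

Here we discuss the syntax for characterization, discrimination, association, classification, and prediction.

Characterization

The syntax for characterization is -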

mine characteristics [as pattern_name]
   analyze  {measure(s) }

The analyze clause specifies aggregate measures, such as count, sum, or count%.

For example, the following statement mines a description of customer purchasing habits -

mine characteristics as customerPurchasing
analyze count%
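To make the count% measure concrete, here is a minimal Python sketch. It is not how DBMiner computes the measure; it only assumes a hypothetical list of already generalized customer tuples (age group, item type) and reports each distinct tuple's share of all tuples as a percentage. The function name and the sample rows are illustrative assumptions.

```python
from collections import Counter


def count_percent(rows):
    """Return {generalized_tuple: percentage of all rows}."""
    total = len(rows)
    if total == 0:
        return {}  # nothing to describe
    counts = Counter(rows)
    return {key: 100.0 * n / total for key, n in counts.items()}


# Hypothetical generalized tuples: (age_group, item_type)
rows = [
    ("young", "computer"),
    ("young", "computer"),
    ("senior", "tv"),
    ("middle_aged", "computer"),
]
result = count_percent(rows)
```

Each key in result is one generalized description, and the values add up to 100.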

Discrimination

The syntax for discrimination is -

mine comparison [as {pattern_name}]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}

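For example, a user may define big spenders as customers who purchase items that cost $100 or more on average, and budget spenders as customers who purchase items that cost less than $100 on average. The mining of discriminant descriptions for customers from each of these categories can be specified in DMQL as -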

mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥$100
versus budgetSpenders where avg(I.price)< $100
analyze count
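The class split in this query can be sketched in Python. This is only an illustration under stated assumptions: the purchases mapping, the customer IDs, and the compare_spenders helper are hypothetical, and a real discrimination task would go on to compare generalized descriptions of the two classes rather than stop at the counts.

```python
from statistics import mean


def compare_spenders(purchases, cutoff=100.0):
    """Split customers by average item price and count each class."""
    counts = {"bigSpenders": 0, "budgetSpenders": 0}
    for prices in purchases.values():
        if not prices:
            continue  # no purchases, so no average to compare
        label = "bigSpenders" if mean(prices) >= cutoff else "budgetSpenders"
        counts[label] += 1
    return counts


# Hypothetical data: customer ID -> prices of the items purchased
purchases = {
    "c1": [250.0, 120.0],  # average 185.0
    "c2": [20.0, 35.0],    # average 27.5
    "c3": [100.0],         # average 100.0, on the boundary
    "c4": [150.0, 10.0],   # average 80.0
}
counts = compare_spenders(purchases)
```

The boundary case follows the query: an average of exactly $100 counts as a big spender because the target condition uses ≥.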

Association

The syntax for association is -

mine associations [ as {pattern_name} ]
{matching {metapattern} }

For example -

mine associations as buyingHabits
matching P(X:customer,W) ^ Q(X,Y) ⇒ buys(X,Z)

where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.
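One way to read the metapattern is as a template whose predicate variables P and Q get bound to concrete customer attributes. The Python sketch below enumerates such bindings. The attribute names and the instantiate_metapattern helper are hypothetical, and the sketch only generates candidate rule shapes; it does not evaluate them against data.

```python
from itertools import combinations


def instantiate_metapattern(attributes, target="buys"):
    """Yield one rule template per unordered pair of attributes."""
    for p, q in combinations(attributes, 2):
        yield f"{p}(X, W) ^ {q}(X, Y) => {target}(X, Z)"


# Hypothetical attributes of the customer relation
templates = list(instantiate_metapattern(["age", "income", "occupation"]))
```

Unordered pairs are used because swapping P and Q in a conjunction gives the same rule.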

Classification

The syntax for classification is -

mine classification [as pattern_name]
analyze classifying_attribute_or_dimension

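For example, to mine patterns that classify customer credit rating, where the classes are determined by the attribute credit_rating and the classification is named classifyCustomerCreditRating -

mine classification as classifyCustomerCreditRating
analyze credit_rating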

Prediction

The syntax for prediction is -

mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}

Syntax for Concept Hierarchy Specification

To specify a concept hierarchy, use the following syntax -

use hierarchy <hierarchy> for <attribute_or_dimension>

Different syntaxes are used to define different types of hierarchies, such as -

-schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]

-set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior

-operation-derived hierarchies
define hierarchy age_hierarchy  for age  on customer  as
{age_category(1), ..., age_category(5)} 
:= cluster(default, age, 5) < all(age)

-rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
   if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
   if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
   if (price - cost) > $250
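To show what these hierarchies do to a raw value, here is a small Python sketch of the set-grouping and rule-based cases. The function names are hypothetical. The age ranges come from age_hierarchy above. For the profit margin, the sketch takes low as below $50, medium as above $50 and up to $250, and high as above $250; the bound for the high band is an assumption of this sketch.

```python
def age_concept(age):
    """Map a raw age to its level-1 concept in the set-grouping hierarchy."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    return None  # outside the ranges the hierarchy lists


def profit_margin_concept(price, cost):
    """Map an item to a level-1 concept in the rule-based hierarchy."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if 50 < margin <= 250:
        return "medium_profit_margin"
    if margin > 250:
        return "high_profit_margin"  # assumed bound for the high band
    return None  # a margin of exactly 50 matches none of the rules


age_label = age_concept(45)
margin_label = profit_margin_concept(price=200, cost=100)
```

Returning None for uncovered values keeps the sketch faithful to the rules as written instead of guessing a band for them.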

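Syntax for Interestingness Measures Specification

Users can specify interestingness measures and thresholds with the following statement -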

with <interest_measure_name>  threshold = threshold_value

For example -

with support threshold = 0.05
with confidence threshold = 0.7
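The two thresholds above can be illustrated with a short Python sketch that computes support and confidence for a single rule and then applies the cutoffs. The transactions, item names, and helper functions are hypothetical; the definitions used are the usual ones, where support is the fraction of transactions containing both sides of the rule and confidence is the fraction of antecedent transactions that also contain the consequent.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def is_interesting(support, confidence, min_support=0.05, min_confidence=0.7):
    """Apply the support and confidence thresholds from the DMQL statement."""
    return support >= min_support and confidence >= min_confidence


# Hypothetical transactions
transactions = [
    {"computer", "printer"},
    {"computer", "printer", "scanner"},
    {"computer"},
    {"tv"},
]
support, confidence = support_and_confidence(
    transactions, {"computer"}, {"printer"}
)
keep = is_interesting(support, confidence)
```

Here the rule computer => printer clears the support threshold but falls short of the 0.7 confidence threshold, so it would be filtered out.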

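Syntax for Pattern Presentation and Visualization Specification

The following syntax allows users to specify that discovered patterns be displayed in one or more forms -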

display as <result_form>

For example -

display as table

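Full Specification of DMQL

As a market manager of a company, you would like to characterize the buying habits of customers who purchase items priced at no less than $100, with respect to the customer's age, the type of item purchased, and the place where the item was made. You would like to know the percentage of customers having that characteristic. In particular, you are only interested in purchases made in Canada and paid for with an American Express credit card. You would like to view the resulting descriptions in the form of a table.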

use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age,I.type,I.place_made
from customer C, item I, purchase P, items_sold S,  branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table

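Standardization of Data Mining Languages

Standardizing data mining languages will serve the following purposes -

Helps the systematic development of data mining solutions.
Improves interoperability among multiple data mining systems and functions.
Promotes education and rapid learning.
Promotes the use of data mining systems in industry and society.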