Question
For a data model like:
create type city (
name text,
code int
);
create table user (
id uuid,
name text,
cities list<FROZEN<city>>,
primary key ( id )
);
create index user_city_index on user(cities);
The query
select id, cities from user where cities contains {name:'My City', code: 10};
works fine. But is it possible to run
select id, cities from user where cities contains {name:'My City'};
and ignore the code attribute, i.e. match code = <any>?
Can this be achieved using Spark?
Recommended answer
No, it is not. The documentation on using a UDT states (for a UDT column called name):
- Filter data on a column of a user-defined type. Create an index and then run a conditional query. In Cassandra 2.1.x, you need to list all components of the name column in the WHERE clause.
So querying your cities UDT collection will require all components of the city type.
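As a rough analogy for why the partial literal does not match (plain Python, not Cassandra code): a frozen UDT is treated as one whole value, so a CONTAINS check behaves like whole-value membership. The City tuple, the contains helper, and the sample data are made up for illustration.

```python
from typing import NamedTuple, Optional


class City(NamedTuple):
    """Stand-in for the frozen city UDT: compared as one whole value."""
    name: str
    code: Optional[int]


# Hypothetical contents of one user's cities list.
cities = [City("My City", 10), City("Other City", 20)]


def contains(collection, value):
    """Whole-value membership, analogous to CONTAINS on list<frozen<city>>."""
    return value in collection


full_match = contains(cities, City("My City", 10))       # every field given
partial_match = contains(cities, City("My City", None))  # code left out

print(full_match, partial_match)
```

Only the lookup that supplies both name and code finds an element; there is no wildcard for a missing field.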
I'm sure there's a way to query this in Spark, but I'll give you a Cassandra-based answer. Basically, create an additional list column, defined and indexed just to hold the list of city names, and run your CONTAINS on that. Even better would be to denormalize the city type into a query table (usersbycity) with a PRIMARY KEY definition like PRIMARY KEY(cityname, citycode, userid), and use that in addition to your user table to support queries by city name and code (or just city name).
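A minimal sketch of the two suggestions, under stated assumptions. The CQL statements are held in Python strings and are not executed here; the column name city_names, the index name, and the two helper functions are hypothetical, and a real application would send the statements through a Cassandra driver.

```python
import uuid

# Option 1 (assumed names): an extra indexed list holding only city names.
ADD_CITY_NAMES = "ALTER TABLE user ADD city_names list<text>;"
INDEX_CITY_NAMES = "CREATE INDEX user_city_names_index ON user (city_names);"
QUERY_BY_NAME = "SELECT id, cities FROM user WHERE city_names CONTAINS 'My City';"

# Option 2: a query table keyed the way the answer proposes.
CREATE_USERSBYCITY = """
CREATE TABLE usersbycity (
    cityname text,
    citycode int,
    userid uuid,
    PRIMARY KEY (cityname, citycode, userid)
);
"""


def city_names(cities):
    """Values to write into the extra city_names column for one user."""
    return [city["name"] for city in cities]


def usersbycity_rows(user_id, cities):
    """One usersbycity row per (user, city) pair, written alongside user."""
    return [(city["name"], city["code"], user_id) for city in cities]


# Hypothetical user with two cities.
user_id = uuid.UUID("00000000-0000-0000-0000-000000000001")
cities = [{"name": "My City", "code": 10}, {"name": "Other City", "code": 20}]

names = city_names(cities)
rows = usersbycity_rows(user_id, cities)
print(names)
print(rows)
```

With usersbycity, a query such as select userid from usersbycity where cityname = 'My City'; restricts only the partition key, so it needs no secondary index. The cost is that the application has to write the extra rows whenever a user's cities change.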
Remember, Cassandra works best when the tables are specifically designed to suit your query patterns. Secondary indexes are meant for convenience, not performance. Trying to augment one table to support multiple queries is an RDBMS approach to data modeling (which typically doesn't work well in Cassandra). Instead of one table that serves one query well, you usually end up with one table that serves multiple queries poorly.
Edit for your questions:
1) "Is it acceptable to have long clustering keys?"
I cannot find a definitive statement on this at the moment, but I think the bigger issue here is how clustering keys are stored and used "under the hood." Essentially, each clustering key value is appended to each column value (for quicker retrieval). Obviously, if you have a lot of them, that's going to eat disk space (not too big of a concern these days; if it is, you can counter that with the COMPACT STORAGE directive).
If you have many of them, it may eventually impact performance. I can double-check on this one and get back to you. I wouldn't go with, say, 100 clustering keys. But I don't think 10 is a big deal. I know that I've created models using 7 or 8, and they perform just fine.
2) "If there are other denormalized tables (like usersbyhobby, usersbybookread, etc.) related to user, how can I combine filtering from these tables with filters from usersbycity in one query, since there are no JOINs in C*?"
You cannot combine them at query time. What you can do, if you find that you have a query that needs data from usersbyhobby, usersbybookread, and usersbycity all at once, is create a denormalized table containing all of that data. Depending on your query needs, you may need to order the PRIMARY KEY in different ways, in which case you would need to create as many tables as you have specific queries to serve.
The other alternative would be to make individual queries and combine the results client-side. Client-side JOINs are considered a Cassandra anti-pattern, so I would use that with caution. It all depends on the needs of your application, and on whether you want to spend the majority of your time on data modeling and management or on client-side processing. Honestly, I prefer to keep the client side as simple as I can.
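A minimal sketch of the client-side option, assuming each lookup table has already been queried separately and has returned a collection of user IDs. The function name and the sample IDs are hypothetical, and no driver calls are shown.

```python
def combine_user_ids(*result_sets):
    """Intersect user IDs returned by separate per-table queries."""
    if not result_sets:
        return set()
    combined = set(result_sets[0])
    for ids in result_sets[1:]:
        combined &= set(ids)  # keep only users present in every result
    return combined


# Hypothetical results of three independent queries.
from_usersbycity = ["u1", "u2", "u3"]
from_usersbyhobby = ["u2", "u3", "u4"]
from_usersbybookread = ["u3", "u5", "u2"]

matching = combine_user_ids(
    from_usersbycity, from_usersbyhobby, from_usersbybookread
)
print(sorted(matching))
```

Every result set has to be pulled into the client before it can be intersected, which is one reason to use this approach with caution.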