I have a data set of books and authors, with a many-to-many relationship.
There are about 10^6 books and 10^5 authors, with an average of 10 authors per book.
I need to perform a series of operations on the data set, such as counting the number of books by each author or deleting all books by a certain author from the set.
What would be a good data structure that will allow fast handling?
I'm hoping for some ready made module that can provide methods along the lines of:
obj.books.add(book1)
# linking
obj.books[n].author = author1
obj.authors[m].book = book1
# deleting
obj.remove(author1) # should automatically remove all links to the books by author1, but not the linked books
I should clarify that I prefer not to use a database for this, but to do it all in memory.
Thanks
`sqlite3` (or any other good relational DB, but `sqlite` comes with Python and is handier for such a reasonably small set of data) seems the right approach for your task. If you'd rather not learn SQL, SQLAlchemy is a popular "wrapper" over relational DBs, so to speak, that allows you to deal with them at any of several different abstraction levels of your choice.
And "doing it all in memory" is no problem at all (it's a bit silly, mind you, since you'll needlessly pay the overhead of reading in all the data from somewhere more persistent on each and every run of your program, while keeping the DB in a disk file would save you that overhead -- but that's a different issue;-). Just open your sqlite database as `':memory:'` and there you are -- a fresh, new relational DB living entirely in memory (for the duration of your process only), no disk involved in the procedure at all. So, why not?-)
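A minimal sketch of opening such an in-memory database with the standard library:

```python
import sqlite3

# Open a brand-new relational DB living entirely in memory;
# no disk file is involved, and it vanishes when the process ends.
conn = sqlite3.connect(':memory:')

# Sanity check: the connection is live and queryable.
row = conn.execute('SELECT sqlite_version()').fetchone()
```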
Personally, I'd use SQL directly for this task -- it gives me excellent control of exactly what's going on, and easily lets me add or remove indices to tweak performance, etc. You'd use three tables: a `Books` table (primary key ID, other fields such as Title &c), an `Authors` table (primary key ID, other fields such as Name &c), and a "many-to-many relationship table", say `BookAuthors`, with just two fields, `BookID` and `AuthorID`, and one record per author-book connection.
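As a sketch, the three tables might be declared like this (the non-key columns are illustrative; note that sqlite only enforces foreign keys when the pragma is switched on):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # sqlite enforces FKs only when asked

conn.executescript('''
    CREATE TABLE Books (
        ID    INTEGER PRIMARY KEY,
        Title TEXT NOT NULL
    );
    CREATE TABLE Authors (
        ID   INTEGER PRIMARY KEY,
        Name TEXT NOT NULL
    );
    -- One record per author-book connection.
    CREATE TABLE BookAuthors (
        BookID   INTEGER NOT NULL REFERENCES Books(ID),
        AuthorID INTEGER NOT NULL REFERENCES Authors(ID),
        PRIMARY KEY (BookID, AuthorID)
    );
''')
```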
The two fields of the `BookAuthors` table are what's known as "foreign keys", referring respectively to the ID fields of `Books` and `Authors`, and you can define them with an `ON DELETE CASCADE` so that records referring to a book or author that gets deleted are automatically dropped in turn -- an example of the high semantic level at which even "bare" SQL lets you work, which no other existing data structure can come close to matching.
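A short end-to-end sketch of the cascade behavior and a books-per-author count -- two of the operations the question asks for. Table and column names follow the answer; the sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # needed for ON DELETE CASCADE in sqlite

conn.executescript('''
    CREATE TABLE Books   (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Authors (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE BookAuthors (
        BookID   INTEGER REFERENCES Books(ID)   ON DELETE CASCADE,
        AuthorID INTEGER REFERENCES Authors(ID) ON DELETE CASCADE,
        PRIMARY KEY (BookID, AuthorID)
    );
    INSERT INTO Books   VALUES (1, 'Book One'), (2, 'Book Two');
    INSERT INTO Authors VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO BookAuthors VALUES (1, 1), (1, 2), (2, 1);
''')

# Counting the number of books by each author is a simple GROUP BY.
counts = dict(conn.execute('''
    SELECT Name, COUNT(*) FROM Authors
    JOIN BookAuthors ON Authors.ID = AuthorID
    GROUP BY Authors.ID
'''))
# counts == {'Alice': 2, 'Bob': 1}

# Deleting an author cascades to its link records -- but not to the books.
conn.execute('DELETE FROM Authors WHERE ID = 2')
remaining_links = conn.execute('SELECT COUNT(*) FROM BookAuthors').fetchone()[0]
books_left = conn.execute('SELECT COUNT(*) FROM Books').fetchone()[0]
# remaining_links == 2, books_left == 2
```

This is exactly the `obj.remove(author1)` behavior the question wants: the author's links go away automatically, the linked books stay.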