Jul 10, 2010 09:58 PM|Pithecanthropus
I'll try to describe this without too much background. The short version is that I am working on a bibliographic application which, among other things, analyzes the text of documents to generate inverted lists. These are searchable by individual words ("descriptors") and give the document IDs in which each word occurs, along with the number of occurrences in each document. Naturally there's a database involved, but at this point, for this particular dataset, the table in SQL Server is empty and everything is happening with like-defined records in memory.
At this point, I'm working on a document-by-document basis, but I will eventually want to process this data in larger sets once I have confirmed that everything works at the unit level.
My data is defined as follows:
Collection: A set of documents identified for collective processing and analysis. Identified by a 7-digit integer.
CollectionDetail: The individual documents identified by a Collection, with typical bibliographic identifying data such as author, title, and call number, plus a brief abstract on which the subject analysis is based. In this case, the abstracts are mostly just the subject cataloging descriptor entries. A CollectionDetail object is identified by the CollectionID with which it is associated plus a SequenceNumber field, also an integer.
InvertedList: As described above. The fields are Descriptor, CollectionID, SequenceNumber, and Occurrences.
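To make the InvertedList definition concrete, here is a minimal in-memory sketch of how the entries for one document might be built from its abstract. The names InvertedListEntry and InvertedListBuilder are hypothetical stand-ins, not the classes in my project, and the tokenizing rule (split on whitespace and basic punctuation, compare in lower case) is only an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory shape of one inverted-list entry; field names follow the list above.
public class InvertedListEntry
{
    public string Descriptor { get; set; }
    public int CollectionID { get; set; }
    public int SequenceNumber { get; set; }
    public int Occurrences { get; set; }
}

public static class InvertedListBuilder
{
    // Splits an abstract into words and counts how often each one occurs.
    // The separator set and the lower-casing are assumptions for illustration.
    public static List<InvertedListEntry> Build(int collectionId, int sequenceNumber, string abstractText)
    {
        char[] separators = { ' ', '\t', '\r', '\n', ',', ';', '.', ':' };
        return abstractText
            .Split(separators, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(word => word.ToLowerInvariant())
            .Select(g => new InvertedListEntry
            {
                Descriptor = g.Key,
                CollectionID = collectionId,
                SequenceNumber = sequenceNumber,
                Occurrences = g.Count()
            })
            .ToList();
    }
}
```

Each distinct word yields one entry keyed by Descriptor, CollectionID, and SequenceNumber, with Occurrences holding the count for that document.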
The CollectionDetail partial class includes a method to generate the InvertedList records in the database for that document. Before I describe how that method is supposed to work, I should point out that I also have a static class ExtendIQueryable, which simply provides a convenient way of comparing two IQueryables to perform minus and subsetting queries.
public static class ExtendIQueryable
public static IQueryable<T> In<T>(this IQueryable<T> source, IQueryable<T> checkAgainst)
return from s in source
public static IQueryable<T> NotIn<T>(this IQueryable<T> source,
return from s in source
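The method bodies above are cut off after "return from s in source". A minimal sketch of what such a pair of extension methods could look like is given here, under the assumption that they filter with Contains; the class name QueryableSetSketch is hypothetical and the where clauses are a guess at the missing part, not the original code.

```csharp
using System;
using System.Linq;

public static class QueryableSetSketch
{
    // Subsetting: keeps the elements of source that also appear in checkAgainst.
    public static IQueryable<T> In<T>(this IQueryable<T> source, IQueryable<T> checkAgainst)
    {
        return from s in source
               where checkAgainst.Contains(s) // assumed filter
               select s;
    }

    // Minus: keeps the elements of source that do not appear in checkAgainst.
    public static IQueryable<T> NotIn<T>(this IQueryable<T> source, IQueryable<T> checkAgainst)
    {
        return from s in source
               where !checkAgainst.Contains(s) // assumed filter
               select s;
    }
}
```

With in-memory sequences wrapped by AsQueryable(), Contains falls back to the default equality of T, which for ordinary classes means reference equality unless Equals is overridden.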
The method in the CollectionDetail partial class starts as follows, to "calculate" the InvertedList records based on what the documents contain, and also to determine what IL records already exist in the database. These results are ProspectiveILRecs and ExistingILRecs respectively.
public bool DocumentUpdateIVRecsDB()
InvertedListRepository ivr = new InvertedListRepository();
MasterStopListRepository mslr = new MasterStopListRepository();
IQueryable<InvertedList> ExistingILRecs =
from Details in idc.InvertedLists
where Details.CollectionID == this.CollectionID
&& Details.SequenceNumber == this.SequenceNumber
Because I don't yet have anything in the InvertedList table in the database, ExistingILRecs comes back empty as I expect. ProspectiveILRecs may have a few records or a great many, but by definition will always have some.
The next step is to compare the two IQueryables, because any record currently in the database that does not appear in the prospective list would have to be deleted.
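As an illustration of that comparison on in-memory records, the sketch below finds the records to delete by comparing on the key fields rather than on object identity. It uses Enumerable.Except with a custom comparer, which is a different technique from the NotIn helper; IlRecord, IlKeyComparer, and DeleteSetSketch are hypothetical names introduced only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an InvertedList row; only the key fields matter here.
public class IlRecord
{
    public string Descriptor { get; set; }
    public int CollectionID { get; set; }
    public int SequenceNumber { get; set; }
    public int Occurrences { get; set; }
}

// Treats two records as equal when descriptor, collection, and sequence number match.
public class IlKeyComparer : IEqualityComparer<IlRecord>
{
    public bool Equals(IlRecord x, IlRecord y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Descriptor == y.Descriptor
            && x.CollectionID == y.CollectionID
            && x.SequenceNumber == y.SequenceNumber;
    }

    public int GetHashCode(IlRecord r)
    {
        if (r == null) return 0;
        return (r.Descriptor ?? "").GetHashCode() ^ r.CollectionID ^ (r.SequenceNumber << 8);
    }
}

public static class DeleteSetSketch
{
    // Records present in existing but absent from prospective, compared by key.
    public static List<IlRecord> ToDelete(IEnumerable<IlRecord> existing, IEnumerable<IlRecord> prospective)
    {
        return existing.Except(prospective, new IlKeyComparer()).ToList();
    }
}
```

When the existing set is empty, as it is for my current dataset, the result is an empty list and nothing would be deleted.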
//First delete the existing IL records that we no longer need.
IQueryable<InvertedList> ILRecsToDelete = ExtendIQueryable.NotIn(ExistingILRecs, ProspectiveILRecs);
followed by the "foreach" block
foreach (InvertedList ilrec in ILRecsToDelete)
InvertedList il = ivr.GetSingleIVLRecord(ilrec.CollectionID, ilrec.SequenceNumber, ilrec.DescriptorValue);