Chapter: A Scalable Software Development Model
6. Clustering and Data Mining
Data mining is the process of identifying patterns within the data you
have acquired. The purpose of doing this is to place the relationships
that exist between the various aspects of your data in a more
mathematical context that can ultimately be used programmatically (in
the program you are developing).
Data mining can be done manually or it can be automated. When dealing
with a small data set that contains a high proportion of inconsistent
data types, manual data mining can be more effective and save you time.
For example, let's take a hypothetical situation in which the owner of
a small café has asked you to determine which items a customer is
likely to purchase together. The data set you have acquired consists of
all the items customers have purchased in the store over the past year.
From that data you could determine that goods can be divided into
food, drinks, magazines, stationery, etc., and then into even smaller
groups such as fruit, vegetables, soft drinks, etc. Identifying these
groups is the first step of data mining, a process also known as
clustering. If the café has a large variety of items to choose from,
the clusters making up the data could be numerous, yet the data set as
a whole is actually quite small, consisting of only a single year of
purchased items.
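The clustering step described above can be sketched in a few lines of
Python. This is only a minimal illustration: the CATALOGUE mapping, the
item names and the cluster_items function are all hypothetical, and a
real catalogue would come from the café's own records.

```python
from collections import defaultdict

# Hypothetical catalogue (an assumption for illustration only):
# item -> (broad group, narrower subgroup).
CATALOGUE = {
    "apple": ("food", "fruit"),
    "banana": ("food", "fruit"),
    "carrot": ("food", "vegetables"),
    "cola": ("drinks", "soft drinks"),
    "lemonade": ("drinks", "soft drinks"),
    "notebook": ("stationery", "paper"),
}


def cluster_items(purchased_items):
    """Group items into nested clusters: group -> subgroup -> list of items.

    Items that are not in the catalogue are returned separately, since
    they do not fall into any known cluster.
    """
    clusters = defaultdict(lambda: defaultdict(list))
    unclustered = []
    for item in purchased_items:
        if item in CATALOGUE:
            group, subgroup = CATALOGUE[item]
            clusters[group][subgroup].append(item)
        else:
            unclustered.append(item)
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {g: dict(subs) for g, subs in clusters.items()}, unclustered


purchases = ["apple", "cola", "banana", "gift card", "carrot", "cola"]
clusters, unclustered = cluster_items(purchases)
```

Here the groups are assigned by hand through the catalogue, which
mirrors the manual approach; nothing in this sketch discovers the
groups automatically.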
In contrast, suppose a more established café that has been operating
for several years poses the same question to you. In the case of the
smaller café, with only a year of data, any given grouping of
purchased items has a lower probability of recurring. The more
established café has a better chance of the same groups of items
being purchased together, because its data covers a longer period of
time.
In the former case, manually mining the data set could be more
effective because of the lower probability of the same items being
purchased together in a relatively short space of time. In this
scenario, most of your time would be spent on clustering and on
populating the resulting groups, after eliminating the majority of the
items purchased in that year because they do not fall into any cluster.
However, in the case of the established café, although there may
be just as many clusters, the values that populate these clusters have
a higher probability of being repeated. It might therefore be more
efficient to have a computer program count, cluster and mine the data.
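As a rough sketch of what such a program might count, the Python
example below tallies how often each pair of items appears in the same
purchase. The transactions list and the count_pairs function are
hypothetical, and the example assumes the data is already organised as
one list of items per purchase, which the text does not specify.

```python
from collections import Counter
from itertools import combinations


def count_pairs(transactions):
    """Count how many purchases contain each unordered pair of items."""
    pair_counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so ("coffee", "muffin") and
        # ("muffin", "coffee") are counted as the same pair.
        unique_items = sorted(set(basket))
        pair_counts.update(combinations(unique_items, 2))
    return pair_counts


# Hypothetical purchases, one list of items per customer visit.
transactions = [
    ["coffee", "muffin"],
    ["coffee", "muffin", "magazine"],
    ["tea", "muffin"],
    ["coffee", "magazine"],
]

pair_counts = count_pairs(transactions)

# Keep only the pairs that were bought together more than once.
repeated = {pair: n for pair, n in pair_counts.items() if n >= 2}
```

With several years of purchases, the repeated pairs become the
interesting part of the result, which is why automating the count pays
off for the larger data set.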
Regardless of whether you choose to manually mine your data or have a
software program do the work for you, you should have a set of data at
the end of the process that can be manipulated programmatically.
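One possible shape for that final data set is a lookup from each item
to the items most often bought with it. The sketch below is an
assumption about what "manipulated programmatically" might look like in
practice; PAIR_COUNTS, build_lookup and likely_companions are
hypothetical names, and the counts are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical mined result: how often each pair was bought together.
PAIR_COUNTS = {
    ("coffee", "muffin"): 40,
    ("coffee", "magazine"): 10,
    ("muffin", "tea"): 25,
}


def build_lookup(pair_counts):
    """Turn pair counts into a per-item table of companion items."""
    lookup = defaultdict(Counter)
    for (first, second), count in pair_counts.items():
        # Record the relationship in both directions.
        lookup[first][second] += count
        lookup[second][first] += count
    return lookup


def likely_companions(lookup, item, top=2):
    """Return up to `top` items most often bought together with `item`."""
    companions = lookup.get(item, Counter())
    return [other for other, _ in companions.most_common(top)]


lookup = build_lookup(PAIR_COUNTS)
muffin_companions = likely_companions(lookup, "muffin")
coffee_companions = likely_companions(lookup, "coffee")
```

A program could then answer the café owner's original question with a
single call such as likely_companions(lookup, "muffin").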
Figure: The process of getting external data into a program via
clustering and data mining.