First of all, if you simply hand me a pile of words and ask me to classify them, I really can't do it. I don't know whether anyone can; in any case, I can't. For me to do keyword classification, there has to be some basic information, some underlying data, behind the words.
Case 1: Baidu business word clustering model

In the news these days, people often discuss how much of Baidu's revenue comes from the medical industry. Let me share some old gossip: in 2005 and earlier, Baidu did not have this kind of data. At that time Baidu had a simple customer classification, filled in by the service staff. When we looked at the distribution of spending by industry, more than 50% fell under "Other", so the result was basically unusable.

So I wondered whether we could cluster the business keywords directly into industries. I was in the product department, and the engineer I worked with on click-fraud prevention was Zhang Huaiting (who seems to still be at Baidu). He was an algorithms person; his graduation thesis was on association rules and clustering algorithms. I went to ask him. He said a lot, most of which I didn't understand, but I roughly caught some key points. He then showed me his paper, which I also didn't understand. So I started from a superficial understanding, and it still got done.

The starting point is to assume that customers themselves have industry attributes (if this assumption does not hold, the whole approach falls apart). I treated the keywords submitted by each customer as related to each other. If two keywords are submitted together by several different customers, their relevance increases. This is the most basic definition, called the association count (the number of customers who submitted both words). It is also the easiest value to calculate. But relying on the association count alone has a problem: it causes a lot of words to be associated with hot words, which is not reasonable. I remember an online bookstore's "customers also bought" column where everything recommended was plainly a bestseller; it seemed to be built on the association count alone.
Question 1: A and B co-occur 50 times, and A and C co-occur 30 times. But B is a hot word submitted by 2,000 customers, while C is a less popular word submitted by only 50 customers. Which correlation is higher, A with B or A with C?

Question 2: Customer 1 submitted 10,000 words (customers like Alibaba and CYTS Tours really did submit that many). Customer 2 submitted 20 words. Should the correlation between any two of customer 1's 10,000 words be the same as the correlation between any two of customer 2's 20 words?

Taking these two issues into account, I applied a weight adjustment (I recall there was actually a weighting formula, but it was a long time ago and I'm not sure) and calculated the association value between each pair of words.
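To make the two questions concrete, here is a minimal sketch of one possible weight adjustment. It is not the formula I actually used (I no longer remember it); the function name `association_scores`, the per-customer damping `1 / log(1 + n)` and the cosine-style normalization are all assumptions chosen for illustration. The damping answers Question 2 (a customer with 10,000 words contributes less per pair), and the normalization answers Question 1 (a hot word's raw co-occurrence count is divided down by its popularity).

```python
import math
from collections import defaultdict
from itertools import combinations


def association_scores(customer_keywords):
    """Return {(word_a, word_b): score} with word_a < word_b.

    customer_keywords: dict mapping customer id -> iterable of keywords.
    Both weighting choices below are illustrative assumptions.
    """
    word_weight = defaultdict(float)  # weighted number of customers per word
    pair_weight = defaultdict(float)  # weighted co-occurrence per word pair

    for words in customer_keywords.values():
        words = sorted(set(words))
        if not words:
            continue
        # Assumed damping: customers with many keywords count less per pair.
        w = 1.0 / math.log(1.0 + len(words))
        for word in words:
            word_weight[word] += w
        # Note: this enumerates all pairs, so it is quadratic per customer.
        for a, b in combinations(words, 2):
            pair_weight[(a, b)] += w

    # Assumed normalization: cosine-style, so hot words do not dominate.
    return {
        (a, b): co / math.sqrt(word_weight[a] * word_weight[b])
        for (a, b), co in pair_weight.items()
    }
```

With this kind of normalization, a pair with fewer co-occurrences can still outrank a pair involving a hot word, which is the behaviour Question 1 asks for.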
So how were the weights set? Ha, to be honest, sometimes I just picked a value and checked the result. The implementation took less than an afternoon, and each run of the program took about 1 to 2 hours (Baidu did not have nearly as many business words or customers then as it does now; admittedly, my program was not very efficient either). Then I built a web interface where I could enter any word and see its associated words and association values. I looked at the bad cases, worked out which parameter was at fault, changed the parameters, and ran it again... After more than n runs over two or three days, the results felt about right. With the word-to-word associations established, I moved on to the second step, clustering. (Along the way I met a lot of bizarre business words, a real eye-opener that thoroughly improved my understanding of the Internet industry, such as "white", "miss wong tai sin", ahem, ahem, I can't say any more on this topic.)

The idea of clustering is very simple. Take the representative words of each industry (words associated with a lot of other words) as core words, then extend outward through the word associations: first-degree, second-degree and third-degree associations. For example, A is associated with B, B with C, and C with D, and the weight between A and D is calculated with attenuation at each step. As far as possible, all words are aggregated under the core words, and the vocabulary is thereby assigned to industries.
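The clustering idea can be sketched roughly as follows. This is an illustration under assumptions, not my original program: the function name `cluster_by_core_words`, the parameters `decay`, `max_hops` and `min_weight`, and the rule "multiply the association scores along the path and apply the decay once per hop" are all hypothetical choices. A word reached from the core words of two industries goes to the one with the higher weight.

```python
from collections import defaultdict


def cluster_by_core_words(assoc, core_words, decay=0.5, max_hops=3, min_weight=0.01):
    """Assign each word to the industry whose core words reach it most strongly.

    assoc: {(word_a, word_b): association score in (0, 1]}, treated as undirected.
    core_words: {industry: [core word, ...]}.
    Returns {word: (industry, weight)}; unreached words are absent.
    """
    neighbors = defaultdict(dict)
    for (a, b), score in assoc.items():
        neighbors[a][b] = score
        neighbors[b][a] = score

    best = {}
    for industry, cores in core_words.items():
        reached = {word: 1.0 for word in cores}  # core words get full weight
        frontier = dict(reached)
        for _ in range(max_hops):
            next_frontier = {}
            for word, weight in frontier.items():
                for other, score in neighbors.get(word, {}).items():
                    # Assumed attenuation: one decay factor per hop.
                    candidate = weight * score * decay
                    if candidate >= min_weight and candidate > reached.get(other, 0.0):
                        reached[other] = candidate
                        next_frontier[other] = candidate
            frontier = next_frontier
        # Keep the strongest industry for every word reached.
        for word, weight in reached.items():
            if word not in best or weight > best[word][1]:
                best[word] = (industry, weight)
    return best
```

Coverage can then be measured as the share of words (or of spending) that appear in the result, and the lowest-weight words of each industry are the natural place to look for bad cases.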
To start, I picked the core words from the word base: a little over 20 words with high association to many other words. The multi-level attenuation weights I simply made up off the top of my head. Then I ran it and looked at two indicators: first, coverage; second, accuracy. For accuracy, I looked at the lowest-association words assigned to each industry (this is where bad cases are densest; some words are associated with the core words of two industries at once, and a problem in the weight calculation gets them assigned to the wrong industry), looked at the words that had not been associated at all, and analyzed the weighting problems. Then I modified the attenuation parameters and added core words. I wrote this program in an afternoon too, but tuning the weights and adding core words took a week.

After that, Baidu's business analysis could finally launch a revenue report by industry. I'm proud to say that Baidu's revenue breakdown by industry started from my keyword classification algorithm. Of course, it had its shortcomings: my algorithm was not efficient enough (fine in the early days, but not at a larger scale with more customers), and coverage and accuracy were far from perfect (bad cases always remained, though I tried to keep them within 10% of total spending; hot words had to be right, but some long-tail words were beyond control). Still, I did this while sitting in the product department, ha ha.
Later, this model was also used for smart starting prices. A bit of gossip about smart starting prices here. It was actually a failed business experiment at Baidu, and it did a lot of damage to the business, but the original design concept had no big problem. Baidu was built on keyword bidding (the bidding model at the time was very simple; don't ask me about Baidu's current bidding model, I don't know it), and its mining of commercial value was flawed. For example, some super-hot words (such as movies and games) could not be sold even at 0.3 yuan per click. Could we sell them cheaper? Meanwhile, some long-tail words were highly valuable, but because few customers had discovered them, their prices were very low. A long-tail phrase such as "green dry cleaners latest quotation" might have only one or two bidders, but its commercial value is no lower than that of a high-priced word like "dry cleaners". So the real goal of smart starting prices was to lower the price of non-commercial words and raise the price of long-tail commercial words. I then proposed that a keyword's starting price should be tied to the average click price of its associated words. They took the model to the leadership, and it was approved quickly. (As an aside, when a certain colleague, Zhao, presented it and a leader asked about the method, he said it came from the technology department and that he was not very clear on it. He could have said plainly that I provided the prototype! The technical department did later build another version, but that's another story.)

Smart starting prices failed for two reasons. First, to improve coverage of non-commercial words at launch, they forcibly added a keyword-containment rule, which led to some bad cases (for example, two different "tablets" words where one contains the other, yet they do not belong to the same industry). The results were very bad at the time and the leadership was very unhappy, criticizing me for too many bad cases and listing a pile of them. I checked every one listed: none was calculated by my algorithm; all came from word containment. Actually this problem was not serious. Word containment produced bad cases, but their impact was limited.

The second problem was more serious: the leadership was too impatient. I had suggested launching with a low starting-price parameter and adjusting slowly based on the effect. (The basic formula is: starting price of a word = average click price of its associated words, as calculated by the algorithm, × starting-price parameter. The parameter is entirely a matter of personal judgment.) Instead, the leadership set it quite high right away. Customers were hit hard, and the cleanup lasted several months. Baidu's results that quarter were poor. Once Phoenix Nest launched, smart starting prices finally died. Phoenix Nest is more complete and more comprehensive, that must be admitted.
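The starting-price formula mentioned above (starting price = average click price of associated words × starting-price parameter) can be written out as a small sketch. The function name `starting_price`, the parameter name `factor` and its default value are hypothetical, and weighting the average by association score is my assumption for illustration; with equal scores it reduces to a plain average.

```python
def starting_price(associated, avg_click_price, factor=0.3):
    """Suggest a starting price for one keyword.

    associated: {associated word: association score} for the keyword.
    avg_click_price: {word: average click price}.
    factor: the starting-price parameter, a judgment call.
    Returns None when no associated word has a known price.
    """
    priced = [
        (score, avg_click_price[word])
        for word, score in associated.items()
        if word in avg_click_price and score > 0
    ]
    if not priced:
        return None
    total_score = sum(score for score, _ in priced)
    # Assumed: association-weighted average of the associated words' prices.
    mean_price = sum(score * price for score, price in priced) / total_score
    return mean_price * factor


# A long-tail phrase inherits a price level from the words it is associated with.
price = starting_price(
    {"dry cleaners": 1.0, "laundry": 1.0},
    {"dry cleaners": 4.0, "laundry": 2.0},
    factor=0.5,
)
```

Since everything downstream scales linearly with `factor`, starting low and raising it gradually is the cautious way to roll out a rule like this.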
Case 2: Search word / index word clustering

Of course, correlation could also be calculated from what users search for online. But first, Internet users' search behavior does not carry classification properties the way business customers' keyword submissions do. Second, my processing capacity was no match for correlation calculation on data of that size. (OK, it still isn't.) So what to base it on? On keywords plus search volume.

Again it was Zhang Huaiting who helped me; he was the engineer I worked with most at Baidu. He helped me prepare all the search-volume data, including aggregation and removing brushed (artificially inflated) queries. (Besides excluding some IPs and clearly flagged clients, the most important rule was based on channel distribution and client distribution. For a normal search term, the shares of searches coming from different channels should follow a reasonable distribution. The channels include the Baidu site, hao123, other union channels and so on. A term that deviates seriously from these proportions is basically a brushed index. However, this rule was not applied to the Baidu Index, at least not at the time. The reason seemed to be that almost every popular actress on the list appeared to have an agency or fans brushing the list.) So the search words I had in hand were all Baidu search data (index data with the brushing cleaned out), updated daily.

So how to classify? Classifying everything I really can't do, but hot words can be done. The key point is that no hot keyword exists in isolation. The words related to a hot word (based on word containment) carry roots that hint at industry attributes, and these can be used to infer the industry attributes of the hot word itself. (By the way, didn't I just say above that word containment produces bad cases? Ahem, please don't scold me. With only words and search volumes to classify with, some bad cases can only be adjusted manually.)
For example, a popular game such as Fairy Way will have many related queries like "Fairy Way guide", "Fairy Way new server", "Fairy Way props", "Fairy Way plugins", and so on. From the roots of these related words (roots can be tagged with category attributes) we infer the category of the original word, and then the category of all related words that contain it. For TV series, common roots include "episode **" and "latest episode"; for novels, common roots include "chapter **", "latest chapter", and so on. Of course, some words have several meanings, typical examples being "apple" (IT product, film, fruit) and "Previous" (TV series, game). By analyzing the roots and weighting by the search volume under each root, we get the search attribute of the word: which field it leans toward, or the proportion of each field. Yes, it is not very accurate, but it still has some value.

The implementation is as follows. For each unclassified hot word, traverse all words that contain it. Then, using the predefined set of classified roots, take each long-tail word that contains a classified root and aggregate by category, weighted by search volume, to get the category attributes of the hot word. Finally, apply those category attributes to all long-tail words that contain the hot word.

This approach is not suitable for long-tail word mining (long-tail words that contain an industry-attribute root can be covered, but the coverage is ultimately insufficient). It could, however, help the Baidu hot list a lot, and it gives some handle on automatic classification and mining of hot words. At that time many people complained to me that the Baidu hot list was not updated in time, and that some new games were already hot but still couldn't get onto the list. I shared this with the product manager and engineers of the Baidu hot list and also provided prototype code, and then nothing came of it. At least at that time, I could see the proportions of Internet users' search behavior by category (Baidu has too many long-tail words; my model covered only about 50% of search volume) and how they changed over time, such as the rapid growth in the share of video-related searches.
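The root-based procedure described above can be sketched like this. It is a minimal illustration, not my prototype code: the function name `classify_hot_word`, the tiny `ROOT_CATEGORIES` table and the rule "each long-tail query counts once, under the first root it matches" are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical root-to-category table; a real one would be far larger.
ROOT_CATEGORIES = {
    "guide": "game",
    "plugin": "game",
    "episode": "tv",
    "chapter": "novel",
}


def classify_hot_word(hot_word, query_volumes, root_categories=ROOT_CATEGORIES):
    """Return {category: share of matched search volume} for hot_word.

    query_volumes: {query string: search volume}.
    Only long-tail queries that contain hot_word and a known root are counted.
    """
    totals = defaultdict(float)
    for query, volume in query_volumes.items():
        if query == hot_word or hot_word not in query:
            continue
        remainder = query.replace(hot_word, " ")
        for root, category in root_categories.items():
            if root in remainder:
                totals[category] += volume
                break  # assumed: count each long-tail query once
    matched = sum(totals.values())
    if matched == 0:
        return {}
    return {category: volume / matched for category, volume in totals.items()}


shares = classify_hot_word(
    "apple",
    {
        "apple": 1000,
        "apple episode 1": 100,
        "apple chapter 3": 300,
        "banana guide": 500,
    },
)
```

For an ambiguous word, the returned shares give the proportion of each field; for a single-meaning hot word, the top category can be applied to every long-tail query containing it.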
This article is mostly me blowing my own horn. Well, I'll say no more; just bear with it.