
From , , & :
In Guangzhou, there are about 200,000 Africans, a number increasing 30-40% every year. The locals believe their language skills are very poor and that they are undisciplined and unorganized, yet their religious and community groups are tightly organized. In Guangzhou, they have built the largest African community in Asia.
As Chinese companies have entered Africa to find resources, African businessmen have also come to China, “the world’s factory”. These businessmen ship cheap goods to Africa, where 50 far-away African countries quickly consume these daily consumables that can’t be produced in their own countries. At the end of the 1990s, the first batch of Africans came to Guangzhou, their first stop being Canaan clothing market [Clothes Trading Center]. Now, with Canaan clothing market as the center, many export goods markets have sprung up in the surrounding one-kilometer area. The people of Guangzhou have gradually come to call this area “Chocolate City”.
Most Africans don’t actually live in China long-term, instead frequently traveling between Africa and China, as seldom as once or twice a year or as often as once a month. The majority operate a shop in their own country, but personally come to Guangzhou to select goods to ship back. Photo is of several African businessmen and a Chinese businessman negotiating prices.
Scant funds, indifference to brands, a love of bargaining, and a preference for low-end products are the characteristics of the large groups of African businessmen. Over time, these characteristics have led Chinese businessmen to discriminate against and become impatient with them. “[They’re] the most practical in doing business, whereas you can see those Europeans and Americans and Arabs are just different,” a Chinese seller said. Nevertheless, the trade market’s business flourishes every day, and the African demand for cheap goods has allowed the processing factories on the outskirts of Guangzhou to prosper. Photo is of a Chinese seller wiping the nose of an African buyer, using “friendliness” to get business.
In Xiaobei, not far from the trading market, is Guangzhou’s largest African neighborhood. Many Africans coming to China for the first time will stay here, living with several or even over a dozen people in a room, beginning their “gold rush” here. Why have they collectively chosen Xiaobei? One long-term researcher of Africans in Guangzhou says: “This place has Guangzhou’s first proper Muslim restaurant.” And in Africa, those who believe in Islam are the majority. Photo is of an African youth eating at a food stall.
As it is understood, there are 20,000 Africans who have stayed over 6 months. However, if those who illegally overstay and those who frequently come and go are added together, “the real number” should be around 200,000, equivalent to 2% of Guangzhou’s registered population. The expansion of the Africans’ export business has also spawned African restaurants, African logistics, African intermediaries, and other supporting businesses. African businessmen have also brought African laborers and African service staff. Photo is of locals who have become used to seeing Africans.
At night, the people of “Chocolate City” begin their nightlife. Even those who are the lowest level of black laborers will come out to spend their meager salaries. In practice, the main reason they like to come out at night is to avoid police inspections. Affected by visa “tightening”, quite a number of the African laborers here do not have legal residence permits, while many visas and passports have also expired. Photo is of Africans drinking beer at a food stall.
Guangzhou’s other African gathering spot is Shishi Catholic Church. Every Sunday afternoon, Shishi Church’s English Mass feels as if one were in Africa. Not only are 80% of the congregation at mass “black faces”, even the service staff are all African youths. Sometimes, there are over 1000 Africans attending mass.
When Nelson, a Nigerian, arrived in Guangzhou, he lived a typical “luggage bag” life: carrying several tens of thousands of yuan here to purchase goods and afterward stuffing them all into a few large luggage bags to fly back with him to Africa. “If I’m lucky, I can get on the plane without it being overweight and having to ship it.” Nelson says that the money for his airplane ticket and for the goods to be purchased was pooled together by his entire family, and that he must earn money, otherwise he will be looked down upon when he returns to Africa. Photo is of Nelson at a motorcycle parts store selecting goods.
Photo is of Nelson, having finished a day’s work, giving a little girl begging on the street some spare change.
For a first-timer like Nelson, language is the biggest barrier. On this day, Nelson has just discovered that the batch of goods he just purchased is short a few shirts. He doesn’t know if the Chinese seller forgot and wants to call to ask but isn’t able to. Language barriers and major cultural differences often bring a lot of trouble, and this makes him very depressed. Photo is of Nelson ordering food at an African restaurant.
Even though some say only 15% of Africans have obtained success in Guangzhou, Nelson believes that simply having his own business in China can be considered good fortune. There are also African compatriots who have come here to work for a living but end up being unable to save enough to go back, and must even avoid the 500 yuan fine for each day overstaying on an expired visa. Photo is of Nelson at a printing shop developing photographs to mail to his family in Nigeria.
Ojukwu Emma, a Nigerian, in comparison to those Africans living in the villages-within-the-city [ghettos], is of the minority that has their own office. At the same time, Emma is also the “head” of the Nigerian Association. Photo is of Emma and his Chinese employee in his office, with the Hong Kong SAR flag, Nigerian flag, and Chinese national flag in the back.
Authorized by the Nigerian Embassy, and because the number of Nigerians in Guangzhou is large, the association helps them take care of certain things more easily, such as collecting money for medical care for compatriots who are sick and helping newcomers with living arrangements. Emma says the association not only helps its own countrymen, it also helps Chinese and people of other countries in disputes with Africans. Photo is of a Chinese businessman who, with Emma’s help, was able to recover money that had been conned from him. Emma says: “It isn’t easy being the association’s leader, as one gives much more than one receives.”
The vast majority of Africans doing business in Guangzhou organize themselves by country, each having their own associations and leaders. Their titles are different, with some called “Chairman”, while others called “Leader”. Most of them are older, more highly educated, and their businesses more successful.
The leading reason for these African businessmen’s success is because they are known for “being trust-worthy/honest, and handle business according to Chinese norms”. Photo is of a meeting of the Nigerian Association where a businessman’s watch has a Chinese “five star” flag on it.
There are even some African bosses who through their ability and economic foundations have married and had children in Guangzhou, laying down roots in China. Photo is of an African businessman and his Chinese wife.
However, most Africans still live in their own circles, believing that Chinese people are very difficult to engage, and thus difficult to become friends with. One African businessman says: “My family has asked me what I have seen in China, and I say I have only seen jeans and black people.” According to Arnold, an Associated Press journalist previously based in Africa, China doesn’t actually have racial discrimination against Africans: “the so-called discrimination is instead similar to how urban residents discriminate against people from the rural countryside who have no money and don’t know the rules.” Photo is of Chinese and Africans at a food stall.
A Chinese person who is familiar with Africans says: Africans are afraid of the police, so they do their best to avoid contact with them. According to regulations, they are supposed to register at the foreigner service center within 24 hours of entering Guangzhou, but out of this fear they don’t go register, which actually creates problems for themselves. Photo is of a police officer checking the identification of an African person.
In July 2009, a black man attempting to hide from a Guangzhou police inspection accidentally fell 18 meters from a building to his death. This incident incited hundreds of blacks to gather in front of the police station the next day in a confrontation with the police.
At that moment, Guangzhou’s Africans began expressing their own voice. Photo is of an African giving the middle finger during the confrontation.
Posted on a street-side photography stall are several souvenir photographs of Africans. Even though visas to enter China have tightened, and even though in China they still get looks, the number of Africans going to Guangzhou still increases by 30-40% every year. Reports show that more and more Africans are gradually spreading through Guangzhou to Beijing, Shanghai, and other cities.
Comments from NetEase:
谬悠 [NetEase netizen from Fuzhou, Fujian]:
If you can understand Fujian people leaving their hometowns, going all over the world to find a living,
then you can understand these Africans.
Of course this has nothing to do with the law.
drinkspring [NetEase netizen from Wuxi, Jiangsu]:
Guangdong should limit the number of Africans.
溜溜乡王保长 [NetEase netizen from Chongqing]:
Chinese people will pay the price for their kindness, and the Chinese government’s lack of supervision will become a factor in Guangzhou’s future unrest/turmoil. Next, Guangdong’s Africans will strive for political rights and the support of the international community.
NetEase netizen from Shanghai:
I think children who come from the rural countryside are even more capable of understanding and empathizing with the situation for Africans in Guangzhou, because both of them are people who live at the lowest level of society, forced to expend N times more effort in order to climb up.
崖柳 [NetEase netizen from Yantai, Shandong]:
What can Africans bring us besides AIDS?????!!!!
I am a customs officer who monitors infectious diseases; just look at how many of the people we check are AIDS sufferers from Africa and you’ll know we should keep such garbage far away.
hugejob [NetEase netizen from the United States]:
To be honest, I don’t have any confidence/trust towards Africans.
NetEase netizen from Yiwu, Jinhua, Zhejiang:
Poverty breeds violence and crime, while wealth breeds greed and slaughter.
路漫漫兮要修理矣 [NetEase netizen from Wenzhou, Zhejiang]:
On one side is family planning and on the other side is a loose immigration policy. I won’t see a black person become Chairman in my lifetime.
NetEase netizen from Guangzhou, Guangdong:
Chinese people know their place and are orderly wherever they are, an active and motivated people… As for black people, they are lazy and carefree wherever they are, and like to cause trouble, not diligent in learning nor in work. One day Guangzhou too will have riots, beating, smashing, and looting, and then they’ll recognize their mistake, and be paralyzed. While Han people are limited from having children, these black people have so many; what will we do when they come? Paralyzed. Go to Xiaobei, it’s all black people; I don’t even know if I am in Guangzhou or Africa.
NetEase netizen from Luoyang, Henan [桜木様]:
I’ve been to Guangzhou and I must say this place has almost become Africa… Strongly demand that black illegal immigrants in Guangzhou be investigated. You guys should stop posturing, go to Guangzhou yourself and take a look, and you’ll see. So sad.
NetEase netizen from Guilin, Guangxi [没毛的狮子]:
Blacks are simply a low-level race—– This comment is something I heard elsewhere.
Think about it and you know it is true. When white people ruled South Africa and social resources were in white people’s hands, every aspect of South Africa achieved great development! But after Mandela overthrew white rule, it can’t be said that South Africa hasn’t developed, but the development has all been focused on modern technology to improve society, with almost no development in social control, and the violence rate has increased daily!!
Then look at all the black people in the world. Those who are successful are obviously concentrated in sports, obviously becoming rich overnight, basically all unable to control themselves, repeatedly dividing their assets in divorces until they are bankrupt. Or those who can control themselves all squander their wealth, while those black people who can be considered at the top of any field are rarer than rare, and can essentially be considered non-existent.
In reality, black people are gluttonous and lazy, unrealistic, and those who can work hard are rarer than rare, wanting in their bones to do little but still get a lot. They don’t seek to improve themselves!!
NetEase mobile netizen from Shanxi:
I haven’t interacted with black people so I don’t know, but them relying on their own labors can’t be wrong, right? At least it is much better than those who rely on their parents.
NetEase netizen from Hubei:
The inevitable product of reform and opening up. Right now, the only thing to do is put an end to discrimination.
zxr584520 [NetEase netizen from Chengdu, Sichuan]:
I look down on two kinds of people the most: One is racists and the other is black people!
xmair [NetEase netizen from Xiamen, Fujian]:
The majority of blacks are representatives of promiscuity, violence, and AIDS.
What do you think?
Can MySQL reasonably perform queries on billions of rows? - Database Administrators Stack Exchange
I am planning on storing scans from a mass spectrometer in a MySQL database and
would like to know whether storing and analyzing this amount of data is remotely
feasible. I know performance varies wildly depending on the environment, but I'm
looking for the rough order of magnitude: will queries take 5 days or 5
milliseconds?
Input format
Each input file contains a single run; each run is comprised of a set of scans, and each scan has an ordered array of datapoints. There is a bit of metadata, but the majority of the file is comprised of arrays of 32- or 64-bit ints or floats.
Host system
|----------------+-------------------------------|
| OS             | Windows 2008 64-bit           |
| MySQL version  | 5.5.24 (x86_64)               |
| CPU            | 2x Xeon E5420 (8 cores total) |
| SSD filesystem | 500 GiB                       |
| HDD RAID       |                               |
|----------------+-------------------------------|
There are some other services running on the server, but they use negligible processor time.
File statistics
|------------------+--------------|
| number of files  |              |
| total size       |              |
| min size         |              |
| max size         |              |
| total datapoints | ~200 billion |
|------------------+--------------|
The total number of datapoints is a very rough estimate.
Proposed schema
I'm planning on doing things "right" (i.e. normalizing the data like crazy) and
so would have a runs table, a spectra table with a foreign key to runs,
and a datapoints table with a foreign key to spectra.
The 200 Billion datapoint question
I am going to be analyzing across multiple spectra and possibly even multiple
runs, resulting in queries which could touch millions of rows. Assuming I index
everything properly (which is a topic for another question) and am not trying to
shuffle hundreds of MiB across the network, is it remotely plausible for MySQL
to handle this?
UPDATE: additional info
The scan data will be coming from files in an XML-based format. The meat of this format is in the <binaryDataArrayList> elements where the data is stored. Each scan produces >= 2 <binaryDataArray> elements which, taken together, form a 2-dimensional (or more) array of the form [[123.456, 234.567, ...], ...].
These data are write-once, so update performance and transaction safety are not concerns.
My naïve plan for a database schema is:
runs table

| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| start_time  | TIMESTAMP   |
|-------------+-------------|

spectra table

| column name    | type        |
|----------------+-------------|
| id             | PRIMARY KEY |
| spectrum_type  |             |
| representation | INT         |
| run_id         | FOREIGN KEY |
|----------------+-------------|

datapoints table

| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| spectrum_id | FOREIGN KEY |
| index       |             |
| num_counts  |             |
|-------------+-------------|
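In rough DDL terms, the plan above would look something like the sketch below (the concrete column types are placeholders I haven't settled on; only the columns listed in the tables are definite):

CREATE TABLE runs (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    start_time TIMESTAMP NOT NULL
) ENGINE=InnoDB;

CREATE TABLE spectra (
    id             INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    spectrum_type  INT NOT NULL,
    representation INT NOT NULL,
    run_id         INT UNSIGNED NOT NULL,
    FOREIGN KEY (run_id) REFERENCES runs (id)
) ENGINE=InnoDB;

CREATE TABLE datapoints (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    spectrum_id INT UNSIGNED NOT NULL,
    `index`     INT NOT NULL,            -- position of the point within its spectrum
    num_counts  DOUBLE NOT NULL,
    FOREIGN KEY (spectrum_id) REFERENCES spectra (id)
) ENGINE=InnoDB;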
Is this reasonable?
UPDATE #2: The update strikes back
So, as you may have been able to infer, I am the programmer, not the biologist
in the lab, so I don't know the science nearly as well as the actual scientists.
Here's a plot of a single spectrum (scan) of the kind of data with which I'll be working:
The goal of the software is to figure out where and how significant the peaks
are. We use a proprietary software package to figure this out now, but we want
to write our own analysis program (in R) so we know what the heck is going on
under the sheets. As you can see, the vast majority of the data are
uninteresting, but we don't want to throw out potentially-useful data which our
algorithm missed. Once we have a list of probable peaks with which we're
satisfied, the rest of the pipeline will use that peak list rather than the raw
list of datapoints. I suppose that it would be sufficient to store the raw
datapoints as a big blob, so they can be reanalyzed if need be, but keep only
the peaks as distinct database entries. In that case, there would be only a
couple dozen peaks per spectrum, so the crazy scaling stuff shouldn't be as much
of an issue.
I am not very familiar with your needs, but perhaps storing each data point in the database is a bit of overkill. It sounds almost like taking the approach of storing an image library by storing each pixel as a separate record in a relational database.
As a general rule, storing binary data in databases is wrong most of the time. There is usually a better way of solving the problem. While it is not inherently wrong to store binary data in relational database, often times the disadvantages outweigh the gains. Relational databases, as the name alludes to, are best suited for storing relational data. Binary data is not relational. It adds size (often significantly) to databases, can hurt performance, and may lead to questions about maintaining billion-record MySQL instances. The good news is that there are databases especially well suited for storing binary data. One of them, while not always readily apparent, is your file system! Simply come up with a directory and file naming structure for your binary files, store those in your MySQL DB together with any other data which may yield value through querying.
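As a sketch of that filesystem-plus-metadata approach (the table and column names here are invented for illustration, not taken from your schema):

-- keep only paths and metadata in MySQL; leave the bulky arrays on the filesystem
CREATE TABLE scan_files (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    run_id    INT UNSIGNED NOT NULL,
    file_path VARCHAR(500) NOT NULL,    -- e.g. /data/runs/run_0042/scan_0001.bin (illustrative)
    byte_size BIGINT UNSIGNED NOT NULL,
    sha1      CHAR(40) NOT NULL         -- checksum so rows and files can be kept in sync
) ENGINE=InnoDB;

-- queries then stay on this small metadata table
SELECT file_path FROM scan_files WHERE run_id = 42;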
Another approach would be using a document-based storage system for your datapoints (and perhaps spectra) data, and using MySQL for the runs (or perhaps putting the runs into the same DB as the others).
I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows. This was using MySQL 5.0, so it's possible that things may have improved.
It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though.
Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.
We had numerous tables in the 10-100 million row range. Any significant joins to the tables were too time consuming and would take forever. So we wrote stored procedures to 'walk' the tables and process joins against ranges of 'id's. In this way we'd process the data 10-100,000 rows at a time (Join against id's 1-100,000 then 100,001-200,000, etc). This was significantly faster than joining against the entire table.
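A minimal sketch of that "walking" pattern (the table, column, and procedure names here are made up for illustration, and results_tmp is assumed to exist):

DELIMITER //
CREATE PROCEDURE walk_join()
BEGIN
    DECLARE lo BIGINT DEFAULT 1;
    WHILE lo <= 200000000 DO
        -- process one 100,000-id slice per pass instead of joining the whole tables
        INSERT INTO results_tmp (a_id, b_value)
        SELECT a.id, b.value
        FROM big_a AS a
        JOIN big_b AS b ON b.a_id = a.id
        WHERE a.id BETWEEN lo AND lo + 99999;
        SET lo = lo + 100000;
    END WHILE;
END //
DELIMITER ;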
Using indexes on very large tables that aren't based on the primary key is also much more difficult. MySQL stores indexes in two pieces -- it stores indexes (other than the primary index) as indexes to the primary key values. So indexed lookups are done in two parts: First MySQL goes to an index and pulls from it the primary key values that it needs to find, then it does a second lookup on the primary key index to find where those values are.
The net of this is that for very large tables (1-200 Million plus rows) indexing against tables is more restrictive. You need fewer, simpler indexes. And doing even simple select statements that are not directly on an index may never come back. Where clauses must hit indexes or forget about it.
But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.
Trying to do analysis on 200 billion rows of data would require very high-end hardware and a lot of hand-holding and patience. Just keeping the data backed up in a format that you could restore from would be a significant job.
** edit **
I agree with @srini.venigalla that normalizing the data like crazy may not be a good idea here. Doing joins across multiple tables with that much data will open you up to the risk of file sorts, which could mean some of your queries would just never come back. Denormalizing with simple integer keys would give you a better chance of success.
normalizing the data like crazy
Normalizing the data like crazy may not be the right strategy in this case. Keep your options open by storing the data both in normalized form and in the form of materialized views highly suited to your application. The key in this type of application is NOT writing ad hoc queries. Query modeling is more important than data modeling. Start with your target queries and work towards the optimum data model.
Is this reasonable?
I would also create an additional flat table with all data.
| run_id | spectrum_id | data_id | <data table columns..> |
I will use this table as the primary source of all queries. The reason is to avoid having to do any joins. Joins without indexing will make your system very unusable, and having indexes on such huge files will be equally terrible.
The strategy is: query the above table first, dump the results into a temp table, and join the temp table with the lookup tables of Run and Spectrum to get the data you want.
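A minimal sketch of that strategy (the flat table name and the filter condition are illustrative only):

-- filter the wide table once, keep only the small result
CREATE TEMPORARY TABLE hits AS
SELECT run_id, spectrum_id, data_id
FROM flat_data                 -- the flat table sketched above (name is illustrative)
WHERE num_counts > 1000;       -- whatever the analysis filter happens to be

-- then join the small temp table to the lookup tables
SELECT h.*, r.start_time, s.spectrum_type
FROM hits AS h
JOIN runs    AS r ON r.id = h.run_id
JOIN spectra AS s ON s.id = h.spectrum_id;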
Have you analyzed your Write needs vs Read needs? It will be very tempting to ditch SQL and go to non-standard data storage mechanisms. In my view, it should be the last resort.
To accelerate the write speeds, you may want to try the Handler Socket method. Percona, if I remember, packages Handler Socket in their install package. (no relation to Percona!)
The short answer is a qualified yes -- as the number of rows grows the precise schema, datatypes and operations you choose grows in importance.
How much you normalize your data depends on the operations you plan to perform on the stored data. Your 'datapoints' table in particular seems problematic -- are you planning on comparing the nth point from any given spectrum with the mth of any other? If not, storing them separately could be a mistake. If your datapoints do not stand alone but make sense only in the context of their associated spectra, you don't need a PRIMARY KEY -- a foreign key to the spectra and an 'nth' column (your 'index' column?) will suffice.
Define the inter- and intra-spectrum operations you must perform and then figure out the cheapest way to accomplish them. If equality is all that's needed they may be denormalized -- possibly with some pre-calculated statistical metadata that assist your operations. If you do absolutely need in-SQL access to individual datapoints ensure you reduce the size of each row to the bare minimum number of fields and the smallest datatype possible.
The largest MySQL database I've ever personally managed was ~100 million rows. At this size you want to keep your rows and thus your fields fixed-size -- this allows MySQL to efficiently calculate the position of any row in the table by multiplying by the fixed size of each row (think pointer arithmetic) -- though the exact details depend on which storage engine you plan on using. Use MyISAM if you can get away with it, what it lacks in reliability it makes up for in speed, and in your situation it should suffice. Replace variable-size fields such as VARCHAR with CHAR(n) and use RTRIM() on your read queries.
Once your table rows are fixed-width you can reduce the number of bytes by carefully evaluating MySQL's integer datatypes (some of which are non-standard). Every 1-byte saving you can eke out by converting a 4-byte INT into a 3-byte MEDIUMINT saves you ~1MB per million rows -- meaning less disk I/O and more effective caching. Use the smallest datatypes that you can get away with. Carefully evaluate the floating point types and see if you can replace 8-byte DOUBLEs with 4-byte FLOATs or even fixed-point types of fewer than 8 bytes. Run tests to ensure that whatever you pick doesn't bite you later.
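For example, a narrowing pass against the datapoints table from the question might look like the sketch below (if spectrum_id is declared as a foreign key, the referenced spectra.id has to be narrowed to the same type as well; whether FLOAT precision is acceptable is for you to verify):

ALTER TABLE datapoints
    MODIFY spectrum_id MEDIUMINT UNSIGNED NOT NULL,  -- 3 bytes, enough for ~16.7 million spectra
    MODIFY num_counts  FLOAT NOT NULL;               -- 4 bytes instead of 8, if the precision loss is acceptable
-- roughly 5 bytes saved per row = ~5 MB per million rows, on the order of 1 TB across 200 billion rows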
Depending on the expected properties of your dataset and the operations required there may be further savings in more unusual encodings of your values (expected patterns/repetitions that can be encoded as an index into a set of values, raw data that may only meaningfully contribute to metadata and be discarded, etc) -- though exotic, unintuitive, destructive optimizations are only worthwhile when every other option has been tried.
Most importantly, no matter what you end up doing, do not assume you have picked the perfect schema and then blindly begin dumping tens of millions of records in. Good designs take time to evolve. Create a large but manageable (say, 1-5%) set of test data and verify the correctness and performance of your schema. See how different operations perform (/doc/refman/5.0/en/using-explain.html) and ensure that you balance your schema to favor the most frequent operations.
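For instance, a quick check against the sample data (the query shape here is just illustrative):

-- run the intended query shapes against the 1-5% sample and read the plan
EXPLAIN
SELECT s.id, d.num_counts
FROM spectra AS s
JOIN datapoints AS d ON d.spectrum_id = s.id
WHERE s.run_id = 42;
-- 'type: ALL' on a big table means a full scan; aim for ref/range access on every join step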
Did I say short? Whoops. Anyways, good luck!
Can you clarify your model - are there multiple spectra per run?
How many data points per spectra?
It would seem that the only reason to shred the data point data out of the XML (as opposed to the metadata like the time and type of run) and into a database form is when you are analyzing the spectra across arrays - i.e. perhaps finding all runs with a certain signature.
Only you know your problem domain right now, but this could be akin to storing music sampled at 96kHz with 1 sample per row. I'm not sure size is the issue more than how the data is used.
Querying across the data would be equivalent to asking the relative amplitude 2 minutes into the song across all songs by The Beatles.
If you know the kind of analyses which might be performed, it's quite possible that performing these on the signals and storing those in the metadata about the run might make more sense.
I'm also not sure if your source data is sparse.
It's completely possible that a spectrum in the database should only include non-zero entries while the original XML does include zero-entries, and so your total number of rows could be much less than in the source data.
So, like many questions, before asking about MySQL handling your model, stepping back and looking at the model and how it is going to be used is probably more appropriate than worrying about performance just yet.
Whether or not it works, you're always going to run into the same problem with a single monolithic storage medium: disks are slow. At 100 MB/s (pretty good for spinning media) it takes 3 hours just to read a 1TB table, and that's assuming no analysis, seeking, or other delays slow you down.
This is why very nearly every "big data" installation uses some sort of distributed data store. You can spend 8 times as much money building one super amazing computer to run your DB, but if you have a lot of data that can be scanned in parallel, you're almost always better off distributing the load across the 8 cheaper computers.
Projects like Hadoop were built specifically for purposes like this. You build a cluster of a whole bunch of inexpensive computers, distribute the data across all of them, and query them in parallel. It's just one of half a dozen solutions all built around this same idea, but it's a very popular one.
I run a web analytics service with about 50 database servers, each one containing many tables over 100 million rows, and several that tend to be over a billion rows, sometimes up to two billion (on each server).
The performance here is fine. It is very normalized data. However - my main concern with reading this is that you'll be well over the 4.2 billion row mark for these tables (maybe not "runs" but probably the other two), which means you'll need to use BIGINT instead of INT for the primary/foreign keys.
MySQL performance with BIGINT fields in an indexed column is ridiculously horrible compared to INT. I made the mistake of doing this once with a table I thought might grow over this size, and once it hit a few hundred million rows the performance was simply abysmal. I don't have raw numbers but when I say bad, I mean Windows ME bad.
This column was the primary key. We converted it back to be just an INT and presto magico, the performance was good again.
All of our servers at the time were on Debian 5 and with MySQL 5.0. We have since upgraded to Debian 6 and Percona MySQL 5.5, so things may have improved since then. But based on my experience here, no, I don't think it will work very well.
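In SQL terms, the conversion back was essentially the following (the table and column names here are stand-ins, not our real schema):

ALTER TABLE hits MODIFY id INT UNSIGNED NOT NULL AUTO_INCREMENT;
-- INT UNSIGNED tops out at 4,294,967,295 ids; past that, the 8-byte BIGINT is unavoidable,
-- which is why your datapoints table is the one to worry about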
Hm... I see only two reasons why you would choose this kind of data structure:
you really need to do any datapoint vs any datapoint queries
you intend to perform all your logic in SQL
Now, I would suggest taking a long hard look into your requirements and verify that at least one of the above assumptions is true. If neither are true, you are just making things slower. For this kind of dataset, I would suggest first finding out how the data is expected to be accessed, what kind of accuracy you will need, etc - and then design your database around those.
P.S.: Keep in mind that you will need at least 36+5 bytes per data point, so with 200B datapoints that should give you at least 8.2 TB required space.
P.P.S.: You don't need the id column in the datapoints table, a PRIMARY KEY (spectrum_id, index) probably suffices (just beware that index may be a reserved word)
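Concretely, that would look something like this (the types are placeholders; the backticks are needed because INDEX is a reserved word):

CREATE TABLE datapoints (
    spectrum_id INT UNSIGNED NOT NULL,
    `index`     INT UNSIGNED NOT NULL,   -- position of the point within its spectrum
    num_counts  DOUBLE NOT NULL,
    PRIMARY KEY (spectrum_id, `index`)
) ENGINE=InnoDB;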
Tassos Bassoukos
DO NOT DO THIS IN MYSQL WITH DATA STORED ON A SINGLE DISK. Just reading that amount of data from a single medium will take hours. You need to SCALE OUT, NOT UP.
And you need to denormalize your data if you want to do effective data analysis. You are not designing an online system here. You want to crunch numbers, so design accordingly.
Original answer below line.
The answer will vary depending on your queries; MySQL may not be the best tool for this job. You may want to look at solutions you can scale "out" and not "up". If you are willing to put in some effort, maybe you should look at a MapReduce solution such as Hadoop.
If you want to do more ad-hoc queries, Google's BigQuery solution may be a good fit for you. Relevant presentation from Google I/O 2012:
So, the solution will depend on if this is a one-shot thing and if you want to reasonably support ad hoc queries.
What kind of machine is the data going to be stored on? Is it a shared storage devices?
The ultimate factor that will dictate your query time is going to be your harddrives. Databases and their query optimizers are designed to reduce the number of disk I/Os as much as possible. Given that you only have 3 tables, this will be done pretty reliably.
A harddrive's read/write speeds are going to be 200-300 times slower than memory speeds. Look for harddrives with very low latency and fast read and write speeds. If all this data is on one 2-TB drive, you're probably going to be waiting a long, long time for queries to finish. Harddrive latency is ~10-15 milliseconds while memory latency is less than 10 nanoseconds, so harddrive latency can be on the order of a million times slower than memory latency. The movement of the mechanical arm on the harddrive is the SLOWEST thing in this entire system.
How much RAM do you have? 16GB? Let's say that lets you hold 32 records. You have 16000 files. If you're going to linearly scan all the datapoints, you could easily end up with 5-10 seconds in seek time alone. Then factor in the transfer rate, 50MB/s? About 7 hours. Additionally, any temporarily saved data will have to be stored on the harddrive to make room for new data being read.
If you're using a shared storage device that's being actively used by other users... your best bet is going to be to run everything at night.
Reducing the number of nested queries helps as well. Nested queries result in temporary tables which will thrash your harddrive even more. I hope you have PLENTY of free space on your harddrive.
The query optimizer can only look at one query at a time, so nested select statements can't be optimized. HOWEVER, if you know a specific nested query is going to result in a small dataset being returned, keep it. The query optimizer uses histograms and rough assumptions; if you know something about the data and the query then go ahead and do it.
The more you know about the way your data is stored on disk, the faster you'll be able to write your queries. If everything was stored sequentially on the primary key, it may be beneficial to sort the primary keys returned from a nested query. Also, if you can reduce the set of datasets you need to analyze beforehand, do it. Depending on your system, you're looking at around 1 second of data transfer per file.
If you're going to modify the Name values (the varchars), I would change them to a datatype with a maximum size; it'll prevent fragmentation and the trade-off is just a few more bytes of memory. Maybe an NVARCHAR with 100 maximum.
As for the comments about denormalizing the table: I think it may be best to just store the datapoints in larger groups (maybe as spectra) and then do the data analysis in Python or a language that interacts with the database. Unless you're a SQL wizard.
To me it sounds like a usage scenario where you want something like a "relational column store" .
I may be misunderstanding the design, but if you are primarily dealing with a large collection of arrays, storing them in typical row-oriented tables means that each element is similar to a slice. If you are interested in looking at slices in a typical manner, that makes sense, but it could be less efficient if you are really looking at entire columns at a time.
When retrieving the arrays, not only might you not need to join it with another table resulting from your normalization, but you can retrieve the series as an array rather than a hash.
I really may be misunderstanding the problem, and I'm not even suggesting a specific solution, just pointing at something that may be relevant, even if it isn't really a current or deployable solution.
No one has mentioned sharding yet, thus my suggestion. Take a look at massively sharded MySQL solutions. For example, see this highly regarded presentation.
The concept is:
Instead of one extra large database
Use many small ones holding parts of the original data
Thus you can scale horizontally, instead of trying to improve vertical performance. Google's systems also use cheap, horizontally scalable nodes to store and query petabytes of data.
However, there will be troubles if you need to run queries over different shards.
If anyone is interested, I made a hello-world sharding application a while ago. It is discussed in a blog post. I used RavenDB and C# but the details are irrelevant and the idea is the same.
I'd recommend you try and partition your table. We have over 80 mil rows in a single table (stock market data) and have no trouble accessing it quickly.
Depending on how you intend to search your data, you should design your partitions. In our case, partitioning by date works well because we query for specific dates.
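A sketch of that kind of date-based range partitioning (the table, columns, and boundaries are illustrative, not our actual schema):

CREATE TABLE quotes (
    trade_date DATE NOT NULL,
    symbol     CHAR(8) NOT NULL,
    price      DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (symbol, trade_date)    -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(trade_date)) (
    PARTITION p2011 VALUES LESS THAN (TO_DAYS('2012-01-01')),
    PARTITION p2012 VALUES LESS THAN (TO_DAYS('2013-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- a query with WHERE trade_date BETWEEN '2012-03-01' AND '2012-03-31' then only touches partition p2012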
Yes, but...
I've worked with tables which had 2 billion rows. However only the queries using PK were expected to be fast.
Most importantly, the hardware had enough RAM to fit whole tables in memory. When that became an issue (maxed at 96GB at that time), went for vertical partitioning, keeping size of table set on each machine small enough to still fit in memory. Also, the machines were connected via 10Gb fiber, so network throughput wasn't that much of an issue.
BTW, your schema looks like something which could fit a NoSQL solution, using run_id as the hashing key for spectra and spectrum_id as the hashing key for data points.
I've written about this topic on my blog:
To repeat some of the key points:
B-trees degrade as they get larger and do not fit into memory (MySQL is not alone here).
InnoDB does have some features to help sustain some performance (change buffering, previously called the 'insert buffer').
Partitioning can also help.
In the comments of my post, Tim Callaghan linked to this, which shows inserting 1 billion rows using the iibench benchmark.
